Dr. Ulrich Schimmack’s Blog about Replicability

"For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication" (Cohen, 1994).

DEFINITION OF REPLICABILITY

In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the original study with the same sample size and significance criterion (Schimmack, 2017).

BLOGS BY YEAR:  2019, 2018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2020): Z-Curve.2.0 (with R-package) 

 

TOP TEN BLOGS


  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. Also includes a detailed analysis of time trends in replicability from 2010 to 2018.

2.  Introduction to Z-Curve with R-Code

This post presents the first replicability ranking and explains the methodology used to estimate the typical power of a significant result published in a journal. It explains how observed power is estimated from the distribution of test statistics converted into absolute z-scores. The method has since been developed further to estimate power for a wider range of z-scores with a model that allows for heterogeneity in power across tests. A description of the new method will be published when extensive simulation studies are completed.
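As a rough illustration of the first step (not the full z-curve model), test results can be converted into absolute z-scores and the observed power implied by each z-score can be computed in a few lines of R; the p-values below are hypothetical.

# convert two-tailed p-values into absolute z-scores and compute the
# observed power implied by each z-score (significance criterion z = 1.96)
p = c(.001, .01, .03, .049)        # hypothetical p-values of published results
z = qnorm(1 - p/2)                 # absolute z-scores
obs.power = 1 - pnorm(1.96 - z)    # power implied by each z-score for the .05 (two-tailed) criterion
round(cbind(p, z, obs.power), 2)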


3. An Introduction to the R-Index

 

The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
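The computation can be sketched in a few lines of R, using hypothetical z-scores for a set of studies; following the description above, the inflation is estimated as the difference between the success rate (proportion of significant results) and the median observed power. This is a sketch of the formula, not the full R-Index software.

# sketch of the R-Index computation for a set of test statistics (z-scores)
z = c(2.10, 2.30, 1.70, 2.50, 2.05)      # hypothetical z-scores
obs.power = 1 - pnorm(1.96 - z)          # observed power of each result
success.rate = mean(z > 1.96)            # proportion of significant results
inflation = success.rate - median(obs.power)
r.index = median(obs.power) - inflation
r.index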


4.  The Test of Insufficient Variance (TIVA)

 

The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, the z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
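The test logic can be sketched in a few lines of R; the z-scores are hypothetical, and the chi-square test uses the fact that the sample variance of k independent standard-normal z-scores, multiplied by k - 1, follows a chi-square distribution with k - 1 degrees of freedom.

# sketch of TIVA: compare the variance of k z-scores to the expected variance of 1
z = c(2.2, 2.0, 2.4, 2.1, 2.3)     # hypothetical z-scores of significant results
k = length(z)
v = var(z)                         # observed variance (here well below 1)
chi2 = v * (k - 1)                 # test statistic with df = k - 1
p = pchisq(chi2, df = k - 1)       # left-tailed p-value; small values indicate insufficient variance
c(variance = v, p.value = p)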

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.” The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words. Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been made to suggest that NHST is a flawed statistical method. Others argue that the method is fine but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors), using a picture from the free software GPower.


8.  The Problem with Bayesian Null-Hypothesis Testing

 

Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
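A toy example illustrates how strongly a Bayes-Factor depends on the specification of the alternative hypothesis. The sketch below uses a normal approximation with a normal prior on the effect size rather than the exact Bayes-Factors discussed above, and the observed effect size and standard error are hypothetical.

# sketch: Bayes-Factor for H0 (d = 0) against H1 (d ~ Normal(0, tau)),
# using a normal approximation to the likelihood of the observed effect size
d.obs = .2      # hypothetical observed effect size (small effect)
se = .1         # hypothetical standard error of the effect size estimate
bf01 = function(tau) dnorm(d.obs, 0, se) / dnorm(d.obs, 0, sqrt(se^2 + tau^2))
bf01(tau = 1)   # wide alternative that expects large effects: favors the null
bf01(tau = .2)  # alternative concentrated on small effects: favors the alternative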

9. Hidden figures: Replication failures in the stereotype threat literature. A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published. Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability. In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

US police are 12 times more likely to draw a gun in encounters with unarmed Black versus White civilians

It is a well-known fact among criminologists and other social scientists that Black US citizens are killed by police in disproportionate numbers. That is, relative to their percentage in the US population, Black civilians are killed 2 to 3 times more often than White civilians. This is about the only solid fact in the social sciences on racial disparities in lethal use of force.

Researchers vehemently disagree about the causes of this disparity. Some suggest that it is at least partially caused by racial biases in policing and in the decision to use lethal force. Others argue that it is explained by the fact that police are more likely to use lethal force with violent criminals and that Black citizens are more likely to be violent criminals. Some of this disagreement can be explained by different ways of looking at the same statistics and by confusion about the meaning of the results.

A study by Wheeler, Philips, Worrall, and Bishopp (2017) illustrates the problem of poor communication of results. The authors report the results of an important study that provides much-needed information about the frequency of use of force. The researchers had access to nearly 2,000 incidents in which an officer from the Dallas Police Department drew a weapon. In about 10% of these incidents, officers fired at least one shot (207 out of 1,909). They also had information about the ethnicity of the civilian involved. Their abstract states a clear conclusion: “African Americans are less likely than Whites to be shot” (p. 49). The discussion section elaborates on this main finding: “Contrary to the national implicit bias narrative, our analysis found that African Americans were less likely to be shot than White subjects” (p. 65).

The next paragraph highlights that the issue is more complex.

It cannot be overemphasized that the addition of don’t shoot control cases to police shooting cases dramatically alters the findings. With a simple census comparison (see Results and Discussion and Conclusion), African Americans were overrepresented in the shootings compared to Whites and Latinos. Similarly, when only examining shooting incidents (see first column of Table 4 and accompanying narrative), of those shot, African Americans had a higher probability of being unarmed compared to White suspects. However, by incorporating control cases in which officers did not shoot, we reached completely opposite inferences, namely, that African Americans have a lower probability of being shot relative to Whites.

This paragraph is followed by a reaffirmation that “neither analysis hints at racial bias against African Americans” (p. 66).

The authors then point out that their conclusion in the abstract is limited to a very narrow definition of racial bias.

“As previously mentioned, an important limitation of the study is the fact that such an analysis is only relevant to officer decision-making after they have drawn their firearm” (p. 66).

This restrictive definition of racial bias explains why the authors’ main conclusion results in a paradox. On the one hand, there is strong and clear evidence that more Black US citizens die at the hands of police than White US citizens. On the other hand, the authors claim that there is no racial bias against Black US citizens in the decision to shoot. This leaves open the question of how there can be racial disparity in deaths without racial bias in shots fired. The answer is simple: officers are much more likely to draw a weapon in encounters with Black civilians. This information is provided in Table 2.

Officers drew a weapon in 1082 (57%) encounters with Black civilians compared to 273 (14%) encounters with White civilians, a disparity of 4:1. The abstract ignores this fact and focuses on the conditional probability that shots are fired when a gun is drawn (9% vs. 12%), a disparity of 1:1.3. Given the much larger racial disparity in decisions to draw a gun than in decisions to shoot once a gun is drawn, Black civilians are actually shot disproportionately more often than White civilians (100 vs. 34), a ratio of 3:1. This is consistent with national statistics that show a 2-3:1 racial disparity in lethal use of force.

It is absolutely misleading to conclude from these data that there is no racial bias in policing or the use of force and to suggest that these results are inconsistent with the idea that racial biases in policing lead to a disproportionate number of unarmed Black civilians being killed by police.

Even more relevant information is contained in Table 4, which shows incidents in which the civilian was unarmed.

Officers drew their weapons on unarmed Black civilians in 239 incidents compared to 38 incidents with unarmed White civilians, a disparity of 6:1. As a result, unarmed Black civilians are also much more likely to be shot by police than unarmed White civilians (22 vs. 5), a 4:1 ratio.

To fully understand the extent of racial disparities in the drawing of a weapon and in shots being fired, it is necessary to take the ethnic composition of Dallas into account. Wikipedia suggests that the ratio of White to Black residents is about 2:1 (50% White, 25% Black). Thus, the racial disparity in police officers drawing a gun on an unarmed civilian is about 12:1, and the racial disparity in shooting an unarmed civilian is about 8:1.
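The arithmetic behind these ratios can be reproduced in a few lines of R, using the counts from the tables discussed above and the approximate population shares from Wikipedia.

# counts reported above (Wheeler et al., Tables 2 and 4) and approximate population shares
drawn = c(black = 1082, white = 273)           # weapon drawn, all encounters
drawn.unarmed = c(black = 239, white = 38)     # weapon drawn, civilian unarmed
shot.unarmed = c(black = 22, white = 5)        # shots fired, civilian unarmed
pop = c(black = .25, white = .50)              # approximate population shares in Dallas
drawn["black"] / drawn["white"]                                                    # about 4:1
(drawn.unarmed["black"] / drawn.unarmed["white"]) * (pop["white"] / pop["black"])  # about 12:1
(shot.unarmed["black"] / shot.unarmed["white"]) * (pop["white"] / pop["black"])    # about 8:1 to 9:1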

In conclusion, Wheeler et al.’s analyses suggest that racial bias in decisions to shoot when a gun is drawn is unlikely to explain racial disparities in lethal use of force. However, their data also suggest that racial biases in the decision to draw a gun on Black civilians may very well contribute to the disproportionate killing of unarmed civilians. The authors’ own bias is revealed by their emphasis on the decision to shoot after drawing a gun, while they ignore the large racial disparity in the decision to draw a gun in the first place.

Political bias in the social sciences is a major problem. In an increasingly polarized political world, especially in the United States, scientists should try to unite the country by creating a body of solid empirical facts that only fringe extremists and the willfully ignorant continue to ignore. Abuse of science to produce misleading false claims only fuels the division and gives extremists false facts to cement their ideology. It is time to take a look at systemic racism in criminology to ensure credibility, especially with Black civilians. Distrust in institutions like the police or science will only fuel the division and endanger lives of all colors. It is therefore extremely unfortunate that Wheeler et al. explicitly use their article to discredit valid concerns by the BlackLivesMatter movement about racial disparities in policing.

The tragic and avoidable death of Atatiana Jefferson in neighboring Fort Worth is only one example that shows how Wheeler et al.’s conclusions disregard evidence of racial bias in lethal use of force. A young, poorly trained officer drew a gun with deadly consequences. Called for a wellness check (!!!) in a Black neighborhood, the White officer entered the dark backyard unannounced. The female victim heard some noise in the backyard, got her legal gun, and went to the window to examine the situation. Spooked, the police officer fired at the window and killed the homeowner. In Wheeler’s statistics, this incident would be coded as a decision to shoot after drawing a gun on an armed Black civilian. The real question is why the officer was searching a dark backyard with his gun drawn in the first place.

https://www.texastribune.org/2019/10/13/fort-worth-police-officer-shoots-and-kills-black-woman-home/

These all-too-common incidents are not only tragic for the victims and their relatives. They are also likely to have dramatic consequences for police officers. In this case, the officer was indicted for murder.

https://abcnews.go.com/US/officer-fatally-shot-atatiana-jefferson-home-indicted-murder/story?id=67858348

The goal of social science should be to analyze the causes of deadly encounters between police and civilians, with officers or civilians as victims, to create interventions that reduce the 1,000 deaths a year in these encounters. Wheeler et al.’s (2017) data and tables provide a valuable piece of information. Their conclusions do not. Future research should focus on the factors that determine the drawing of a weapon, especially when civilians are unarmed. All too often, these incidents end with a dead Black body on the pavement.

Kahneman talks to Mischel about Traits and Self-Control

I found this video on YouTube (Christan G.) with little information about the source of the discussion. I think it is a valuable historic document and I am reposting it here because I am afraid that it may be deleted from YouTube and be lost.

Highlights

Kahneman: “We are all Mischelians.”

Kahneman: “You [Mischel] showed convincingly that traits do not exist, but you also provided the most convincing evidence for stable traits [children who delay eating a marshmallow become good students who do not drink and smoke].”

Here is Mischel’s answer to a question I always wanted him to answer. In short, self-control is not a trait. It is a skill.

The Dunning-Kruger Effect Explained

“These responses to our work have also furnished us moments of delicious irony, in that each critique makes the basic claim that our account of the data displays an incompetence that we somehow were ignorant of.” (Dunning, 2011, p. 247).

In 1999, Kruger and Dunning published an influential article. With 2,258 citations in Web of Science, it ranks #28 in citations among articles in the Journal of Personality and Social Psychology. The main contributions of the article were (a) to demonstrate that overestimation of performance is not equally distributed across different levels of performance and (b) to provide a theory that explains why low performers are especially prone to overestimate their performance. The finding that low performers overestimate their performance, while high performers are more accurate or even underestimate their performance, has been dubbed the Dunning-Kruger effect (DKE). It is one of the few effects in social psychology that is named in honor of the researchers who discovered it.

The effect is robust and has been replicated in hundreds of studies (Khalid, 2016; Pennycook et al., 2017). Interestingly, it can even be observed with judgments about physical attributes like attractiveness (Greitemeyer, 2020).

While there is general consensus that the DKE is a robust phenomenon, researchers disagree about the explanation for the DKE. Kruger and Dunning (1999) proposed a meta-cognitive theory. Accordingly, individuals have no introspective access to their shortcomings. For example, a student who picked one answer from a set of options in a multiple-choice test thinks that they picked the most reasonable option. After all, they would have picked a different option if they had considered another option as more reasonable. Students only become aware that they picked the wrong option when they are given feedback about their performance. As a result, they are overly confident that they picked the right answer before they are given feedback. This mistake will occur more frequently for low performers than for high performers. It cannot occur for students who ace their exam (i.e., get all answers correct). The only mistake top performers could make is to doubt their right answers and underestimate their performance. Thus, lack of insight into mistakes coupled with a high frequency of mistakes leads to higher overconfidence among low performers.

In contrast, critics of the meta-cognitive theory have argued that the DKE is a statistical necessity. As long as individuals are not perfectly aware of their actual performance, low performers are bound to overestimate their performance and high performers are bound to underestimate their performance (Ackerman et al., 2002; Gignac & Zajenkowski, 2020; Krueger & Mueller, 2002). This account has been called regression to the mean. Unfortunately, this label has produced a lot of confusion because the statistical phenomenon of regression to the mean is poorly understood by many psychologists.

Misunderstanding of Regression to the Mean

Wikipedia explains that “in statistics, regression toward the mean (or regression to the mean) is the phenomenon that arises if a sample point of a random variable is extreme (nearly an outlier), a future point will be closer to the mean or average on further measurements.”

Wikipedia also provides a familiar example.

Consider a simple example: a class of students takes a 100-item true/false test on a subject. Suppose that all students choose randomly on all questions. Then, each student’s score would be a realization of one of a set of independent and identically distributed random variables, with an expected mean of 50. Naturally, some students will score substantially above 50 and some substantially below 50 just by chance. If one selects only the top scoring 10% of the students and gives them a second test on which they again choose randomly on all items, the mean score would again be expected to be close to 50. Thus the mean of these students would “regress” all the way back to the mean of all students who took the original test. No matter what a student scores on the original test, the best prediction of their score on the second test is 50.

This is probably the context that most psychologists have in mind when they think about regression to the mean. The same measurement procedure is repeated twice. In this scenario, students who performed lower the first time are likely to increase their performance the second time and students who performed well the first time are bound to decrease in their performance the second time. How much students regress towards the mean depends on the influence of their actual abilities on performance on the two tests. The more strongly the two tests are correlated, the less regression to the mean occurs. In the extreme case, where performance is fully determined by ability, the retest correlation is 1 and there is no regression to the mean because there are no residuals (i.e., deviations of individuals between their two performances).
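A few lines of R can demonstrate this: simulate two administrations of a test with a given retest correlation and look at what happens to the top scorers on the second test. The retest correlation of .5 is an arbitrary illustration.

# regression to the mean with repeated testing
set.seed(1)
n = 10000
r = .5                                        # retest correlation (illustrative value)
test1 = rnorm(n)
test2 = r*test1 + sqrt(1 - r^2)*rnorm(n)
top = test1 > quantile(test1, .9)             # top 10% on the first test
c(mean.test1 = mean(test1[top]), mean.test2 = mean(test2[top]))
# the second-test mean of the top scorers shrinks toward 0 by a factor of about r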

The focus on this specific example of repeated measurements has created a lot of confusion in the DKE literature. It probably started with Krueger and Mueller’s critique. First, they emphasize statistical regression and even provide a formula showing that the correlation between actual performance and the discrepancy between estimated and actual performance is bound to be negative. It follows that low performers are bound to show larger positive deviations (overestimation). However, they then proceed to discuss the reliability of the performance measures.

Thus far, we have assumed that actual percentiles are perfectly reliable measures of ability. As in any psychometric test, however, the present test scores involved both true variance and error variance (Feldt & Brennan, 1989). With repeated testing, high and low test scores regress toward the group average, and the magnitude of these regression effects is proportional to the size of the error variance and the extremity of the initial score (Campbell & Kenny, 1999). In the Kruger and Dunning (1999) paradigm, unreliable actual percentiles mean that the poorest performers are not as deficient as they seem and that the highest performers are not as able as they seem.

This passage implies that regression to the mean plays a different role in the DKE. Performance on any particular test is not only a function of ability, but also of (random) situational factors. This means that performance scores are biased estimates of ability. Low performers’ scores are more likely to be biased in a negative direction than high performers’ scores. If performance judgments are based on self-knowledge of ability, the comparison of judgments with performance scores is biased and may show an illusory DKE. To address this problem, Krueger and Mueller propose to estimate the reliability of the test scores and to correct for the bias introduced by unreliability.
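Their proposed correction is the standard correction for attenuation. With a correlation of .19 between estimated and actual performance (the value used in the simulations discussed below) and a hypothetical reliability of .80 for the performance measure, it looks like this.

# correction for attenuation: what the correlation between estimated and actual
# performance would be if the performance measure were perfectly reliable
r.obs = .19        # observed correlation between estimated and actual performance
rel = .80          # hypothetical reliability of the performance measure
r.obs / sqrt(rel)  # disattenuated correlation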

Over the following 18 years, it has been overlooked that Krueger and Mueller made two independent arguments against the meta-cognitive theory. One minor problem is unreliability in the performance measure as a measure of ability. The major problem is that the DKE is a statistical necessity that applies to difference scores.

Unreliability in the Performance Measure Does not Explain the DKE

Kruger and Dunning (2002) responded to Krueger and Mueller’s (2002) critique. Their response focused exclusively on the problem of unreliable performance measures.

“They found that correcting for test unreliability reduces or eliminates the apparent asymmetry in calibration between top and bottom performers” (p. 189).

They then conflate statistical regression and unreliability when they ask “Does regression explain the results?”

The central point of Krueger and Mueller’s (2002) critique is that a regression artifact, coupled with a general BTA effect, can explain the results of Kruger and Dunning (1999). As they noted, all psychometric tests involve error variance, thus “with repeated testing, high and low test scores regress toward the group average, and the magnitude of these regression effects is proportional to the size of the error variance and the extremity of the initial score” (Krueger & Mueller, 2002, p. 184). They go on to point out that “in the Kruger and Dunning (1999) paradigm, unreliable actual percentiles mean that the poorest performers are not as deficient as they seem and that the highest performers are not as able as they seem” (p. 184). Although we agree that test unreliability can contribute to the apparent miscalibration of top and bottom performers, it cannot fully explain this miscalibration. (p. 189)

This argument has convinced many researchers in this area that the key problem is unreliability in the performance measure and that this issue can be addressed empirically by controlling for unreliability. Doing so typically does not remove the DKE (Ehrlinger, Johnson, Banner, Dunning, & Kruger, 2008).

The problem is that unreliability in the performance measure is not the major concern. It is not even clear how it applies when participants are asked to estimate their performance on a specific test. A test is an unreliable measure of an unobserved construct like ability, but a student who got 60% of the multiple-choice questions correct got 60% of the questions correct. There is no unreliability in manifest scores.

The confusion between these two issues has led to the false impression that the regression explanation has been examined and empirically falsified as a sufficient explanation of the DKE. For example, in a review paper, Dunning (2011) wrote:

Fortunately, there are ways to estimate the degree of measurement unreliability and then correct for it. One can then assess what the relation is between perception and reality once unreliability in measuring actual performance has been eliminated. See Fig. 5.3, which displays students’ estimates of exam performance, in both percentile and raw terms, for a different college class (Ehrlinger et al., 2008, Study 1). As can be seen in the figure, correcting for measurement unreliability has only a negligible impact on the degree to which bottom performers overestimate their performance (see also Kruger & Dunning, 2002). The phenomenon remains largely intact. (p. 266).

The Dunning-Kruger Effect is a Statistical Necessity

Another article in 2002 also made the point that DKE is a statistical necessity, although the authors called it an artifact (Ackerman et al., 2002). The authors made their point with a simple simulation.

To understand whether this effect could be accounted for by regression to the mean, we simulated this analysis using two random variables (one representing objective knowledge and the other self-reported knowledge) and 500 observations (representing an N of 500). As in the Kruger and Dunning (1999) comparison, these random variables were correlated r=0.19. The observations were then divided into quartiles based on the simulated scores for the objective knowledge variable (n=125 observations per quartile). Simulated self-report and objective knowledge were then compared by quartile. As can be seen in Fig. 1, the plotting of simulated data for 500 subjects resulted in exactly the same phenomenon reported by Kruger and Dunning (1999)—an overestimation for those in the lowest quartile and an underestimation for those in the top quartile. Further analysis comparing the means of self-report and objective knowledge for each quartile revealed that the difference between the simulated self-reported (M=-0.21) and objective (M=-1.22) scores for the bottom quartile was significant t (124)= -10.09, P<0.001 (which would be ‘‘interpreted’’ as overestimation of performance). The difference between simulated self-reported (M=0.27) and objective (M=1.36) scores for the top quartile was also significant, t(124)=11.09, P<0.001, (‘‘interpreted’’ as underestimation by top performers). This illustration demonstrates the measurement problems associated with interpreting statistical significance when two variables are compared across groups selected for performance on one of the variables, and there is a low correlation between the two variables.

Unaware of Ackerman et al.’s (2002) article, Gignac & Zajenkowski (2020) used simulations to make the same point.

Here is an R-Script to perform the same simulation.

N = 500                                   # number of simulated individuals
accuracy = .6                             # correlation between actual and estimated performance
obj = rnorm(N)                            # objective (actual) performance
sub = obj*accuracy + rnorm(N)*sqrt(1-accuracy^2)   # subjective estimates of performance
summary(sub)
plot(obj,sub,xlim=c(-3,3),ylim=c(-3,3),
  xlab="Performance",ylab="Estimates"
)
abline(h = 0)
quarts = quantile(obj,c(0,.25,.5,.75,1))  # quartile boundaries of actual performance
abline(v = quarts,lty=2)
quarts
x = tapply(obj,cut(obj,quarts),mean)      # mean actual performance per quartile
y = tapply(sub,cut(obj,quarts),mean)      # mean estimated performance per quartile
x
y
par(new=TRUE)
plot(x,x,pch=19,cex=2,xlim=c(-3,3),ylim=c(-3,3),col="blue",
  xlab="Performance",ylab="Estimates"
)
par(new=TRUE)
plot(x,y,pch=15,cex=2,xlim=c(-3,3),ylim=c(-3,3),col="blue",
  xlab="Performance",ylab="Estimates"
)

It is reassuring that empirical studies mostly found support for a pattern that is predicted by a purely mathematical relationship. However, it is not clear that we need a term for it and naming it the Dunning-Kruger effect is misleading because Kruger and Dunning provided a psychological explanation for this statistically determined pattern.

Does the Simulation Provide Ironic Support for the Dunning-Kruger Effect?

Dunning (2011) observed that any valid criticism of the DKE would provide ironic support for the DKE. After all, the authors confidently proposed a false theory of the effect, ignorant of their own failure to realize that their graphs reveal a statistical relationship between any two imperfectly correlated variables rather than a profound insight into humans’ limited self-awareness.

I disagree. The difference is that students who have taken an exam but not yet received their results have no feedback or other valid information that might help them make more accurate judgments about their performance. It is a rather different situation when other researchers propose alternative explanations and these explanations are ignored. This is akin to students who come to complain about ambiguous exam questions that other students answered correctly in large numbers. Resistance to valid feedback is not the DKE.

As noted above, Kruger and Dunning (2002) responded to Krueger and Mueller’s criticism and it is possible that they misunderstood Krueger and Mueller’s critique because it did not clearly distinguish between the statistical regression explanation and the unreliability explanation for the effect. However, in 2015 Dunning does cite Ackerman et al.’s article, but claims that the regression explanation has been addressed by controlling for unreliability.

To be sure, these findings and our analysis of them are not without critics. Other researchers have asserted that the Dunning-Kruger pattern of self-error is mere statistical artifact. For example, some researchers have argued that the pattern is simply a regression-to-the-mean effect (Ackerman, Beier, & Bowen, 2002; Burson, Larrick, & Klayman, 2006; Krueger & Mueller, 2002). Simply because of measurement error, perceptions of performance will fail to correlate perfectly with actual performance. This dissociation due to measurement error will cause poor performers to overestimate their performance and top performers to underestimate theirs, the pattern found, for example, in Fig. 1. In response, we have conducted studies in which we estimate and correct for measurement error, asking what the perception/reality link would look like if we had perfectly reliable instruments assessing performance and perception. We find that such a procedure reduces our pattern of self-judgment errors only trivially (Ehrlinger et al., 2008; Kruger & Dunning, 2002). (p. 157)

Either Dunning cited, but did not read Ackerman et al.’s article, or he was unable to realize that statistical regression and unreliable measures are two distinct explanations for the DKE.

Does it Matter?

In 2011, Dunning alludes to the fact that there are two distinct regression effects that may explain the DKE.

There are actually two different versions of this “regression effect” account of our data. Some scholars observe that Fig. 5.2 looks like a regression effect, and then claim that this constitutes a complete explanation for the Dunning–Kruger phenomenon. What these critics miss, however, is that just dismissing the Dunning–Kruger effect as a regression effect is not so much explaining the phenomenon as it is merely relabeling it. What one has to do is to go further to elucidate why perception and reality of performance are associated so imperfectly. Why is the relation so regressive? What drives such a disconnect for top and bottom performers between what they think they have achieved and what they actually have? (p. 266)

Here Dunning seems to be aware that unreliability in the performance measure is not necessary for regression to the mean. His response to this criticism is less than satisfactory. The main point of the regression to the mean model is that low-performers are bound to overestimate their performance because they are low performers. No additional explanation is needed other than uncertainty about one’s actual performance. Most important, the regression model assumes that low-performers and high-performers are no different in their meta-cognitive abilities to guess their actual performance. The DKE emerges even if errors are simulated as random noise.

In contrast, Kruger and Dunning’s main claim is that low performers suffer from two shortcomings.

My colleagues and I have laid blame for this lack of self-insight among poor performers on a double-curse—their deficits in expertise cause them not only to make errors but also leave them unable to recognize the flaws in their reasoning. (Dunning, 2011, p. 265).

This review of the main arguments in this debate shows that the key criticism of Kruger and Dunning’s account of their findings has never been seriously addressed. As a result, hundreds of studies have been published as empirical support for an effect that follows from a statistical relationship between two imperfectly correlated variables.

None of this means that limited self-awareness is not a problem. The regression model still implies that low performers are bound to overestimate their performance and high performers are bound to underestimate their performance. The discrepancies between actual and estimated performance are real. They are just not necessarily due to differences in insight into one’s abilities. Although such differences may exist, it is difficult to test the influence of additional factors because regression to the mean alone will always produce the predicted pattern.

It is disconcerting that researchers have spent 20 years studying a statistical phenomenon as if it provided insights into humans’ ability to know themselves. The real question is not why low performers overestimate their performance more than others; this has to be the case. The real question is why individuals often try to avoid feedback that would provide them with more accurate knowledge of themselves. Of course, this question has been addressed in other lines of research on self-verification and positive illusions that rarely connect with the Dunning-Kruger literature. The reason may be that research on these topics is much more difficult and produces more inconsistent results than plotting aggregated difference scores for two variables.

Psychologists are not immune to the Dunning-Kruger Effect

Background

Bar-Anan and Vianello (2018) published a structural equation model in support of a dual-attitude model that postulates explicit and implicit attitudes towards racial groups, political parties, and the self. I used their data to argue against a dual-attitude model. Vianello and Bar-Anan (2020) wrote a commentary that challenged my conclusions. I was a reviewer of their commentary and pointed out several problems with their new model (Schimmack, 2020). They did not respond to my review, and their commentary was published without changes. I wrote a reply to their commentary. In the reply, I merely pointed to my criticism of their new model. Vianello and Bar-Anan wrote a review of my reply, in which they continued to claim that my model is wrong. I invited them to discuss the differences between our models, but they declined. In this blog post, I show that Vianello and Bar-Anan lack insight into the shortcomings of their model, which is consistent with the Dunning-Kruger effect: incompetent individuals lack insight into their own incompetence. On top of this, Vianello and Bar-Anan show willful ignorance by resisting arguments that undermine their motivated belief in dual-attitude models. As I show below, Vianello and Bar-Anan’s model has several unexplained results (e.g., negative loadings on method factors), worse fit than my model, and produces false evidence of incremental predictive validity for the implicit attitude factors.

Introduction

The skill set of psychology researchers is fairly limited. In some areas expertise is needed to create creative experimental setups. In other areas, some expertise in the use of measurement instruments (e.g., EEG) is required. However, for the most part, once data are collected, little expertise is needed. Data are analyzed with simple statistical tools like t-tests, ANOVAs, or multiple regression. These statistical methods are implemented in simple commands and no expertise is required to obtain results from statistics programs like SPSS or R.

Structural equation modeling is different because researchers have to specify a model that is fitted to the data. With complex data sets, the number of possible models that can be specified increases exponentially, and it is not possible to specify all models and simply pick the one with the best fit. Moreover, there will be many models with similar fit, and it requires expertise to pick plausible models. Unfortunately, psychologists receive little formal training in structural equation modeling because graduate training relies heavily on supervision rather than formal coursework. As most supervisors never received training in structural equation modeling, they cannot teach their graduate students how to perform these analyses. This means that expertise in structural equation modeling varies widely.

An inevitable consequence of wide variation in expertise is that individuals with low expertise have little insight into their limited abilities. This is known as the Dunning-Kruger effect that has been replicated in numerous studies. Even incentives to provide accurate performance estimates do not eliminate the overconfidence of individuals with low levels of expertise (Ehrlinger et al., 2008).

The Dunning-Kruger effect explains why Vianello and Bar-Anan’s (2020) response to my article presents another ill-fitting model that makes little theoretical sense. This overconfidence may also explain why they are unwilling to engage in a discussion of their model with me. They may not realize that my model is superior because they were unable to compare the models or to run more direct comparisons of the models. As their commentary is published in the influential journal Perspectives on Psychological Science, and as many readers lack the expertise to evaluate the merits of their criticism, it is necessary to explain clearly why their criticism of my models is invalid and why their new alternative model is flawed.

Reproducing Vianello and Bar-Anan’s Model

I learned the hard way that the best way to fit a structural equation model is to start with small models of parts of the data and then to add variables or other partial models to build a complex model. The reason is that bad fit in smaller models can be easily identified and lead to important model modifications, whereas bad fit in a complex model can have thousands of reasons that are difficult to diagnose. In this particular case, I saw no reason to even fit a single complex model for attitudes toward political parties, racial groups, and the self. Instead, I fitted separate models for each attitude domain. Vianello and Bar-Anan (2020) take issue with this decision.

As for estimating method variance across attitude domains, that is the very logic behind an MTMM design (Campbell & Fiske, 1959; Widaman, 1985): Method variance is shared across measures of different traits that use the same method (e.g., among indirect measures of automatic racial bias and political preferences). Trait variance is shared across measures of the same trait that use different methods (e.g., among direct and indirect measures of racial attitude). Separating the MTMM matrix into three separate submatrices (one for each trait), as Schimmack did in his article, misses a main advantage of an MTMM design.

This criticism is based on an outdated notion of validation by means of correlations in a multi-trait-multi-method (MTMM) matrix. In these MTMM tables, every trait is measured with every method. For example, the Big Five traits are measured with students’ self-ratings, mothers’ ratings, and fathers’ ratings (5 traits x 3 methods). This is not possible for validation studies of explicit and implicit measures because it is assumed that explicit measures measure explicit constructs and implicit measures measure implicit constructs. Thus, it is not possible to fully cross traits and methods. This problem is evident in all models by Bar-Anan and Vianello and myself. Bar-Anan and Vianello make the mistake of assuming that using implicit measures for several attitude domains solves this problem, but their assumption that correlations between implicit measures in one domain and implicit measures in another domain can solve this problem is wrong. In fact, it makes matters worse because they fail to properly model method variance within a single attitude domain.

To show this problem, I first construct measurement models for each attitude domain and then show that combining the well-fitting models of the three domains produces a better-fitting model than Vianello and Bar-Anan’s model.

Racial Bias

In their revised model, Vianello and Bar-Anan postulate three method factors: one for explicit measures, one for IAT-related measures, and one for the Affective Misattribution Paradigm (AMP) and the Evaluative Priming Task (EPT). It is not possible to estimate a separate method factor for all explicit measures, but it is possible to allow for method factors that are unique to the IAT-related measures and one that is unique to the AMP and EPT. In the first model, I fitted this specification to the measures of racial bias. The model appears to have good fit, RMSEA = .013, CFI = .973. In this model, the correlation between the explicit and implicit racial bias factors is r = .80.

However, it would be premature to stop the analysis here because overall fit values in models with many missing values are misleading (Zhang & Savalei, 2020). Even if fit were good, it is good practice to examine the modification indices to see whether some parameters are misspecified.

Inspection of the modification indices shows one very large value, MI = 146.04, for the residual correlation between the feeling thermometer and the preference ratings. There is a very plausible explanation for this finding: these two measures are very similar and can share method variance. For example, socially desirable responding could have the same effect on both ratings. This was the reason why I included only one of the two measures in my model. An alternative is to include both ratings and allow for a correlated residual to model the shared method variance.
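To make this modeling choice concrete, here is a minimal lavaan-style sketch with hypothetical variable names; the actual models include more indicators and the full missing-data setup, so this illustrates the syntax rather than the exact model I fitted.

library(lavaan)
# hypothetical sketch: explicit and implicit racial-bias factors with a
# correlated residual between the feeling thermometer and the preference
# rating to absorb their shared rating method variance
model = '
  explicit =~ thermometer + preference + mrs
  implicit =~ race_iat + amp + ept
  thermometer ~~ preference     # shared method variance of the two direct ratings
'
# fit = sem(model, data = dat, missing = "fiml")   # dat is a placeholder data set
# summary(fit, fit.measures = TRUE)
# modindices(fit)                                  # inspect modification indices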

As predicted by the MI, model fit improved, RMSEA = .006, CFI = .995. Vianello and Bar-Anan (2020) might object that this modification is post hoc, made after peeking at the data, while their model was specified theoretically. However, this argument is weak. If they really predicted theoretically that the feeling thermometer and direct preference ratings share no method variance, it is not clear what theory they had in mind. After all, shared rating biases are very common. Moreover, their model also assumes shared method variance between these measures, but it additionally predicts that this method variance influences dissimilar measures like the Modern Racism Scale and even ratings of other attitude objects. In short, neither their model nor my models are based on theories, in part because psychologists have neglected to develop and validate measurement theories. Even if it were theoretically predicted that feeling-thermometer and preference ratings do not share method variance, the large MI for this parameter would indicate that this theory is wrong. Thus, the data falsify this prediction. In the modified model, the implicit-explicit correlation increases from .80 to .90, providing even less support for the dual-attitude model.

Further inspection of the MI showed no plausible further improvements of the model. One important finding in this partial model is that there is no evidence of shared method variance between the AMP and EPT, r = -.04. Thus, closer inspection of the correlations among the racial attitude domain suggests two problems for Vianello and Bar-Anan’s model. There is evidence of shared method variance between two explicit measures and there is no evidence of shared method variance between two implicit measures, namely the AMP and EPT.

Next, I built a model for the political orientation domain, starting with the specification in Vianello and Bar-Anan’s model. Once more, overall fit appears to be good, RMSEA = .014, CFI = .989. In this model, the correlation between the implicit and explicit factor is r = .90. However, inspection of the MI again revealed a residual correlation between the feeling thermometer and preference ratings, MI = 91.91. Allowing for this shared method variance improved model fit, RMSEA = .012, CFI = .993, but had little effect on the implicit-explicit correlation, r = .91. In this model, there was some evidence of shared method variance between the AMP and EPT, r = .13.

Next, I put these two well-fitting models together, leaving each model unchanged. The only new question is how measures of racial bias should be related to measures of political orientation. It is common to allow trait factors to correlate freely. This is also what Vianello and Bar-Anan did, and I followed this common practice. Thus, there is no theoretical structure imposed on the trait correlations. I did not specify any additional relations for the method factors. If such relationships exist, this should lead to low fit. Model fit seemed to be good, RMSEA = .009, CFI = .982. The biggest MI was observed for the loading of the Modern Racism Scale (MRS) on the explicit political orientation factor, MI = 197.69. This is consistent with the item content of the MRS, which combines racism with conservative politics (e.g., being against affirmative action). For that reason, I included the MRS in my measurement model of political orientation (Schimmack, 2020).

Vianello and Bar-Anan (2020) criticize my use of the MRS. “For instance, Schimmack chose to omit one of the indirect measures—the SPF—from the models, to include the Modern Racism Scale (McConahay, 1983) as an indicator of political evaluation, and to omit the thermometer scales from two of his models. We assume that Schimmack had good practical or theoretical reasons for his modelling decisions; unfortunately, however, he did not include those reasons.” If they had inspected the MI, they would have seen that my decision to use the MRS as a different method to measure political orientation was justified by the data as well as by the item-content of the scale.

After allowing for this theoretically expected relationship, model fit improves, chi2(df = 231) = 506.93, RMSEA = .007, CFI = .990. Next, I examined whether the IAT method factor for racial bias is related to the IAT method factor for political orientation. Adding this relationship did not improve fit, chi2(230) = 506.65, RMSEA = .007, CFI = .990. More important, the correlation was not significant, r = -.06. This is a problem for Vianello and Bar-Anan’s model, which assumes the two method factors are identical. To test this hypothesis, I fitted a model with a single IAT method factor. This model had worse fit, chi2(231) = 526.99, RMSEA = .007, CFI = .989. Thus, there is no evidence for a general IAT method factor.

I next explored the possibility of a method factor for the explicit measures. I had identified shared method variance for the feeling thermometer and preference ratings for racial bias and for political orientation. I now modeled this shared method variance with method factors and let the two method factors correlate with each other. The addition of a correlation did not improve model fit, chi2(230) = 506.93, RMSEA = .007, CFI = .990 and the correlation between the two explicit method factors was not significant, r = .00. Imposing a single method factor for both attitude domains reduced model fit, chi2(df = 229) = 568.27, RMSEA = .008, CFI = .987.

I also tried to fit a single method factor for the AMP and EPT. The model only converged by constraining two loadings. Then model fit improved slightly, chi2(df = 230) = 501.75, RMSEA = .007, CFI = .990. The problem for Vianello and Bar-Anan is that the better fit was achieved with a negative loading on the method factor. This is inconsistent with the idea that a general method factor inflates correlations across attitude domains.

In sum, there is no evidence that method factors are consistent across the two attitude domains. Therefore I retained the basic model that specified method variance within attitude domains. I then added the three criterion variables to the model. As in Vianello and Bar-Anan’s model, contact was regressed on the explicit and implicit racial bias factor and previous voting and intention to vote were regressed on the explicit and implicit political orientation factors. The residuals were allowed to correlate freely, as in Vianello and Bar-Anan’s model.
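In lavaan-style syntax, the criterion part of this model looks roughly like the sketch below; the factor and variable names are placeholders, and the measurement part defined earlier is omitted.

# hypothetical sketch of the criterion part: each criterion is regressed on the
# explicit and implicit factor of the relevant attitude domain, and the
# criterion residuals are allowed to correlate freely
criterion.part = '
  contact ~ explicit_race + implicit_race
  past_voting ~ explicit_pol + implicit_pol
  voting_intention ~ explicit_pol + implicit_pol
  contact ~~ past_voting + voting_intention
  past_voting ~~ voting_intention
'
# full.model = paste(measurement.model, criterion.part)   # combine with the measurement part
# fit = sem(full.model, data = dat, missing = "fiml")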

Overall model fit decreased slightly for CFI, chi2(df = 297) = 668.61, RMSEA = .007, CFI = .988. MI suggested an additional relationship between the explicit political orientation factor and racial contact. Modifying the model accordingly improved fit slightly, chi2(df = 296) = 660.59, RMSEA = .007, CFI = .988. There were no additional MI involving the two voting measures.

Results were different from Vianello and Bar-Anan’s results. They reported that the implicit factors had incremental predictive validity for all three criterion measures.

In contrast, the model I am developing here shows no incremental predictive validity for the implicit factors.

It is important to note that I created the measurement model before I examined predictive validity. After the measurement model was created, the criterion variables were added and the data determined the pattern of results. It is unclear how Vianello and Bar-Anan developed a measurement model with non-existent method factors that produced the desired outcome of significant incremental validity.

To try to reproduce their full result, I also added self-esteem measures to the model. To do so, I first created a measurement model for the self-esteem measures. The basic measurement model had poor fit, chi2(df = 58) = 434.49, RMSEA = .019, CFI = .885. Once more, the MI suggested that feeling-thermometer and preference ratings shared method variance. Allowing for this residual correlation increased model fit, chi2(df = 57) = 165.77, RMSEA = .010, CFI = .967. Another MI suggested a loading of the speeded task on the implicit factor, MI = 54.59. Allowing for this loading further improved model fit, chi2(df = 56) = 110.01, RMSEA = .007, CFI = .983. The crucial correlation between the explicit and implicit factor was r = .36. The correlation in Vianello and Bar-Anan’s model was r = .30.

I then added the self-esteem model to the model with the other two attitude domains, chi2(df = 695) = 1309.59, RMSEA = .006, CFI = .982. Next I added correlations of the IAT method factor for self-esteem with the two other IAT-method factors. This improved model fit, chi2(df = 693) = 1274.59, RMSEA = .006, CFI = .983. The reason was a significant correlation between the IAT method factors for self-esteem and racial bias. I offered an explanation for this finding in my article. Most White respondents associate self with good and White with good. If some respondents are better able to control their automatic tendencies, they will show less pro-self and pro-White biases. In contrast, Vianello and Bar-Anan have no theoretical explanation for a shared method factor across attitude domains. There was no significant correlation between IAT method factors for self-esteem and political orientation. The reason is that political orientation has more balanced automatic tendencies so that method variance does not favor one direction over the other.

This model had better fit with fewer parameters than Vianello and Bar-Anan’s model, chi2(df = 679) = 1719.39, RMSEA = .008, CFI = .970. The critical results of predictive validity remained unchanged.

I also fitted Vianello and Bar-Anan’s model and added four parameters that I identified as missing from their model: (a) the loading of the MRS on the explicit political orientation factor and (b) the correlations between feeling-thermometer and preference ratings for each domain. Making these adjustments improved model fit considerably, chi2(df = 675) = 1235.59, RMSEA = .006, CFI = .984. This modest adjustment altered the pattern of results for the prediction of the three criterion variables. Unlike Vianello and Bar-Anan’s model, the implicit factors no longer predicted any of the three criterion variables.

Conclusion

My interactions with Vianello and Bar-Anan are symptomatic of social psychologists’ misapplication of the scientific method. Rather than using data to test theories, data are being abused to confirm pre-existing beliefs. This confirmation bias goes against philosophies of science that have demonstrated the need to subject theories to strong tests and to allow data to falsify theories. Verificationism is so ingrained in social psychology that Vianello and Bar-Anan ended up with a model that showed significant incremental predictive validity for all three criterion measures, even though this model made several questionable assumptions. They may object that I am biased in the opposite direction, but I presented clear justifications for my modeling decisions and my model fits better than their model. In my 2020 article, I showed that Bar-Anan also co-authored another article that exaggerated evidence of predictive validity, evidence that disappeared when I reanalyzed the data (Greenwald, Smith, Sriram, Bar-Anan, & Nosek, 2009). Ten years later, social psychologists claim that they have improved their research methods, but Vianello and Bar-Anan’s commentary in 2020 shows that social psychologists have a long way to go. If social psychologists want to (re)gain trust, they need to be willing to discard cherished theories that are not supported by data.

References

Bar-Anan, Y., & Vianello, M. (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General, 147(8), 1264–1272. https://doi.org/10.1037/xge0000383

Ehrlinger, J., Johnson, K., Banner, M., Dunning, D., & Kruger, J. (2008). Why the unskilled are unaware: Further explorations of (absent) self-insight among the incompetent. Organizational Behavior and Human Decision Processes, 105(1), 98–121. https://doi.org/10.1016/j.obhdp.2007.05.002

Greenwald, A. G., Smith, C. T., Sriram, N., Bar-Anan, Y., & Nosek, B. A. (2009). Implicit race attitudes predicted vote in the 2008 U.S. Presidential election. Analyses of Social Issues and Public Policy (ASAP), 9(1), 241–253. https://doi.org/10.1111/j.1530-2415.2009.01195.x

Schimmack, U. (2019). The Implicit Association Test: A method in search of a construct. Perspectives on Psychological Science. https://doi.org/10.1177/1745691619863798

Vianello, M., & Bar-Anan, Y. (2020). Can the Implicit Association Test measure automatic judgment? The validation continues. Perspectives on Psychological Science. https://doi.org/10.1177/1745691619897960

Zhang, X., & Savalei, V. (2020). Examining the effect of missing data on RMSEA and CFI under normal theory full-information maximum likelihood. Structural Equation Modeling: A Multidisciplinary Journal, 27(2), 219–239. https://doi.org/10.1080/10705511.2019.1642111

Cross-Cultural Comparisons of Personality: Beware of Method Factors

Ulrich Schimmack
Shigehiro Oishi

Abstract

Personality ratings on a 25-item Big Five measure from two national samples (US, Japan) were analyzed with an item-level measurement model that separates method factors (acquiescence, halo bias) from trait factors. The results reveal a strong influence of halo bias on US responses that distorts cultural comparisons of personality. After correcting for halo bias, Japanese respondents were more conscientious, extraverted, and open to experience, and less neurotic and agreeable, than US respondents. The results support cultural differences in positive illusions and raise questions about the validity of studies that rely on scale means to examine cultural differences in personality.

Introduction

Cultural stereotypes imply cross-cultural differences in personality traits. However, cross-cultural studies of personality do not support the validity of these cultural stereotypes (Terracciano et al., 2005). Whenever two measures produce divergent results, it is necessary to examine the sources of these discrepancies. One obvious reason could be that cultural stereotypes are simply wrong. It is also possible that scientific studies of personality across cultures produce misleading results (Perugini & Richetin, 2007). One problem for empirical studies of cross-cultural differences in personality is that cultural differences tend to be small. Culture explains at most 10% of the variance, and often the percentages are much smaller. For example, McCrae et al. (2010) found that culture explained only 1.5% of the variance in agreeableness ratings. As some of this variance is method variance, the variance due to actual differences in agreeableness is likely to be less than 1%. With small amounts of valid variance, method factors can have a strong influence on the pattern of mean differences across cultures.

One methodological problem in cross-cultural studies of personality is that personality measures are developed with a focus on the correlations of items with each other within a population. The item means are not relevant, with the exception that items should avoid floor or ceiling effects. However, cross-cultural comparisons rely on differences in the item means. As item means have not been subjected to psychometric evaluation, it is possible that item means lack construct validity. Take “working hard” as an example. How hard people work could be influenced by culture. For example, in poor cultures people have to work harder to make a living. The item “working hard” may correctly reflect variation in conscientiousness within poor cultures and within rich cultures, but the differences between cultures would reflect environmental conditions rather than conscientiousness. As a result, it is necessary to demonstrate that cultural differences in item means are valid measures of cultural differences in personality.

Unfortunately, obtaining data from a large sample of nations is difficult and sample sizes are often rather small. For example, McCrae et al. (2010) examined convergent validity of Big Five scores with 18 nations. The only significant evidence of convergent validity was obtained for neuroticism, r = .44, and extraversion, r = .45. Openness and agreeableness even produced small negative correlations, r = -.27, r = -.05, respectively. The largest cross-cultural studies of personality had 36 overlapping nations (Allik et al., 2017; Schmitt et al., 2007). The highest convergent validity was r = .4 for extraversion and conscientiousness. Low convergent validity, r = .2, was observed for neuroticism and agreeableness, and the convergent validity for openness was 0 (Schimmack, 2020). These results show the difficulty of measuring personality across cultures and the lack of validated measures of cultures’ personality profiles.

Method Factors in Personality Measurement

It is well-known that self-ratings of personality are influenced by method factors. One factor is a stylistic factor in the use of response formats known as acquiescence bias (Cronbach, 1942, 1965). The other factor reflects individual differences in responding to the evaluative meaning of items known as halo bias (Thorndike, 1920). Both method factors can distort cross-cultural comparisons. For example, national stereotypes suggest that Japanese individuals are more conscientious than US American individuals, but mean scores of conscientiousness in cross-cultural studies do not confirm this stereotype (Oishi & Roth, 2009). Both method factors may artificially lower Japan’s mean score because Japanese respondents are less likely to use extreme scores (Min, Cortina, & Miller, 2016) and Asians are less likely to inflate their scores on desirable traits (Kim, Schimmack, & Oishi, 2012). In this article, we used structural equation modeling to separate method variance from trait variance to distinguish cultural differences in response tendencies from cultural differences in personality traits.

Convenience Samples versus National Samples

Another problem for empirical studies of national differences is that psychologists often rely on convenience samples. The problem with convenience samples is that personality can change with age and that there are regional differences in personality within nations (). For example, a sample of students at New York University may differ dramatically from a student sample at Mississippi State University or Iowa State University. Although regional differences tend to be small, so are national differences. Thus, small regional differences can bias national comparisons. To avoid these biases, it is preferable to compare national samples that cover all regions of a nation and a broad age range.

Modeling Approach

The purpose of our study is to advance research on cultural differences in personality by comparing a Japanese and a US national sample that completed the same Big Five personality questionnaire using a measurement model that distinguishes personality factors and method factors. The measurement model is an improved version of Anusic et al.’s (2009) halo-alpha-beta model (Schimmack, 2019). The model is essentially a tri-factor model.

Figure 1

That is, each item loads on three factors, namely (a) a primary loading on one of the Big Five factors, (b) a loading on an acquiescence bias factor, and (c) a loading on the evaluative bias/halo factor. As Big Five measures typically do not show a simple structure, the model can also include secondary loadings on other Big Five factors. This measurement model has been successfully fitted to several Big Five questionnaires (Schimmack, 2019). This is the first time the model has been applied in a multiple-group analysis to compare measurement models for US and Japanese samples.
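A minimal sketch of this measurement model in MPLUS notation looks roughly like this (item names follow the n1-n5, e1-e5, o1-o5, a1-a5, c1-c5 convention of Table 1; secondary loadings, reverse-keyed items, and the exact identification constraints of the actual model are omitted, and treating the method factors as uncorrelated with the trait factors is part of the sketch):

MODEL:
neu BY n1-n5; ext BY e1-e5; ope BY o1-o5; agr BY a1-a5; con BY c1-c5; ! Big Five trait factors
halo BY n1-n5 e1-e5 o1-o5 a1-a5 c1-c5; ! evaluative bias (halo) factor with loadings on all items
acq BY n1-n5@1 e1-e5@1 o1-o5@1 a1-a5@1 c1-c5@1; ! acquiescence factor with unit loadings
halo WITH neu@0; halo WITH ext@0; halo WITH ope@0; halo WITH agr@0; halo WITH con@0; halo WITH acq@0;
acq WITH neu@0; acq WITH ext@0; acq WITH ope@0; acq WITH agr@0; acq WITH con@0; ! method factors uncorrelated with traits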

We first fitted a very restrictive model that assumed invariance across the two samples. Given the lack of psychometric cross-cultural comparisons, we expected that this model would not have acceptable fit. We then modified the model to allow for cultural differences in some primary factor loadings, secondary factor loadings, and item intercepts. This step makes our work exploratory. However, we believe that this exploratory work is needed as a first step towards psychometrically sound measurement of cultural differences.

Participants

Participants (N = 952 Japanese, 891 US) were recruited by Nikkei Research Inc. and its U.S. affiliate using a national probabilistic sampling method based on gender and age. The mean age was 44. The data have been used before to compare the influence of personality on life-satisfaction judgments, but without comparing mean levels in personality and life-satisfaction (Kim, Schimmack, Oishi, & Tsutsui, 2018).

Measures

The Big Five items were taken from the International Personality Item Pool (Goldberg et al., 2006). There were five items for each of the Big Five dimensions (Table 1).

Results

We first fitted a model without a mean structure to the data. A model with strict invariance for the two samples did not have acceptable fit using RMSEA < .06 and CFI > .95 as criterion values, RMSEA = .064, CFI = .834. However, CFI values should not be expected to reach .95 in models with single-item indicators (Anusic et al., 2009). Therefore, the focus is on RMSEA. We first examined modification indices (MI) of primary loadings. We used MI > 30 as a criterion to free parameters to avoid overfitting the model. We found seven primary loadings that would improve model fit considerably (n4, e3, a1, a2, a3, a4, c4). Freeing these parameters improved the model (RMSEA = .060, CFI = .857). We next examined loadings on the halo factor because it is likely that some items differ in their connotative meaning across languages. However, we found only two notable MIs (o1, c4). Freeing these parameters improved model fit (RMSEA = .057, CFI = .871). We identified six secondary loadings that differed notably across cultures. One was a secondary loading on neuroticism (e4), four were secondary loadings on agreeableness (n5, e1, e3, o4), and one was a secondary loading on conscientiousness (n3). Freeing these parameters improved model fit (RMSEA = .052, CFI = .894). We were satisfied with this measurement model and continued with the means model. The first model fixed the item intercepts and factor means to be identical across samples. This model had worse fit than the model without a means structure (RMSEA = .070, CFI = .803). The biggest MI was observed for the mean of the halo factor. Allowing for mean differences in halo improved model fit considerably (RMSEA = .060, CFI = .849). MIs next suggested allowing for mean differences in extraversion and agreeableness. We next allowed for mean differences in the other factors. This further improved model fit (RMSEA = .058, CFI = .864), but not as much. MIs suggested seven items with different item intercepts (n1, n5, e3, o3, a5, c3, c5). Relaxing these parameters improved model fit close to the level of the model without a mean structure (RMSEA = .053, CFI = .888).
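For readers unfamiliar with this workflow in MPLUS, a hedged sketch of the relevant commands is shown below; it only illustrates how modification indices are requested and how a loading (e3) and an intercept (c3) can be freed in the Japanese group, not the full input we used (the numeric country variable and its coding are assumptions).

VARIABLE: GROUPING = country (1 = US 2 = Japan); ! assumes a numeric grouping variable
OUTPUT: MODINDICES(30); ! print modification indices larger than 30
MODEL Japan: ! group-specific statements free parameters that are otherwise held equal across groups
ext BY e3; ! free the primary loading of e3 in the Japanese sample
[c3]; ! free the intercept of item c3 in the Japanese sample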

Table 1 shows the primary loadings and the loadings on the halo factor for the 25 items.

Table 1

The results show very similar primary loadings for most items. This means that the factors have a similar meaning in the two samples and that it is possible to compare the two cultures. Nevertheless, there are some differences that could bias comparisons based on item sum scores. The item “feeling comfortable around people” loads much more strongly on the extraversion factor in the US than in Japan. The agreeableness items “insult people” and “sympathize with others’ feelings” also load more strongly in the US than in Japan. Finally, “making a mess of things” is a conscientiousness item in the US, but not in Japan. The fact that item loadings in the US are more consistent with the theoretical structure can be attributed to the items having been developed in the US.

A novel and important finding is that most loadings on the halo factor are also very similar across nations. For example, the item “have excellent ideas” shows a high loading in both the US and Japan. This finding contradicts the idea that evaluative biases are culture-specific (Church et al., 2014). The only notable difference is the item “make a mess of things,” which has no notable loading on the halo factor in Japan. Even in English, the meaning of this item is ambiguous, and future studies should replace it with a better item. The correlation between the halo loadings for the two samples is high, r = .96.

Table 2 shows the item means and the item intercepts of the model.

Table 2

The item means of the US sample are strongly correlated with the loadings on the halo factor, r = .81. This is a robust finding in Western samples. More desirable items are endorsed more. The reason could be that individuals actually act in desirable ways most of the time and that halo bias influences item means. Surprisingly, there is no notable correlation between item means and loadings on the halo factor for the Japanese sample, r = .08. This pattern of results suggests that US means are much more strongly influenced by halo bias than Japanese means. Further evidence is provided by inspecting the mean differences. For desirable items (low N, high E, O, A, & C), US means are always higher than Japanese means. For undesirable items, the US means are always lower than Japanese means, except for the item “stay in the background,” where the means are identical. The difference scores are also positively correlated with the halo loadings, r = .90. In conclusion, there is strong evidence that halo bias distorts the comparison of personality in these two samples.

The item intercepts show cultural differences in items after taking cultural differences in halo and the other factors into account. Notable differences were observed for some items. Even after controlling for halo and extraversion, US respondents report higher levels of being comfortable around people than Japanese respondents. This difference fits cultural stereotypes. After correcting for halo bias, Japanese respondents now score higher on getting chores done right away than Americans. This also fits cultural stereotypes. However, Americans still report paying more attention to detail than Japanese respondents, which is inconsistent with cultural stereotypes. Extensive validation research is needed to examine whether these results reflect actual cultural differences in personality and behaviours.

Figure 2 shows the mean differences on the Big Five factors and the two bias factors.

Figure 2

Figure 2 shows a very large difference in halo bias. The difference is so large that it seems implausible. Maybe the model is overcorrecting, which would bias the mean differences for the actual traits in the opposite direction. There is little evidence of cultural differences in acquiescence bias. One open question is whether the strong halo effect is entirely due to evaluative biases. It is also possible that a modesty bias plays a role, because modesty implies less extreme endorsement of desirable items and less extreme rejection of undesirable items. To separate the two, it would be necessary to include frequent and infrequent behaviours that are not evaluative.

The most interesting result for the Big Five factors is that the Japanese sample scores higher in conscientiousness than the US sample after halo bias is removed. This reverses the mean difference found in this sample and in previous studies that show higher conscientiousness for US than Japanese samples (). The present results suggest that halo bias masks the actual difference in conscientiousness. However, other results are more surprising. In particular, the present results suggest that Japanese people are more extraverted than Americans. This contradicts cultural stereotypes and previous studies. The problem is that cultural stereotypes could be wrong and that previous studies did not control for halo bias. More research with actual behaviours and less evaluative items is needed to draw strong conclusions about personality differences between cultures.

Discussion

It has been known for 100 years that self-ratings of personality are biased by connotative meaning. At least in North America, it is common to see a strong correlation between the desirability of items and the means of self-ratings. There is also consistent evidence that Americans rate themselves more favorably than they rate the average American (). However, this does not mean that Americans see themselves as better than everybody else. In fact, self-ratings tend to be slightly less favorable than ratings of friends or family members (), indicating a general evaluative bias to rate oneself and close others favorably.

Given the pervasiveness of evaluative biases in personality ratings, it is surprising that halo bias has received so little attention in cross-cultural studies of personality. One reason could be the lack of a good method to measure and remove halo variance from personality ratings. Despite early attempts to detect socially desirable responding, lie scales have shown little validity as bias measures (ref). The problem is that manifest scores on lie scales contain as much valid personality variance as bias variance. Thus, correcting for scores on these scales literally throws out the baby (valid variance) with the bathwater (bias variance). Structural equation modeling (SEM) solves this problem by splitting observed variances into unobserved or latent variances. However, personality psychologists have been reluctant to take advantage of SEM because item-level models require large samples and theoretical models were too simplistic and produced bad fit. Informed by multi-rater studies that emerged in the 1990s, we developed a measurement model of the Big Five that separates personality variance from evaluative bias variance (Anusic et al., 2009; Kim, Schimmack, & Oishi, 2012; Schimmack, 2019). Here we applied this model for the first time to cross-cultural data to examine whether cultures differ in halo bias. The results suggest that halo bias has a strong influence on personality ratings in the US, but not in Japan. The differences in halo bias distort comparisons on the actual personality traits. While raw scores suggest that Japanese people are less conscientious than Americans, the corrected factor means suggest the opposite. Japanese participants also appeared to be less neurotic and more extraverted and open to experience, which was a surprising result. Correcting for halo bias did not change the cultural difference in agreeableness. Americans were more agreeable than Japanese with and without correction for halo bias. Our results do not provide a conclusive answer about cultural differences in personality, but they shed new light on several questions in personality research.

Cultural Differences in Self-enhancement

One unresolved question in personality psychology is whether positive biases in self-perceptions, also known as self-enhancement, are unique to American or Western cultures or whether they are a universal phenomenon (Church et al., 2016). One problem is that there are different approaches to the measurement of self-enhancement. The most widely used method is social comparison, where individuals compare themselves to an average person. These studies tend to show a persistent better-than-average effect in all cultures (ref). However, this finding does not imply that halo biases are equally strong in all cultures. Brown and Kobayashi (2002) found better-than-average effects in the US and Japan, but Japanese ratings of the self and others were less favorable than those in the US. Kim et al. (2012) explain this pattern with a general norm to be positive in North America that influences ratings of the self as well as ratings of others. Our results are consistent with this view and suggest that self-enhancement is not a universal tendency. More research with other cultures is needed to examine which cultural factors moderate halo biases.

Rating Biases or Self-Perception Biases

An open question is whether halo biases are mere rating biases or reflect distorted self-perceptions. One model suggests that participants are well aware of their true personality, but merely present themselves in a more positive light to others. Another model suggests that individuals truly believe that their personality is more desirable than it actually is. It is not easy to distinguish between these two models empirically.

Halo Bias and the Reference Group Effect

In an influential article, Heine et al. (2002) criticized cross-cultural comparisons in personality ratings as invalid. The main argument was that respondents adjust the response categories to cultural norms. This adjustment was called the reference group effect. For example, the item “insult people” is not answered based on the frequency of insults or a comparison of the frequency of insults to other behaviours. Rather it is answered in comparison to the typical frequency of insults in a particular culture. The main prediction made by the reference group effect is that responses in all cultures should cluster around the mid-point of a Likert-scale that represents the typical frequency of insults. As a result, cultures could differ dramatically in the actual frequency of insults, while means on the subjective rating scales are identical.

The present results are inconsistent with a simple reference group effect. Specifically, the US sample showed notable variation in item means that was related to item desirability. As a result, undesirable items like “insult people” had a much lower mean, M = 1.83, than the mid-point of the scale (3), and desirable items like “have excellent ideas” had a higher mean (M = 3.73) than the mid-point of the scale. This finding suggests that halo bias rather than a reference group effect threatens the validity of cross-cultural comparisons.

Reference group effects may play a bigger role in Japan. Here, item means were not related to item desirability and clustered more closely around the mid-point of the scale. The highest mean was 3.56 for worry and the lowest mean was 2.45 for feeling comfortable around people. However, other evidence contradicts this hypothesis. After removing effects of halo and the other personality factors, item intercepts were still highly correlated across the two national samples, r = .91. This finding is inconsistent with culture-specific reference groups, which would not produce consistent item intercepts.

Our results also provide a new explanation for the low conscientiousness of Japanese samples. A reference group effect would not predict a significantly lower level of conscientiousness. However, a stronger halo effect in the US explains this finding because conscientiousness is typically assessed with desirable items. Our results are also consistent with the finding that self-esteem and self-enhancement are more pronounced in the US than in Japan (Heine & Buchtel, 2009). These aforementioned biases inflate conscientiousness scores in the US. After removing this bias, Japanese rate themselves as more conscientious than US Americans.

Limitations and Future Directions

We echo previous calls for validation of personality scores of nations (Heine & Buchtel, 2009). The current results are inconsistent across questionnaires and even the low level of convergent validity may be inflated by cultural differences in response styles. Future studies should try to measure personality with items that minimize social desirability and use response formats that avoid the use of reference groups (e.g., frequency estimates). Moreover, results based on ratings should be validated with objective indicators of behaviours.

Future research also needs to take advantage of developments in psychological measurement and use models that can identify and control for response artifacts. The present model demonstrates the ability to separate evaluative bias or halo variance from actual personality variance. Future studies should use this model to compare a larger number of nations.

The main limitation of our study is the relatively small number of items. The larger the number of items, the easier it is to distinguish item-specific variance, method variance, and trait variance. The measure also did not properly take into account that the Big Five are higher-order factors of more basic traits called facets. Measures like the BFI-2 or the NEO-PI-3 should be used to study cultural differences at the facet level, which often show unique influences of culture that are different from effects on the Big Five (Schimmack, 2020).

We conclude with a statement of scientific humility. The present results should not be taken as clear evidence about cultural differences in personality. Our article is merely a small step towards the goal of measuring personality differences across cultures. One obstacle in revealing such differences is that national differences appear to be relatively small compared to the variation in personality within nations. One possible explanation for this is that variation in personality is caused more by biological than by cultural factors. For example, twin studies suggest that 40% of the variance in personality traits is caused by genetic variation within a population, whereas cross-cultural studies suggest that at most 10% of the variance is caused by cultural influences on population means. Thus, while uncovering cultural variation in personality is of great scientific interest, evidence of cultural differences between nations should not be used to stereotype individuals from different nations. Finally, it is important to distinguish between personality traits that are captured by the Big Five and other personality attributes like attitudes, values, or goals that may be more strongly influenced by culture. The key novel contribution of this article is to demonstrate that cultural differences in response styles exist and distort national comparisons of personality with simple scale means. Future studies need to take response styles into account.

References

Cronbach, L. J. (1942). Studies of acquiescence as a factor in the true-false test. Journal of Educational Psychology, 33(6), 401–415. https://doi.org/10.1037/h0054677

Heine, S. J., & Buchtel, E. E. (2009). Personality: The universal and the culturally specific. Annual Review of Psychology, 60, 369–394. https://doi.org/10.1146/annurev.psych.60.110707.163655

Perugini, M., & Richetin, J. (2007). In the land of the blind, the one-eyed man is king. European Journal of Personality, 21(8), 977–981. https://doi.org/10.1002/per.649

Schimmack, U. (2020). Personality science: The science of human diversity. TopHat, 978-1-77412-253-2.    https://tophat.com/marketplace/social-science/psychology/full-course/personality-science-the-science-of-human-diversity-ulrich-schimmack/4303/

Terracciano, A., et al. (2005). National character does not reflect mean personality trait levels in 49 cultures. Science, 310, 96–100.

JPSP:PPID = Journal of Pseudo-Scientific Psychology: Pushing Paradigms – Ignoring Data

Abstract

Ulrich Orth, Angus Clark, Brent Donnellan, Richard W. Robins (DOI: 10.1037/pspp0000358) present 10 studies that show the cross-lagged panel model (CLPM) does not fit the data. This does not stop them from interpreting a statistical artifact of the CLPM as evidence for their vulnerability model of depression. Here I explain in great detail why the CLPM does not fit the data and why it creates an artifactual cross-lagged path from self-esteem to depression. It is sad that the authors, reviewers, and editors were blind to the simple truth that a bad-fitting model should be rejected and that it is unscientific to interpret parameters of models with bad fit. Ignorance of basic scientific principles in a high-profile article reveals poor training and understanding of the scientific method among psychologists. If psychology wants to gain respect and credibility, it needs to take scientific principles more seriously.

Introduction

Psychology is in a crisis. Researchers are trained within narrow paradigms, methods, and theories that populate small islands of researchers. The aim is to grow the island and to become a leading and popular island. This competition between islands is rewarded by an incentive structure that imposes the reward structure of capitalism on science. The winner gets to dominate the top journals that are mistaken for outlets of quality. However, just like Coke is not superior to Pepsi (sorry Coke fans), the winner is not better than the losers. They are just market leaders for some time. No progress is being made because the dominant theories and practices are never challenged and replaced with superior ones. Even the past decade that has focused on replication failures has changed little in the way research is conducted and rewarded. Quantity of production is rewarded, even if the products fail to meet basic quality standards, as long as naive consumers of research are happy.

This post is about the lack of training in the analysis of longitudinal data with a panel structure. A panel study essentially repeats the measurement of one or several attributes several times. Nine years of undergraduate and graduate training leave most psychologists without any training in how to analyze these data. This explains why the cross-lagged panel model (CLPM) was criticized four decades ago (Rogosa, 1980), but researchers continue to use it with the naive assumption that it is a plausible model to analyze panel data. Critical articles are simply ignored. This is the preferred way of dealing with criticism by psychologists. Here, I provide a detailed critique of the CLPM using Orth et al.’s data (https://osf.io/5rjsm/) and simulations.

Step 1: Examine your data

Psychologists are not trained to examine correlation matrices for patterns. They are trained to submit their data to pre-specified (cookie-cutter) models and hope that the data fit the model. Even if the model does not fit, results are interpreted because researchers are not trained in modifying cookie-cutter models to explore reasons for bad fit. To understand why a model does not fit the data, it is useful to inspect the actual pattern of correlations.

To illustrate the benefits of visual inspection of the actual data, I am using the data from the Berkeley Longitudinal Study (BLS), which is the first dataset listed in Orth et al.’s (2020) table that lists 10 datasets.

To ease interpretation, I break up the correlation table into three components, namely (a) correlations among self-esteem measures (se1-se4 with se1-se4), (b) correlations among depression measures (de1-de4 with de1-de4), and (c) correlations of self-esteem measures with depression measures (se1-se4 with de1-de4).

Table 1

Table 1 shows the correlation matrix for the four repeated measurements of self-esteem. The most important information in this table is how much the magnitude of the correlations decreases along the diagonals that represent different time lags. For example, the lag-1 correlations are .76, .79, and .74, which approximately average to a value of .76. The lag-2 correlations are .65 and .69, which averages to .67. The lag-3 correlation is .60.

The first observation is that correlations are getting weaker as the time-lag gets longer. This is what we would expect from a model that assumes self-esteem actually changes over time, rather than just fluctuating around a fixed set-point. The latter model implies that retest correlations remain the same over different time lags. So, we do have evidence that self-esteem changes over time, as predicted by the cross-lagged panel model.

The next question is how much retest correlations decrease with increasing time lags. The difference from lag-1 to lag-2 is .74 – .67 = .07. The difference from lag-2 to lag-3 is .67 – .60, which is also .07. This shows no leveling off of the decrease in these data. It is possible that the next wave would produce a lag-4 correlation of .53, which would be .07 lower than the lag-3 correlation. However, a difference of .07 is not very different from 0, and a difference of 0 would imply that change asymptotes at .60. The data are simply insufficient to provide strong information about this.

The third observation is that the lag-2 correlation is much stronger than the square of the lag-1 correlation, .67 > .74^2 = .55. Similarly, the lag-3 correlation is stronger than the product of the lag-1 and lag-2 correlations, .60 > .74 * .67 = .50. This means that a simple autoregressive model with observed variables does not fit the data. However, this is exactly the model of Orth et al.’s CLPM.

It is easy to examine the fit of this part of the CLPM by fitting an autoregressive model to the self-esteem panel data.

Model:
se2-se4 PON se1-se3; ! This command regresses each measure on the previous measure (n on n-1).
! There is one thing I learned from Orth et al., and it was the PON command of MPLUS

Table 2

Table 2 shows the fit of the autoregressive model. While CFI meets the conventional threshold of .95 (higher is better), RMSEA shows terrible fit of the model (.06 or lower is considered acceptable). This is a problem for cookie-cutter researchers who think the CLPM is a generic model that fits all data. Here we see that the model makes unrealistic assumptions, and we already know what the problem is based on our inspection of the correlation table. The model predicts more change than the data actually show. We are therefore in a good position to reject the CLPM as a viable model for these data. This is actually a positive outcome. The biggest problem in correlational research is data that fit all kinds of models. Here we have data that actually disconfirm some models. Progress can be made, but only if we are willing to abandon the CLPM.

Now let’s take a look at the depression data, following the same steps as for the self-esteem data.

Table 3

The average lag-1 correlation is .43. The average lag-2 correlation is .45, and the lag-3 correlation is .40. These results are problematic for an autoregressive model because the lag-2 correlation is not even lower than the lag-1 correlation.

Once more, it is hard to tell whether retest correlations are approaching an asymptote. In this case, the decrease from lag-1 to lag-2 is -.02 (i.e., a slight increase) and the decrease from lag-2 to lag-3 is .05.

Finally, it is clear that the autoregressive model with manifest variables overestimates change. The lag-2 correlation is stronger than the square of the lag-1 correlations, .45 > .43^2 = .18, and the lag-3 correlation is stronger than the lag-1 * lag-2 correlation, .40 > .43*.45 = .19.

Given these results, it is not surprising that the autoregressive model fits the depression data even worse than it fit the self-esteem data (Table 4).

Model:
de2-de4 PON de1-de3; ! regress each depression measure on the previous one.

Table 4

Even the CFI value is now in the toilet and the RMSEA value is totally unacceptable. Thus, the basic model of stability and change implemented in CLPM is inconsistent with the data. Nobody should proceed to build a more complex, bivariate model if the univariate models are inconsistent with the data. The only reason why psychologists do so all the time is that they do not think about CLPM as a model. They think CLPM is like a t-test that can be fitted to any panel data without thinking. No wonder psychology is not making any progress.

Step 2: Find a Model That Fits the Data

The second step may seem uncontroversial. If one model does not fit the data, there is probably another model that does fit the data, and this model has a higher chance of being the model that reflects the causal processes that produced the data. However, psychologists have an uncanny ability to mess up even the simplest steps in data analysis. They have convinced themselves that it is wrong to fit models to data. The model has to come first so that the results can be presented as confirming a theory. However, what is the theoretical rationale of the CLPM? It is not motivated by any theory of development, stability, or change. It is as atheoretical as any other model. It only has the advantage that it became popular on an island of psychology, and now people use it without being questioned about it. Convention and conformity are not pillars of science.

There are many alternative models to the CLPM that can be tried. One model is 60 years old and was introduced by Heise (1969). It is also an autoregressive model, but one that additionally allows for occasion-specific variance. That is, some factors may temporarily change individuals’ self-esteem or depression without any lasting effects on future measurements. This is a particularly appealing idea for a symptom checklist of depression that asks about depressive symptoms in the past four weeks. Maybe somebody’s cat died or it was a midterm period and depressive symptoms were present for a brief period, but these factors have no influence on depressive symptoms a year later.

I first fitted Heise’s model to the self-esteem data.

MODEL:
sse1 BY se1@1;
sse2 BY se2@1;
sse3 BY se3@1;
sse4 BY se4@1;
sse2-sse4 PON sse1-sse3 (stability);
se1-se4 (se_osv); ! occasion-specific variance in self-esteem

Model fit for this model is perfect. Even the chi-square test is not significant (which in SEM is a good thing, because it means the model closely fits the data).

Model results show that there is significant occasion-specific variance. After taking this variance into account, the stability of the variance that is not occasion-specific, called state variance by Heise, is around r = .9 from one occasion to the next.

Fit for the depression data is also perfect.

There is even more occasion-specific variance in depressive symptoms, but the non-occasion-specific variance is even more stable than the non-occasion-specific variance in self-esteem.

These results make perfect sense if we think about the way self-esteem and depression are measured. Self-esteem is measured with a trait measure of how individuals see themselves in general, ignoring ups and downs and temporary shifts in self-esteem. In contrast, depression is assessed with questions about a specific time period and respondents are supposed to focus on their current ups and downs. Their general disposition should be reflected in these judgments only to the extent that it influences their actual symptoms in the past weeks. These episodic measures are expected to have more occasion specific variance if they are valid. These results show that participants are responding to the different questions in different ways.

In conclusion, model fit and the results favor Heise’s model over the cookie-cutter CLPM.

Step 3: Putting the two autoregressive models together

Let’s first examine the correlations of self-esteem measures with depression measures.

The first observation is that the same-occasion correlations are stronger (more negative) than the cross-occasion correlations. This suggests that occasion specific variance in self-esteem is correlated with occasion specific variance in depression.

The second observation is that the lagged self-esteem to depression correlations (e.g., se1 with de2) do not become weaker (less negative) with increasing time lag, lag-1 r = -.36, lag-2 r = -.32, lag-3 r = -.33.

The third observation is that the lagged depression to self-esteem correlations (e.g., de1 with se2) do not decrease from lag-1 to lag-2, although they do become weaker from lag-2 to lag-3, lag-1 r = -.44, lag-2 r = -.45, lag-3 r = -.35.

The fourth observation is that the lagged self-esteem to depression correlations (se1 with de2) are weaker than the lagged depression to self-esteem correlations (de1 with se2). This pattern is expected because self-esteem is more stable than depressive symptoms. As illustrated in the Figure below, the path from de1 to se4 is stronger than the path from se1 to de4 because the path from se1 to se4 is stronger than the path from de1 to de4.

Regression analysis or structural equation modeling is needed to examine whether there are any additional lagged effects of self-esteem on depressive symptoms. However, a cross-lagged effect of self-esteem on depression would produce a stronger se1-de4 correlation if the stabilities were equal or if the effect were strong enough to outweigh the difference in stability. So, a stronger lagged self-esteem to depression correlation than lagged depression to self-esteem correlation would imply a cross-lagged effect from self-esteem to depression, but the reverse pattern is inconclusive because self-esteem is more stable.

Like Orth et al. (2020), I found that Heise’s model did not converge. However, unlike Orth et al., I did not conclude from this finding that the CLPM is preferable. After all, it does not fit the data. Model convergence is sometimes simply a problem of default starting values that work for most models but not for all models. In this case, the high stability of self-esteem produced a problem with the default starting values. Just setting this starting value to 1 solved the convergence problem and produced a well-fitting result.
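For readers who encounter the same non-convergence, the fix is a small change in the syntax. In MPLUS, an asterisk after a parameter sets its starting value; the sketch below writes the autoregressive paths out as separate ON statements because I am not certain that starting values can be attached directly to a PON list.

sse2 ON sse1*1 (stability); ! start the autoregressive coefficient at 1
sse3 ON sse2*1 (stability); ! the shared label keeps the coefficients constrained to be equal
sse4 ON sse3*1 (stability);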

The model results show no negative lagged prediction of depression from self-esteem. In fact, a small positive relationship emerged, but it was not statistically significant.

It is instructive to compare these results with the CLPM results. The CLPM is nested in the Heise model. The only difference is that the occasion-specific variances of depression and self-esteem are fixed to zero. As these parameters were constrained across occasions, this model has two fewer parameters and the model df increase from 24 to 26. Model fit decreased in the more parsimonious model. However, the overall fit is not terrible, although RMSEA should be below .06 [interestingly, the CFI value changed from a value over .95 to a value of .94 when I estimated the model with MPLUS 8.2, whereas Orth et al. used MPLUS 8]. This shows the problem of relying on overall fit to endorse models. Overall fit is often good with longitudinal data because all models predict weaker correlations over longer time intervals. The direct model comparison shows that the Heise model is the better model.
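To see the nesting, note that the CLPM can be obtained from the Heise model by adding two statements that fix the occasion-specific variances to zero (a minimal sketch, using the variable names from the syntax above):

se1-se4@0; ! no occasion-specific variance in self-esteem
de1-de4@0; ! no occasion-specific variance in depression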

In the CLPM model, self-esteem is a negative lagged predictor of depression. This is the key finding that Orth and colleagues have been using to support the vulnerability model of depression (low self-esteem leads to depression).

Why does the CLPM produce negative lagged effects of self-esteem on depression? The reason is that the model underestimates the long-term stability of depression from time 1 to time 3 and time 4. To compensate for this, it can use self-esteem, which is more stable, and then link self-esteem at time 2 with depression at time 3 (.745 * -.191) and self-esteem at time 3 with depression at time 4 (.742 * .739 * -.190). But even this is not sufficient to compensate for the misprediction of depression over time. Hence the worse fit of the model. This can be seen by examining the model-reproduced correlation matrix in the MPLUS output.

Even with the additional cross-lagged path, the model predicts only a correlation of r = .157 from de1 to de4, while the observed correlation was r = .403. This discrepancy merely confirms what the univariate models showed. A model without occasion-specific variances underestimates long-term stability.

Interim Conclusion

Closer inspection of Orth et al.’s data shows that the CLPM does not fit the data. This is not surprising because it is well-known that the cross-lagged panel model often underestimates long-term stability. Even Orth has published univariate analyses of self-esteem that show a simple autoregressive model does not fit the data (Kuster & Orth, 2013). Here I showed that using the wrong model of stability creates statistical artifacts in the estimation of cross-lagged path coefficients. The only empirical support for the vulnerability model of depression is a statistical artifact.

Replication Study

I picked the My Work and I (MWI) dataset for a replication study. I picked it because it used the same measures and had a relatively large sample size (N = 663). However, the study is not an exact or direct replication of the previous one. One important difference is that measurements were repeated every two months rather than every year. The length of the time interval can influence the pattern of correlations.

There are two notable differences in the correlation table. First, the correlations increase with each measurement, from .782 for se1 with se2 to .871 for se4 with se5. This suggests a response artifact, such as a stereotypic response style that inflates consistency over time. This is more likely to happen for shorter intervals. Second, the differences between correlations with different lags are much smaller. They were .07 in the previous study. Here the differences are .02 to .03. This means there is hardly any autoregressive structure, suggesting that a trait model may fit the data better.

The pattern for depression is also different from the previous study. First, the correlations are stronger, which makes sense, because the retest interval is shorter. Somebody who suffers from depressive symptoms is more likely to still suffer two months later than a year later.

There is a clearer autoregressive structure for depression and no sign of stereotypic responding. The reason could be that depression was assessed with a symptom checklist that asks about the previous four weeks. As this question covers a new time period each time, participants may avoid stereotypic responding.

The depression-self-esteem correlations also become stronger (more negative) over time from r = -.538 to r = -.675. This means that a model with constrained coefficients may not fit the data.

The higher stability of depression explains why there is no longer a consistent pattern of stronger lagged depression to self-esteem correlations (de1 with se2) above the diagonal than self-esteem to depression correlations (se1 with de2) below the diagonal. Five correlations are stronger one way and five correlations are stronger the other way.

For self-esteem, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .170, CFI = .920). Allowing for occasion-specific variance improved fit and fit was excellent (RMSEA = .002, CFI = .999). For depression, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .113, CFI = .918). The model with occasion-specific variance fit better and had excellent fit (RMSEA = .029, CFI = .995). These results replicate the previous results and show that CLPM does not fit because it underestimates stability of self-esteem and depression.

The CLPM model also had bad fit in the original article (RMSEA = .105, CFI = .932). In comparison, the model with occasion specific variances had much better fit (RMSEA = .038, CFI = .991). Interestingly, this model did show a small, but statistically significant path from self-esteem to depression (effect size r = -.08). This raises the possibility that the vulnerability effect may exist for shorter time intervals of a few months, but not for longer time intervals of a year or more. However, Orth et al. do not consider this possibility. Rather, they try to justify the use of the CLPM to analyze panel data even though the model does not fit.

FITTING MODELS TO THEORIES RATHER THAN DATA

Orth et al. note “fit values were lowest for the CLPM” (p. 21) with a footnote that recognizes the problem of the CLPM, “As discussed in the Introduction, the CLPM underestimates the long-term stability of constructs, and this issue leads to misfit as the number of waves increases” (p. 63).

Orth et al. also note correctly that the cross-lagged effect of self-esteem on depression emerges more consistently with the CLPM model. By now it is clear why this is the case. It emerges consistently because it is a statistical artifact produced by the underestimation of stability in depression in the CLPM model. However, Orth et al.’s belief in the vulnerability effect is so strong that they are unable to come to a rational conclusion. Instead they propose that the CLPM model, despite its bad fit, shows something meaningful.

We argue that precisely because the prospective effects tested in the CLPM are also based on between-person variance, it may answer questions that cannot be assessed with models that focus on within-person effects. For example, consider the possible effects of warm parenting on children’s self-esteem (Krauss, Orth, & Robins, 2019): A cross-lagged effect in the CLPM would indicate that children raised by warm parents would be more likely to develop high self-esteem than children raised by less warm parents. A cross-lagged effect in the RI-CLPM would indicate that children who experience more parental warmth than usual at a particular time point will show a subsequent increase in self-esteem at the next time point, whereas children who experience less parental warmth than usual at a particular time point will show a subsequent drop in self-esteem at the next time point

Orth et al. then point out correctly that the CLPM is nested in other models and makes more restrictive assumptions about the absence of occasion specific variance or trait variance, but they convince themselves that this is not a problem.

As was evident also in the present analyses, the fit of the CLPM is typically not as good as the fit of the RI-CLPM (Hamaker et al., 2015; Masselink, Van Roekel, Hankin, et al., 2018). It is important to note that the CLPM is nested in the RI-CLPM (for further information about how the models examined in this research are nested, see Usami, Murayama, et al., 2019). That is, the CLPM is a special case of the RI-CLPM, where the variances of the two random intercept factors and the covariance between the random intercept factors are constrained to zero (thus, the CLPM has three additional degrees of freedom). Consequently, with increasing sample size, the RI-CLPM necessarily fits significantly better than the CLPM (MacCallum, Browne, & Cai, 2006). However, does this mean that the RI-CLPM should be preferred in model selection? Given that the two models differ in their conceptual meaning (see the discussion on between- and within-person effects above), we believe that the decision between the CLPM and RI-CLPM should not be based on model fit, but rather on theoretical considerations.

As shown here, the bad fit of CLPM is not an unfair punishment of a parsimonious model. The bad fit reveals that the model fails to model stability correctly. To disregard bad fit and to favor the more parsimonious model even if it doesn’t fit makes no sense. By the same logic, a model without cross-lagged paths would be more parsimonious than a model with cross-lagged paths and we could reject the vulnerability model simply because it is not parsimonious. For example, when I fitted the model with occasion specific variances and without cross-lagged paths, model fit was better than model fit of the CLPM (RMSEA = .041 vs. RMSEA = .109) and only slightly worse than model fit of the model with occasion specific variance and cross-lagged paths (RMSEA = .040).

It is incomprehensible to methodologists that anybody would try to argue in favor of a model that does not fit the data. If a model consistently produces bad fit, it is not a proper model of the data and has to be rejected. To prefer a model because it produces a consistent artifact that fits theoretical preferences is not science.

Replication II

Although the first replication mostly confirmed the results of the first study, one notable difference was the presence of statistically significant cross-lagged effects in the second study. There are a variety of explanations for this inconsistency. The lack of an effect in the first study could be a type-II error. The presence of an effect in the first replication study could be a type-I error. Finally, the difference in time intervals could be a moderator.

I picked the Your Personality (YP) dataset because it was the only dataset that used the same measures as the previous two studies. The time interval was 6 months, which is in the middle of the other two intervals. This made it interesting to see whether results would be more consistent with the 2-month or the 1-year intervals.

For self-esteem, the autoregressive model with occasion-specific variance had good fit to the data (RMSEA = .016, CFI = .999). Constraining the occasion-specific variance to zero reduced model fit considerably (RMSEA = .160, CFI = .912). Results for depression were unexpected. The model with occasion-specific variance showed non-significant and slightly negative residual state variances. This finding implies that there are no detectable changes in depression over time and that depression scores only have stable trait variance and occasion-specific variance. Thus, I fixed the autoregressive parameters to 1 and the residual state variances to zero. This model is equivalent to a model that specifies a trait factor. Even this model had barely acceptable fit (RMSEA = .062, CFI = .962). Fit could be increased by relaxing the constraints on the occasion-specific variance (RMSEA = .060, CFI = .978). However, a simple trait model fit the data even better (RMSEA = .000, CFI = 1.000). The lack of an autoregressive structure makes it implausible that there are cross-lagged effects on depression. If there is no new state variance, self-esteem cannot be a predictor of new state variance.
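To make the equivalence concrete, here is a hedged sketch of the trait specification for depression in MPLUS notation (de1-de4 is assumed for illustration; the number of waves in this dataset may differ):

tde BY de1-de4@1; ! a single trait factor with unit loadings
! this is the same as fixing the autoregressive paths to 1 and the residual state variances to 0;
! the residual variances of de1-de4 then capture the occasion-specific variance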

The presence of a trait factor for depression suggests that there could also be a trait factor for self-esteem and that some of the correlations between self-esteem and depression are due to correlated traits. Therefore, I added a trait factor to the measurement model of self-esteem. This model had good fit (RMSEA = .043, CFI = .993) and fit was superior to the CLPM (RMSEA = .123, CFI = .883). The model showed no significant cross-lagged effect from self-esteem to depression, and the parameter estimate was positive rather than negative, .07. This finding is not surprising given the lack of decreasing correlations over time for depression.

Replication III

The last openly shared datasets are from the California Families Project (CFP). I first examined the children’s data (CFP-C) because Orth et al. (2020) reported a significant vulnerability effect with the RI-CLPM.

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .108, CFI = .908). Even the model with occasion-specific variance had poor fit (RMSEA = .091, CFI = .945). In contrast, a model with a trait factor and without occasion specific variance had good fit (RMSEA = .023, CFI = .997). This finding suggests that it is necessary to include a stable trait factor to model stability of self-esteem correctly in this dataset.

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .104, CFI = .878). Even the model with occasion-specific variance had poor fit (RMSEA = .103, CFI = .897). Adding a trait factor produced a model with acceptable fit (RMSEA = .051, CFI = .983).

The trait-state model fit the data well (RMSEA = .032, CFI = .989) and much better than the CLPM (RMSEA = .079, CFI = .914). The autoregressive effect of self-esteem on depression was not significant and only half the size of the effect in the RI-CLPM (-.05 vs. -.09). The difference is due to the constraints on the trait factor. Relaxing these constraints improves model fit and the vulnerability effect becomes non-significant.

Replication IV

The last dataset is based on the mothers’ self-reports in the California Families Project (CFP-M).

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .139, CFI = .885). The model with occasion specific variance improved fit (RMSEA = .049, CFI = .988). However, the trait-state model had even better fit (RMSEA = .046, CFI = .993).

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .127, CFI = .880). The model with occasion-specific variance had excellent fit (RMSEA = .000, CFI = 1.000). The trait-state model also had excellent fit (RMSEA = .000, CFI = 1.000).

The CLPM had bad fit to the data (RMSEA = .092, CFI = .913). The Heise model improved fit (RMSEA = .038, CFI = .987). The trait-state model had even better fit (RMSEA = .031, CFI = .992). The cross-lagged effect of self-esteem on depression was negative, but small and not significant, -.05 (95%CI = -.13 to .02).

Simulation Study 1

The first simulation demonstrates that a cross-lagged effect emerges when the CLPM is fitted to data with a trait factor and one of the constructs has more trait variance, which produces more stability over time.

I simulated 64% trait variance and 36% occasion-specific variance for self-esteem.

I simulated 36% trait variance and 64% occasion-specific variance for depression.

The correlation between the two trait factors was r = -.7. This produced manifest correlations of r = -.7 * sqrt(.36) * sqrt(.64) = -.7 * .6 * .8 = -.34.
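The post does not show the simulation syntax. A minimal sketch of how such data could be generated with the MPLUS MONTECARLO facility is given below; the sample size, seed, and file names are my own choices, and the actual simulation setup may have differed.

MONTECARLO:
NAMES = se1-se4 de1-de4;
NOBSERVATIONS = 1000; ! assumed sample size
NREPS = 1;
SEED = 1234;
REPSAVE = ALL;
SAVE = sim*.dat; ! save the generated data for fitting the CLPM in a separate run
MODEL POPULATION:
tse BY se1-se4@.8; ! trait loading .8 -> 64% trait variance in self-esteem
tde BY de1-de4@.6; ! trait loading .6 -> 36% trait variance in depression
tse@1; tde@1;
tse WITH tde@-.7; ! trait correlation of -.7
se1-se4@.36; ! occasion-specific (residual) variance in self-esteem
de1-de4@.64; ! occasion-specific (residual) variance in depression
MODEL:
tse BY se1-se4*.8;
tde BY de1-de4*.6;
tse@1; tde@1;
tse WITH tde*-.7;
se1-se4*.36;
de1-de4*.64;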

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (). For depression, the autoregressive model without occasion-specific variance also had bad fit. The CLPM also had bad fit (RMSEA = .141, CFI = .820). Although the simulation did not include cross-lagged paths, the CLPM showed a significant cross-lagged effect from self-esteem to depression (-.25) and a weaker cross-lagged effect from depression to self-esteem (-.14).

Needless to say, the trait-state model had perfect fit to the data and showed cross-lagged path coefficients of zero.

This simulation shows that CLPM produces artificial cross-lagged effects because it underestimates long-term stability. This problem is well-known, but Orth et al. (2020) deliberately ignore it when they interpret cross-lagged parameters in CLPM with bad fit.

Simulation Study 2

The second simulation shows that a model with a significant cross-lagged path can fit the data if this path is actually present in the data. The cross-lagged effect was specified as a moderate effect with b = .3. Inspection of the correlation matrix shows the expected pattern that cross-lagged correlations from se to de (se1 with de2) are stronger than cross-lagged correlations from de to se (de1 with se2). The differences are strongest for lag-1.
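Relative to the first simulation, the population model now needs state factors with autoregressive paths for a cross-lagged effect to operate on. A hedged fragment of the additional MODEL POPULATION statements is shown below; only the b = .3 value comes from the text, and the rest of the setup (state factors sse1-sse4 and sde1-sde4, as in the Heise model syntax above) is my own illustration.

sde2 ON sse1*.3; ! cross-lagged effect of self-esteem on later depression
sde3 ON sse2*.3;
sde4 ON sse3*.3;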

The model with the cross-lagged paths had perfect fit (RMSEA = .000, CFI = 1.000). The model without cross-lagged paths had worse fit and RMSEA was above .06 (RMSEA = .073, CFI = .968).

Conclusion

The publication of Orth et al.'s (2020) article in JPSP is an embarrassment for the PPID section of JPSP. The authors did not make an innocent mistake. Their own analyses showed across 10 datasets that the CLPM does not fit their data. One would expect that a team of researchers would be able to draw the correct conclusion from this finding. However, the power of motivated reasoning is strong. Rather than admitting that the vulnerability model of depression is based on a statistical artifact, the authors try to rationalize why the model with bad fit should not be rejected.

The authors write “the CLPM findings suggest that individual differences in self-esteem predict changes in individual differences in depression, consistent with the vulnerability model” (p. 39).

This conclusion is blatantly false. A finding in a model with bad fit should never be interpreted. After all, the purpose of fitting models to data and examining model fit is to falsify models that are inconsistent with the data. However, psychologists have been brainwashed into thinking that the purpose of data analysis is only to confirm theoretical predictions and to ignore evidence that is inconsistent with theoretical models. It is therefore not a surprise that psychology has a theory crisis. Theories are nothing more than hunches that guided first explorations and are never challenged. Every discovery in psychology is considered to be true. This does not stop psychologists from developing and supporting contradictory models, which results in an ever-growing number of theories and confusion. It is like evolution without a selection mechanism. No wonder psychology is making little progress.

Numerous critics of psychology have pointed out that nil-hypothesis testing can be blamed for the lack of development because null-results are ambiguous. However, this excuse cannot be used here. Structural equation modeling is different from null-hypothesis testing because significant results like a high Chi-square value and derived fit indices provide clear and unambiguous evidence that a model does not fit the data. To ignore this evidence and to interpret parameters in these models is unscientific. The fact that authors, reviewers, and editors were willing to publish these unscientific claims in the top journal of personality psychology shows how poorly methods and statistics are understood by applied researchers. To gain respect and credibility, personality psychologists need to respect the scientific method.

Personality Science: The Science of Human Diversity

I wrote a textbook about personality psychology. The textbook is an e-textbook with online engagement for students. I am going to pilot the textbook this fall with my students and revise it with some additional chapters in 2021.

The book also provides an up-to-date review of the empirical literature. The content is freely accessible through a demo version of the course.

https://app.tophat.com/e/826754/assigned

Please provide comments, corrections, additional references, etc. in the comments section or email me directly at ulrich.schimmack@utoronto.ca

A review of “Low self-esteem prospectively predicts depression in adolescence and young adulthood”

In 2007, I was asked to review a manuscript about the relationship between self-esteem and depression. The authors used a cross-lagged panel model to examine “prospective prediction,” which is a code word for causal claims in a non-experimental study. The problem is that the cross-lagged panel model is fundamentally flawed because it ignores stable traits and underestimates stability. To compensate for this flaw, it uses cross-lagged paths, which produces false and inflated cross-lagged effects, especially from the more stable to the less stable construct.

I wrote a long and detailed review that was ignored by the editor and the authors, and the flawed cross-lagged panel model was published (Orth, Robins, & Roberts, 2008). The article served as the basis for several follow-up articles (Orth, Robins, Meier, & Conger, 2016; Rieger, Göllner, Trautwein, & Roberts, 2016; Orth, Robins, Widaman, & Conger, 2014; Orth & Robins, 2013; Sowislo & Orth, 2013; Kuster, Orth, & Meier, 2012; Orth, Robins, Trzesniewski, Maes, & Schmitt, 2009; Orth, Robins, & Meier, 2009), and the main author continues to push the flawed cross-lagged panel model (Orth, Clark, Donnellan, & Robins, 2020), although he himself published a model with a trait factor to explain stability in self-esteem (Kuster & Orth, 2013). It is scientifically unjustified to omit this trait factor from bivariate models that relate self-esteem to depression when ample evidence shows that a trait factor underlies the stability of self-esteem (Kuster & Orth, 2013). So, an entire literature is based on a statistical artifact that has been well known for four decades (Rogosa, 1980).

I just found my old review while looking through a folder called “file drawer” and I thought I would share it here. It just shows how peer-review doesn’t serve the purpose of quality control and that ambition often trumps the search for truth.

Review – Dec / 3 / 2007

This article tackles an important question: What is the causal relation between depression and self-esteem? As always, at the most abstract level there are three answers to this question. Low self-esteem causes depression. Depression causes low self-esteem. The correlation is due to a third unobserved variable. To complicate matters, these causal models are not mutually exclusive. It is possible that all three causal models contribute to the observed correlations between self-esteem and depression.

The authors hope to test causal models by means of longitudinal studies, and their empirical data are better than data from many previous studies to examine this question. However, the statistical analyses have some shortcomings that may lead to false inferences about causality.

The first important question is the definition of depression and self-esteem. Depression and self-esteem can be measured in different ways. Self-esteem measures can measure state self-esteem or self-esteem in general. Similarly, depression measures can ask about depressive symptoms over a short time interval (a few weeks) or dispositional depression. The nature of the measure will have a strong impact on the observed retest correlations, even after taking random measurement error into account.

In the empirical studies, self-esteem was measured with a questionnaire that asks about general tendencies (Rosenberg’s self-esteem scale). In contrast, depression was assessed by asking about symptoms within the preceding seven days (CES-D).  Surprisingly, Study 1 shows no differences in the retest correlations of depression and self-esteem. Less surprising is the fact that in the absence of different stabilities, cross-lagged effects are small and hardly different from each other, whereas Study 2 shows differences in stability and asymmetrical patterns of cross-lagged coefficients. This pattern of results suggests that the cross-lagged coefficients are driven by the stability of the measures (see Rogosa, 1980, for an excellent discussion of cross-lagged panel studies).

The good news is that the authors’ data are suitable to test alternative models. One important alternative model would be a model that postulates two latent dispositions for depression and self-esteem (not a single common factor). The latent dispositions would produce stability in depression and self-esteem over time. The lower retest correlations of depression would be due to more situational fluctuation of depressive symptoms. The model could allow for a correlation between the latent trait factors of depression and self-esteem. Based on Watson’s model, one would predict a very strong negative correlation between the two trait factors (but not as extreme as -1), while situational fluctuation of depression could be relatively weakly related to fluctuation in self-esteem.

The main difference between the cross-lagged model and the trait model concerns the pattern of correlations across different retest intervals. The cross-lagged model predicts a simplex structure (i.e., the magnitude of correlations decreases with increasing retest intervals). In contrast, the trait model predicts that retest correlations are unrelated to the length of the retest interval. With adequate statistical power, it is therefore possible to test these models against each other. With more complex statistical methods it is even possible to test a hybrid model that allows for all three causal effects (Kenny & Zautra, 1995).

The present manuscript simply presents one model with adequate model fit. However, model fit is largely influenced by the measurement model. The measurement model fits the data well because it is based on parcels (i.e., parcels are made to be parallel indicators of a construct and are bound to fit well). Therefore, the fit indices are insensitive to the specification of the longitudinal pattern of correlations. To illustrate, global fit is based on the fit to a correlation matrix with 276 parameters (3 indicators * 2 constructs * 4 waves = 24 indicators, 24 * 23 / 2 = 276 correlations). At the latent level, there are only 28 parameters (2 constructs * 4 waves = 8 latent factors, 8 * 7 / 2 = 28 parameters). The cross-lagged model constrains only 12 of these parameters (12 / 276 < 5%). Thus, the fit of the causal model should be evaluated in terms of the relative fit of the measurement model to the structural model. Table 2 shows the relevant information. Surprisingly, it shows only a difference of 6 degrees of freedom between Model 2 and 3, where I would have expected a difference of 12 degrees of freedom (?). More important, with six degrees of freedom, the chi-square difference is quite large (59). Although the chi-square test may be overly sensitive, it would be important to know why the model fit is not better. My guess is that the model underestimates long-term stability due to the failure to include a trait component. The same test for Study 2 suggests a better fit of the cross-lagged model in Study 2. However, even a good fit does not indicate that the model is correct. A trait model may fit the data as well or even better.

Regarding Study 1, the authors commit the common fallacy of interpreting null-results as evidence for the absence of an effect. Even if, in Study 1, self-esteem was a significant (p < .05) lagged predictor of depression and depression was not a significant (p > .05) lagged predictor of self-esteem, it is incorrect to conclude that self-esteem has an effect but depression does not have an effect. Indeed, given the small magnitude of the two effects (-.04 vs. -.10 in Figure 1), it is likely that these effects are not significantly different from each other (it is good practice in SEM studies to report confidence intervals, which would make it easier to interpret the results).

The limitation section does acknowledge that “the study designs do not allow for strong conclusions regarding the causal influence of self-esteem on depression.” However, without more detail and an explicit discussion of alternative models, the importance of this disclaimer in the fine print is lost on most readers unfamiliar with structural equation modeling, and the statement seems to contradict the conclusions drawn in the abstract and the causal interpretations of the results in the discussion (e.g., “Future research should seek to identify the mediating processes of the effect of self-esteem on depression”).

I have no theoretical reasons to favor any causal model. I am simply trying to point out that alternative models are plausible and likely to fit the data as well as those presented in the manuscript. At a minimum a revision should acknowledge this, and present the actual empirical data (correlation tables) to allow other researchers to test alternative models.
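The key prediction in this review (a simplex pattern versus constant retest correlations) is easy to make concrete. Here is a small R illustration with purely hypothetical values: a wave-to-wave stability of .8 for the autoregressive model and a constant trait correlation of .8 for the trait model.

```r
# Implied retest correlations under the two models, using hypothetical values:
# a wave-to-wave stability of .8 (autoregressive/simplex model) vs. a constant
# trait correlation of .8 (trait model).
lags <- 1:4

simplex_r <- 0.8 ^ lags    # correlations shrink as the retest interval grows
trait_r   <- rep(0.8, 4)   # correlations do not depend on the retest interval

round(rbind(lag = lags, simplex = simplex_r, trait = trait_r), 2)
# simplex: .80 .64 .51 .41   trait: .80 .80 .80 .80
```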

Why do men report higher levels of self-esteem than women?

Self-esteem is one of the most popular constructs in personality/social psychology. The common approach to study self-esteem is to give participants a questionnaire with one or more questions (items). To study gender differences, the scores of multiple items are added up or averaged separately for men and women, and then subtracted from each other. If this difference score is not zero, the data show a gender difference. Of course, the difference will never be exactly zero. So, it is impossible to confirm the nil-hypothesis that men and women are exactly the same. A more interesting question is whether gender differences in self-esteem are fairly consistent across different samples and how large gender differences, on average, are. To answer this question, psychologists conduct meta-analyses. A meta-analysis combines findings from small samples into one large sample.

The first comprehensive meta-analysis of self-esteem reported a small difference between men and women, with men reporting slightly higher levels of self-esteem than women (Kling et al., 1999). What does a small difference look like? First, imagine that you have to predict whether 50 men and 50 women are above or below the average (median) in self-esteem, but the only information that you have is their gender. If there were no difference between men and women, gender would give you no useful information and you might just flip a coin and have a 50% chance of guessing correctly. However, given the information that men are slightly more likely to be above average in self-esteem, you guess above average for men and below average for women. This blatant stereotype helps you to be correct 54% of the time, but you are still incorrect in your guesses 46% of the time.

Another way to get a sense of the magnitude of the effect size is to compare it to well-known, large gender differences. One of the largest gender differences that is also easy to measure is height. Men are 1.33 standard deviations taller than women, while the difference in self-esteem ratings is only 0.21 standard deviations. This means the difference in self-esteem is only about 16% of the difference in height.
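Both numbers are easy to verify in R, assuming normal distributions with equal variance in the two groups:

```r
d <- 0.21

# Chance of a correct above/below-the-median guess based on gender alone:
# each group's mean lies d/2 away from the combined median.
round(pnorm(d / 2), 2)   # about .54

# Gender difference in self-esteem relative to the gender difference in height.
round(d / 1.33, 2)       # about .16
```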

A more recent meta-analysis found an even smaller difference of d = .11 (Zuckerman & Hall, 2016). A small difference increases the possibility that gender differences in self-esteem ratings are even smaller, or even go in the opposite direction, in some populations. That is, while the difference in height is so large that it can be observed in all human populations, the difference in self-esteem is so small that it may not be universally observed.

Another problem with small effects is that they are more susceptible to the influence of systematic measurement error. Unfortunately, psychologists rarely examine the influence of measurement error on their measures. Thus, this possibility has not been explored.

Another problem is that psychologists tend to rely on convenience samples, which makes it difficult to generalize findings to the general population. For example, psychology undergraduate samples select for specific personality traits that may make male or female psychology students less representative of their respective gender.

It is therefore problematic to draw premature conclusions about gender differences in self-esteem on the basis of meta-analyses of self-esteem ratings in convenience samples.

What Explains Gender Differences in Self-Esteem Ratings?

The most common explanations for gender differences in self-esteem are gender roles (Zuckerman & Hall, 2016) or biological differences (Schmitt et al., 2016). However, there are few direct empirical tests of these hypotheses. Even biologically oriented researchers recognize that self-esteem is influenced by many different factors, including environmental ones. It is therefore unlikely that biological sex differences have a direct influence on self-esteem. A more plausible model would assume that gender differences in self-esteem are mediated by a trait that shows stronger gender differences and that predicts self-esteem. The same holds for social theories. It seems unlikely that women rely on gender stereotypes to evaluate their self-esteem. It is more plausible that they rely on attributes that show gender differences. For example, Western societies have different beauty standards for men and women, and women tend to report lower self-esteem in ratings of their attractiveness (Gentile et al., 2009). Thus, a logical next step is to test mediation models. Surprisingly, few studies have explored well-known predictors of self-esteem as potential mediators of gender differences in self-esteem.

Personality Traits and Self-Esteem

Since the 1980s, thousands of studies have measured personality from the perspective of the Five Factor Model. The Big Five capture variation in negative emotionality (Neuroticism), positive energy (Extraversion), curiosity and creativity (Openness), cooperation and empathy (Agreeableness), and goal-striving and impulse-control (Conscientiousness). Given the popularity of self-esteem and the Big Five in personality research, many studies have examined the relationship between the Big Five and self-esteem, while other studies have examined gender differences in the Big Five traits.

Studies of gender differences show the biggest and most consistent differences for neuroticism and agreeableness. Women tend to score higher on both dimensions than men. The results for the Big Five and self-esteem are more complicated. Simple correlations show that higher self-esteem is associated with lower Neuroticism and higher Extraversion, Openness, Agreeableness, and Conscientiousness (Robins et al., 2001). The problem is that Big Five measures have a denotative and an evaluative component. Being neurotic does not only mean responding more strongly with negative emotions; it is also undesirable. Using structural equation modeling, Anusic et al. (2009) separated the denotative and evaluative components and found that self-esteem was strongly related to the evaluative component of personality ratings. This evaluative factor in personality ratings was first discovered by Thorndike (1920) one hundred years ago. The finding that self-esteem is related to overly positive self-ratings of personality is also consistent with a large literature on self-enhancement. Individuals with high self-esteem tend to exaggerate their positive qualities.

Interestingly, there are very few published studies of gender differences in self-enhancement. One possible explanation for this is that there is only a weak relationship between gender and self-enhancement. The rationale is that gender is routinely measured and that many studies of self-enhancement could have examined gender differences. It is also well known that psychologists are biased against null-findings. Thus, ample data without publications suggest that there is no strong relationship. However, a few studies have found stronger self-enhancement for men than for women. For example, one study showed that men overestimate their intelligence more than women (von Stumm et al., 2011). There is also evidence that self-enhancement and halo predict biases in intelligence ratings (Anusic et al., 2009). However, it is not clear whether gender differences are related to halo or are specific to ratings of intelligence.

In short, a review of the literature on gender and personality and personality and self-esteem suggests three potential mediators of the gender differences in self-esteem. Men may report higher levels of self-esteem because they are lower in neuroticism, lower in agreeableness, or higher in self-enhancement.

Empirical Test of the Mediation Model

I used data from the Gosling–Potter Internet Personality Project (Gosling, Vazire, Srivastava, & John, 2004). Participants were visitors of a website who were interested in taking a personality test and receiving feedback about their personality. The advantage of this sampling approach is that it creates a very large dataset with millions of participants. The disadvantage is that men and women who visited this site might differ in personality traits or self-esteem. The questionnaire included a single-item measure of self-esteem. This item shows the typical gender difference in self-esteem (Bleidorn et al., 2016).

To separate descriptive factors of the Big Five from evaluative bias and acquiescence bias, I fitted a measurement model to the 44-item Big Five Inventory. I demonstrated earlier that this measurement model has good fit for Canadian participants (Schimmack, 2019). To test the mediation model, I added gender and self-esteem to the model. In this study, gender was measured with a simple dichotomous male vs. female question.

Gender was a predictor of all seven factors (Big Five + Halo + Acquiescence). Exploratory analyses examined whether gender had unique relationships with specific BFI items. These relationships could be due to unique relationships of gender with specific personality traits called facets. However, few notable relationships were observed. Self-esteem was predicted by the seven personality factors and gender. However, openness to experience showed only a weak relationship with self-esteem, and to stabilize the model, this path was fixed to zero.
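The logic of this model can be sketched in a few lines of lavaan syntax. This is a heavily simplified sketch, not the actual model: the seven predictors are treated as observed scores rather than as latent factors in the 44-item measurement model, and all variable names (male, n, e, o, a, c, halo, acq, se) are hypothetical.

```r
# Heavily simplified sketch of the mediation logic (not the actual model).
# In the real analysis, the Big Five, halo, and acquiescence are latent factors
# in the 44-item BFI measurement model; here they are stand-in observed scores.
library(lavaan)

med_model <- '
  # gender differences in the mediators
  n ~ a1*male;  e ~ a2*male;  o ~ male
  a ~ a3*male;  c ~ a4*male
  halo ~ a5*male;  acq ~ a6*male
  # mediators predicting self-esteem (openness omitted, as described above)
  se ~ b1*n + b2*e + b3*a + b4*c + b5*halo + b6*acq + direct*male
  # indirect and total effects of gender on self-esteem
  via_n    := a1*b1
  via_a    := a3*b3
  via_halo := a5*b5
  total    := a1*b1 + a2*b2 + a3*b3 + a4*b4 + a5*b5 + a6*b6 + direct
'
# A full model would also specify residual covariances among the mediators.
# fit <- sem(med_model, data = dat)   # dat: hypothetical data frame
# summary(fit, standardized = TRUE)
```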

I fitted the model to data from several nations. I selected nations with (a) a large amount of complete data (N = 10,000) and (b) familiarity with English as a first or common second language (e.g., India = yes, Japan = no), while trying to sample a diverse range of cultures, because gender differences in self-esteem tend to vary across cultures (Bleidorn et al., 2016; Zuckerman & Hall, 2016). I fitted the model to samples from four nations (US, Netherlands, India, and Philippines) with N = 10,000 for each nation. Table 1 shows the results.

The first two rows show the fit of the Canadian model to the other four nations. Fit is a bit lower for Asian samples, but still acceptable.

The results for sex differences in the Big Five are a bit surprising. Although all four samples show the typical gender difference in neuroticism, the effect sizes are relatively small. For agreeableness, the gender differences in the two Asian samples are negligible. This raises some concerns about the conclusion that gender differences in personality traits are universal and possibly due to evolved genetic differences (Schmitt et al., 2016). The most interesting novel finding is that there are no notable gender differences in self-enhancement. This also implies that self-enhancement cannot mediate gender differences in self-esteem.

The strongest predictor of self-esteem is self-enhancement. Effect sizes range from d = .27 in the Netherlands to d = .45 in the Philippines. The second strongest predictor is neuroticism. As neuroticism also shows consistent gender differences, neuroticism partially mediates the effect of gender on self-esteem. Although weak, agreeableness is a consistent negative predictor of self-esteem. This replicates Anusic et al.’s (2009) finding that the sign of the relationship reverses when halo bias in agreeableness ratings is removed from measures of agreeableness.

The total effects show the gender differences in the four samples. Consistent with the meta-analyses, the gender differences in self-esteem are weak, with effect sizes ranging from d = .05 to d = .15. Personality explains some of this relationship. The unexplained direct effect of gender is very small.

Conclusion

A large literature and several meta-analyses have documented small but consistent gender differences in self-ratings of self-esteem. Few studies have examined whether these differences are mere rating biases or tested causal models of these gender differences. This article addressed these questions by examining seven potential mediators: the Big Five traits as well as halo bias and acquiescence bias.

The results replicated previous findings that gender differences in self-esteem are small, d < .2. They also showed that neuroticism is a partial mediator of gender differences in self-esteem. Women tend to be more sensitive to negative information and this disposition predicts lower self-esteem. It makes sense that a general tendency to focus on negative information also extends to evaluations of the self. Women appear to be more self-critical than men. A second mediator was agreeableness. Women tend to be more agreeable, and agreeable people tend to have lower self-esteem. However, this relationship was only observed in the Western nations and not in the Asian nations. This cultural difference explains why gender differences in self-esteem tend to be stronger in Western than in Asian cultures. Finally, a general evaluative bias in self-ratings of personality was the strongest predictor of self-esteem, but showed no notable gender differences. Gender also still had a very small relationship with self-esteem after accounting for the personality mediators.

Overall, these results are more consistent with models that emphasize similarities between men and women (Men and Women are from Earth) than models that emphasize gender differences (Women are from Venus and Men are from Mars). Even if evolutionary theories of gender differences are valid, they explain only a small amount of the variance in personality traits and self-esteem. As one evolutionary psychologist put it, “it is undeniably true that men and women are more similar than different genetically, physically and psychologically” (p. 52). The results also undermine claims that women internalize negative stereotypes about themselves and have notably lower self-esteem as a result. Given the small effect sizes, it is surprising how much empirical and theoretical attention gender differences in self-esteem have received. One reason is that psychologists often ignore effect sizes and only care about the direction of an effect. Given the small effect size of gender on self-esteem, it seems more fruitful to examine factors that produce variation in self-esteem for men and women.

Lies, Damn Lies, and Experiments on Attitude Ratings

Ten years ago, social psychology had a once-in-a-lifetime opportunity to realize that most of their research is bullshit. Their esteemed colleague Daryl Bem published a hoax article about extrasensory perception in their esteemed Journal of Personality and Social Psychology. The editors felt compelled to write a soul-searching editorial about research practices in their field that could produce such nonsense results. However, 10 years later social psychologists continue to use the same questionable practices to publish bullshit results in JPSP. Moreover, they are willfully ignorant of any criticism of their field, which is producing mostly pseudo-scientific garbage. Just now, Wegener and Petty, two social psychologists at Ohio State University, wrote an article that downplays the importance of replication failures in social psychology. At the same time, they published a JPSP article that shows they haven’t learned anything from 10 years of discussion about research practices in psychology. I treat the first author as an innocent victim who is being trained in the dark art of research practices that have given us social priming, ego-depletion, and time-reversed sexual arousal.

The authors report seven studies. We don’t know how many other studies were run. The seven studies are standard experiments with one or two (2 x 2) experimental manipulations between subjects. The studies are quick online studies with Mturk samples. The main goal was to show that some experimental manipulations influence some ratings that are supposed to measure attitudes. Any causal effect on these measures is interpreted as a change in attitudes.

The problem for the authors is that their experimental manipulations have small effects on the attitude measures. So, individually, studies 1-6 would not show any effects. At no point did they consider this a problem and increase sample sizes. However, they were able to fix the problem by combining studies that were similar enough into one dataset. This was also done by Bem to produce significant results for time-reversed causality. It is not a good practice, but that doesn’t bother editors and reviewers at the top journal of social psychology. After all, they all do not know how to do science.

So, let’s forget about the questionable studies 1-6 and focus on the preregistered replication study with 555 Mturk workers (Study 7). The authors analyze their data with a mediation model and find statistically significant indirect effects. The problem with this approach is that mediation no longer has the internal validity of an experiment. Spurious relationships between mediators and the DV can inflate these indirect effects. So, it is also important to demonstrate that there is an effect by showing that the manipulation changed the DV (Baron & Kenny, 1986). The authors do not report this analysis. The authors also do not provide information about standardized effect sizes to evaluate the practical significance of their manipulation. However, the authors did provide covariance matrices in a supplement and I was able to run the analyses to get this information.

Here are the results.

The main effect for the bias manipulation is d = -.04, p = .38, 95%CI = -.12, .05

The main effect for the untrustworthiness manipulation is d = .01, p = .75, 95%CI = -.07, .10.

Both effects are not significant. Moreover, the effect sizes are so small and, thanks to the large sample size, the confidence intervals are so narrow that we can reject the hypothesis that the manipulations have even a small effect, d = .2.
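For anyone who wants to run this kind of check, here is a minimal sketch with placeholder data; the variable names (attitude, bias_manipulation) are hypothetical, and in practice one would analyze the reported covariance matrices or raw data.

```r
# Minimal sketch with placeholder data; in practice, use the study's raw data
# or the covariance matrices provided in the supplement.
set.seed(1)
dat <- data.frame(
  attitude          = rnorm(555),          # placeholder DV
  bias_manipulation = rbinom(555, 1, 0.5)  # 0 = control, 1 = bias condition
)

# Total effect of the manipulation on the DV (Baron & Kenny's first step)
summary(lm(attitude ~ bias_manipulation, data = dat))

# Standardized mean difference (the DV in SD units) with a 95% confidence interval
m <- lm(scale(attitude) ~ bias_manipulation, data = dat)
round(cbind(d = coef(m)[2], confint(m)[2, , drop = FALSE]), 2)

# If the 95% CI excludes both d = .2 and d = -.2, even a small effect can be ruled out.
```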

So, here we see the total failure of social psychologists to understand what they are doing and their inability to make a real contribution to the understanding of attitudes and attitude change. This didn’t stop Rich Petty from co-authoring an article about psychology’s contribution to addressing the Covid-19 pandemic. Now, it would be unfair to blame 150,000 deaths on social psychology, but it is a fact that 40 years of trivial experiments have done little to help us change real-world attitudes, such as attitudes towards wearing masks.

I can only warn young, idealistic students who are considering social psychology as a career path. I speak from experience. I was a young, idealistic student eager to learn about social psychology in the 1990s. If I could go back in time, I would have done something else with my life. In 2010, I thought social psychology might actually change for the better, but in 2020 it is clear that most psychologists want to continue with their trivial experiments that tell us nothing about social behaviour. If you just can’t help it and want to study social phenomena, I recommend personality psychology or other social sciences.