
JPSP:PPID = Journal of Pseudo-Scientific Psychology: Pushing Paradigms – Ignoring Data


Ulrich Orth, Angus Clark, Brent Donnellan, Richard W. Robins (DOI: 10.1037/pspp0000358) present 10 studies that show the cross-lagged panel model (CLPM) does not fit the data. This does not stop them from interpreting a statistical artifact of the CLPM as evidence for their vulnerability model of depression. Here I explain in great detail why the CLPM does not fit the data and why it creates an artifactual cross-lagged path from self-esteem to depression. It is sad that the authors, reviewers, and editors were blind to the simple truth that a bad-fitting model should be rejected and that it is unscientific to interpret parameters of models with bad fit. Ignorance of basic scientific principles in a high-profile article reveals poor training and understanding of the scientific method among psychologists. If psychology wants to gain respect and credibility, it needs to take scientific principles more seriously.


Psychology is in a crisis. Researchers are trained within narrow paradigms, methods, and theories that populate small islands of researchers. The aim is to grow the island and to become a leading and popular island. This competition between islands is rewarded by an incentive structure that imposes the reward structure of capitalism on science. The winner gets to dominate the top journals that are mistaken as outlets of quality. However, just like Coke is not superior to Pepsi (sorry Coke fans), the winner is not better than the losers. They are just market leaders for some time. No progress is being made because the dominant theories and practices are never challenged and replaced with superior ones. Even the past decade that has focused on replication failures has changed little in the way research is conducted and rewarded. Quantity of production is rewarded, even if the products fail to meet basic quality standards, as long as naive consumers of research are happy.

This post is about the lack of training in the analysis of longitudinal data with a panel structure. A panel study essentially repeats the measurement of one or several attributes several times. Nine years of undergraduate and graduate training leave most psychologists without any training in how to analyze these data. This explains why the cross-lagged panel model (CLPM) was criticized four decades ago (Rogosa, 1980), but researchers continue to use it with the naive assumption that it is a plausible model to analyze panel data. Critical articles are simply ignored. This is the preferred way of dealing with criticism by psychologists. Here, I provide a detailed critique of the CLPM using Orth et al.’s data and simulations.

Step 1: Examine your data

Psychologists are not trained to examine correlation matrices for patterns. They are trained to submit their data to pre-specified (cookie-cutter) models and hope that the data fit the model. Even if the model does not fit, results are interpreted because researchers are not trained in modifying cookie cutter models to explore reasons for bad fit. To understand why a model does not fit the data, it is useful to inspect the actual pattern of correlations.

To illustrate the benefits of visual inspection of the actual data, I am using the data from the Berkeley Longitudinal Study (BLS), which is the first dataset listed in Orth et al.’s (2020) table that lists 10 datasets.

To ease interpretation, I break up the correlation table into three components, namely (a) correlations among self-esteem measures (se1-se4 with se1-se4), (b) correlations among depression measures (de1-de4 with de1-de4), and (c) correlations of self-esteem measures with depression measures (se1-se4 with de1-de4).

Table 1

Table 1 shows the correlation matrix for the four repeated measurements of self-esteem. The most important information in this table is how much the magnitude of the correlations decreases along the diagonals that represent different time lags. For example, the lag-1 correlations are .76, .79, and .74, which approximately average to a value of .76. The lag-2 correlations are .65 and .69, which averages to .67. The lag-3 correlation is .60.

The first observation is that correlations are getting weaker as the time-lag gets longer. This is what we would expect from a model that assumes self-esteem actually changes over time, rather than just fluctuating around a fixed set-point. The latter model implies that retest correlations remain the same over different time lags. So, we do have evidence that self-esteem changes over time, as predicted by the cross-lagged panel model.

The next question is how much retest correlations decrease with increasing time lags. The difference from lag-1 to lag-2 is .74 – .67 = .07. The difference from lag-2 to lag-3 is .67 – .60, which is also .07. This shows no leveling off of the decrease in these data. It is possible that the next wave would produce a lag-4 correlation of .53, which would be .07 lower than the lag-3 correlation. However, a difference of .07 is not very different from 0, which would imply that change asymptotes at .60. The data are simply insufficient to provide strong information about this.

The third observation is that the lag-2 correlation is much stronger than the square of the lag-1 correlation, .67 > .74^2 = .55. Similarly, the lag-3 correlation is stronger than the product of the lag-1 and lag-2 correlations, .60 > .74 * .67 = .50. This means that a simple autoregressive model with observed variables does not fit the data. However, this is exactly the model of Orth et al.’s CLPM.
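The comparison above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not part of the original analysis. It assumes a first-order autoregressive model with a constant stability coefficient, so that the implied lag-k correlation is the lag-1 correlation raised to the power k. The helper name `ar1_implied` is made up for illustration. Note that the text multiplies the observed lag-1 and lag-2 correlations to get .50 for lag-3, whereas a strict autoregressive chain implies the even lower value .74^3.

```python
def ar1_implied(lag1, max_lag):
    """Correlations implied by a first-order autoregressive model with
    observed variables and a constant stability coefficient (assumption)."""
    return [lag1 ** k for k in range(1, max_lag + 1)]

# Values quoted in the text for the self-esteem measures.
observed = {1: 0.74, 2: 0.67, 3: 0.60}

implied = ar1_implied(observed[1], 3)
for lag, r_model in enumerate(implied, start=1):
    gap = observed[lag] - r_model
    print(f"lag-{lag}: observed {observed[lag]:.2f}, "
          f"implied {r_model:.2f}, gap {gap:+.2f}")
```

The gap grows with the lag, which is the signature of a model that predicts more change than the data show.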

It is easy to examine the fit of this part of the CLPM model, by fitting an autoregressive model to the self-esteem panel data.

se2-se4 PON se1-se3; ! This command regresses each measure on the previous measure (n on n-1).
! There is one thing I learned from Orth et al., and it was the PON command of MPLUS

Table 2

Table 2 shows the fit of the autoregressive model. While CFI meets the conventional threshold of .95 (higher is better), RMSEA shows terrible fit of the model (.06 or lower are considered acceptable). This is a problem for cookie-cutter researchers who think CLPM is a generic model that fits all data. Here we see that the model makes unrealistic assumptions and we already know what the problem is based on our inspection of the correlation table. The model predicts more change than the data actually show. We are therefore in a good position to reject the CLPM as a viable model for these data. This is actually a positive outcome. The biggest problem in correlational research are data that fit all kinds of models. Here we have data that actually disconfirm some models. Progress can be made, but only if we are willing to abandon the CLPM.

Now let’s take a look at the depression data, following the same steps as for the self-esteem data.

Table 3

The average lag-1 correlation is .43. The average lag-2 correlation is .45, and the lag-3 correlation is .40. These results are problematic for an autoregressive model because the lag-2 correlation is not even lower than the lag-1 correlation.

Once more, it is hard to tell whether retest correlations are approaching an asymptote. In this case, the lag-1 minus lag-2 difference is -.02 and the lag-2 minus lag-3 difference is .05.

Finally, it is clear that the autoregressive model with manifest variables overestimates change. The lag-2 correlation is stronger than the square of the lag-1 correlations, .45 > .43^2 = .18, and the lag-3 correlation is stronger than the lag-1 * lag-2 correlation, .40 > .43*.45 = .19.

Given these results, it is not surprising that the autoregressive model fits the depression data even worse than the self-esteem data (Table 4).

de2-de4 PON de1-de3; ! regress each depression measure on the previous one.

Table 4

Even the CFI value is now in the toilet and the RMSEA value is totally unacceptable. Thus, the basic model of stability and change implemented in CLPM is inconsistent with the data. Nobody should proceed to build a more complex, bivariate model if the univariate models are inconsistent with the data. The only reason why psychologists do so all the time is that they do not think about CLPM as a model. They think CLPM is like a t-test that can be fitted to any panel data without thinking. No wonder psychology is not making any progress.

Step 2: Find a Model That Fits the Data

The second step may seem uncontroversial. If one model does not fit the data, there is probably another model that does fit the data and this model has a higher chance of being the model that reflects the causal processes that produced the data. However, psychologists have an uncanny ability to mess up even the simplest steps in data analysis. They have convinced themselves that it is wrong to fit models to data. The model has to come first so that the results can be presented as confirming a theory. However, what is the theoretical rationale of the CLPM? It is not motivated by any theory of development, stability, or change. It is as atheoretical as any other model. It only has the advantage that it became popular on an island of psychology and now people use it without being questioned about it. Convention and conformity are not pillars of science.

There are many alternative models to the CLPM that can be tried. One model is about 50 years old and was introduced by Heise (1969). It is also an autoregressive model, but it also allows for occasion-specific variance. That is, some factors may temporarily change individuals’ self-esteem or depression without any lasting effects on future measurements. This is a particularly appealing idea for a symptom checklist of depression that asks about depressive symptoms in the past four weeks. Maybe somebody’s cat died or it was a midterm period and depressive symptoms were present for a brief period, but these factors have no influence on depressive symptoms a year later.

I first fitted Heise’s model to the self-esteem data.

sse1 BY se1@1;
sse2 BY se2@1;
sse3 BY se3@1;
sse4 BY se4@1;
sse2-sse4 PON sse1-sse3 (stability);
se1-se4 (se_osv); ! occasion specific variance in self-esteem

Model fit for this model is perfect. Even the chi-square test is not significant (which in SEM is a good thing, because it means the model closely fits the data).

Model results show that there is significant occasion-specific variance. After taking this variance into account, the stability of the variance that is not occasion-specific, called state variance by Heise, is around r = .9 from one occasion to the next.
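To see why this model can reproduce the slow decay of the retest correlations, here is a small Python sketch of the correlations it implies. It assumes constant variances over time, so that the lag-k correlation is the share of non-occasion-specific variance times the state stability raised to the power k. The stability of .9 comes from the results above. The share of .84 is a hypothetical value picked for illustration and is not an estimate from the data, and `heise_implied` is a made-up helper name.

```python
def heise_implied(reliability, stability, lag):
    """Retest correlation implied by a latent autoregressive state that is
    measured with occasion-specific variance (constant variances assumed)."""
    return reliability * stability ** lag

STABILITY = 0.90     # state stability reported in the text
RELIABILITY = 0.84   # hypothetical share of non-occasion-specific variance

heise = [heise_implied(RELIABILITY, STABILITY, k) for k in (1, 2, 3)]
for lag, r_model in enumerate(heise, start=1):
    print(f"lag-{lag}: implied retest correlation {r_model:.3f}")
```

Under these assumptions the implied values are about .76, .68, and .61, which is close to the observed pattern for self-esteem, whereas an autoregressive model without occasion-specific variance has to drop much faster.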

Fit for the depression data is also perfect.

There is even more occasion-specific variance in depressive symptoms, but the non-occasion-specific variance is even more stable than the non-occasion-specific variance in self-esteem.

These results make perfect sense if we think about the way self-esteem and depression are measured. Self-esteem is measured with a trait measure of how individuals see themselves in general, ignoring ups and downs and temporary shifts in self-esteem. In contrast, depression is assessed with questions about a specific time period and respondents are supposed to focus on their current ups and downs. Their general disposition should be reflected in these judgments only to the extent that it influences their actual symptoms in the past weeks. These episodic measures are expected to have more occasion specific variance if they are valid. These results show that participants are responding to the different questions in different ways.

In conclusion, model fit and the results favor Heise’s model over the cookie-cutter CLPM.

Step 3: Putting the two autoregressive models together

Let’s first examine the correlations of self-esteem measures with depression measures.

The first observation is that the same-occasion correlations are stronger (more negative) than the cross-occasion correlations. This suggests that occasion specific variance in self-esteem is correlated with occasion specific variance in depression.

The second observation is that the lagged self-esteem to depression correlations (e.g., se1 with de2) do not become weaker (less negative) with increasing time lag, lag-1 r = -.36, lag-2 r = -.32, lag-3 r = -.33.

The third observation is that the lagged depression to self-esteem correlations (e.g., de1 with se2) do not decrease from lag-1 to lag-2, although they do become weaker from lag-2 to lag-3, lag-1 r = -.44, lag-2 r = -.45, lag-3 r = -.35.

The fourth observation is that the lagged self-esteem to depression correlations (se1 with de2) are weaker than the lagged depression to self-esteem correlations (de1 with se2). This pattern is expected because self-esteem is more stable than depressive symptoms. As illustrated in the Figure below, the path from de1 to se4 is stronger than the path from se1 to de4 because the path from se1 to se4 is stronger than the path from de1 to de4.
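The tracing argument can be made concrete with a short Python sketch. It assumes a model without any cross-lagged paths and without correlated residuals, so that a lagged cross-construct correlation is simply the time-1 correlation multiplied by the stability of the later-measured construct. The lag-3 stabilities of .60 and .40 are the values reported above. The time-1 correlation of -.50 is a hypothetical value for illustration only.

```python
# Lag-3 retest correlations reported in the text.
STABILITY_SE = 0.60   # se1 with se4
STABILITY_DE = 0.40   # de1 with de4

# Hypothetical same-occasion correlation at time 1 (not from the data).
R_SE1_DE1 = -0.50

# Path tracing without cross-lagged paths:
# de1 -> (correlation) -> se1 -> (stability) -> se4
r_de1_se4 = R_SE1_DE1 * STABILITY_SE
# se1 -> (correlation) -> de1 -> (stability) -> de4
r_se1_de4 = R_SE1_DE1 * STABILITY_DE

print(f"de1 with se4: {r_de1_se4:.2f}")
print(f"se1 with de4: {r_se1_de4:.2f}")
```

The depression to self-esteem correlation comes out stronger even though no construct influences the other, simply because self-esteem is the more stable variable.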

Regression analysis or structural equation modeling is needed to examine whether there are any additional lagged effects of self-esteem on depressive symptoms. However, a strong cross-lagged path from se1 to de4 would produce a stronger correlation of se1 and de4, if stability were equal or if the effect is strong. So, a stronger lagged self-esteem to depression correlation than a lagged depression to self-esteem correlation would imply a cross-lagged effect from self-esteem to depression, but the reverse pattern is inconclusive because self-esteem is more stable.

Like Orth et al. (2020) I found that Heise’s model did not converge. However, unlike Orth et al. I did not conclude from this finding that the CLPM model is preferable. After all, it does not fit the data. Model convergence is sometimes simply a problem of default starting values that work for most models but not for all models. In this case, the high stability of self-esteem produced a problem with default starting values. Just setting this starting value to 1 solved the convergence problem and produced a well-fitting result.

The model results show no negative lagged prediction of depression from self-esteem. In fact, a small positive relationship emerged, but it was not statistically significant.

It is instructive to compare these results with the CLPM results. The CLPM model is nested in the Heise model. The only difference is that the occasion-specific variances of depression and self-esteem are fixed to zero. As these parameters were constrained across occasions, this model has two fewer parameters and the model df increase from 24 to 26. Model fit decreased in the more parsimonious model. However, the overall fit is not terrible, although RMSEA should be below .06 [Interestingly, the CFI value changed from a value over .95 to a value .94 when I estimated the model with MPLUS8.2, whereas Orth et al. used MPLUS8]. This shows the problem of relying on overall fit to endorse models. Overall fit is often good with longitudinal data because all models predict weaker correlations over longer time intervals. The direct model comparison shows that the Heise model is the better model.
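For readers who want to turn the direct model comparison into a test, the following Python sketch shows the reference value for a chi-square difference test with the two degrees of freedom that separate the models (26 versus 24). It uses the closed form of the chi-square distribution with 2 degrees of freedom, so no statistics library is needed. The function names are made up for illustration, and no chi-square values from the actual models are used. Because the constraint fixes variances to zero, which is on the boundary of the parameter space, the naive test is usually treated as conservative.

```python
import math

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variate with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

def chi2_critical_df2(alpha):
    """Critical value of a chi-square variate with 2 degrees of freedom."""
    return -2.0 * math.log(alpha)

df_diff = 26 - 24                    # df of the CLPM minus df of the Heise model
critical = chi2_critical_df2(0.05)   # chi-square difference needed for p < .05

print(f"df difference: {df_diff}")
print(f"critical chi-square difference at alpha = .05: {critical:.2f}")
```

A chi-square difference larger than this value would indicate that fixing the occasion-specific variances to zero makes the model fit significantly worse.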

In the CLPM model, self-esteem is a negative lagged predictor of depression. This is the key finding that Orth and colleagues have been using to support the vulnerability model of depression (low self-esteem leads to depression).

Why does the CLPM model produce negative lagged effects of self-esteem on depression? The reason is that the model underestimates the long-term stability of depression from time 1 to time 3 and time 4. To compensate for this it can use self-esteem that is more stable and then link self-esteem at time 2 with depression at time 3 (.745 * -.191) and self-esteem at time 3 with depression at time 4 (.742 * .739 * -.190). But even this is not sufficient to compensate for the misprediction of depression over time. Hence, the worse fit of the model. This can be seen by examining the model reproduced correlation matrix in the MPLUS Tech1 output.

Even with the additional cross-lagged path, the model predicts only a correlation of r = .157 from de1 to de4, while the observed correlation was r = .403. This discrepancy merely confirms what the univariate models showed. A model without occasion-specific variances underestimates long-term stability.

Interim Conclusion

Closer inspection of Orth et al.’s data shows that the CLPM does not fit the data. This is not surprising because it is well-known that the cross-lagged panel model often underestimates long-term stability. Even Orth has published univariate analyses of self-esteem that show a simple autoregressive model does not fit the data (Kuster & Orth, 2013). Here I showed that using the wrong model of stability creates statistical artifacts in the estimation of cross-lagged path coefficients. The only empirical support for the vulnerability model of depression is a statistical artifact.

Replication Study

I picked the My Work and I (MWI) dataset for a replication study. I picked it because it used the same measures and had a relatively large sample size (N = 663). However, the study is not an exact or direct replication of the previous one. One important difference is that measurements were repeated every two months rather than every year. The length of the time interval can influence the pattern of correlations.

There are two notable differences in the correlation table. First, the correlations increase with each measurement from .782 for se1 with se2 to .871 for se4 with se5. This suggests a response artifact, such as a stereotypic response style that inflates consistency over time. This is more likely to happen for shorter intervals. Second, the differences between correlations with different lags are much smaller. They were .07 in the previous study. Here the differences are .02 to .03. This means there is hardly any autoregressive structure, suggesting that a trait model may fit the data better.

The pattern for depression is also different from the previous study. First, the correlations are stronger, which makes sense, because the retest interval is shorter. Somebody who suffers from depressive symptoms is more likely to still suffer two months later than a year later.

There is a clearer autoregressive structure for depression and no sign of stereotypic responding. The reason could be that depression was assessed with a symptom checklist that asks about the previous four weeks. As this question covers a new time period each time, participants may avoid stereotypic responding.

The depression-self-esteem correlations also become stronger (more negative) over time from r = -.538 to r = -.675. This means that a model with constrained coefficients may not fit the data.

The higher stability of depression explains why there is no longer a consistent pattern of stronger lagged depression to self-esteem correlations (de1 with se2) above the diagonal than self-esteem to depression correlations (se1 with de2) below the diagonal. Five correlations are stronger one way and five correlations are stronger the other way.

For self-esteem, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .170, CFI = .920). Allowing for occasion-specific variance improved fit and fit was excellent (RMSEA = .002, CFI = .999). For depression, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .113, CFI = .918). The model with occasion-specific variance fit better and had excellent fit (RMSEA = .029, CFI = .995). These results replicate the previous results and show that CLPM does not fit because it underestimates stability of self-esteem and depression.

The CLPM model also had bad fit in the original article (RMSEA = .105, CFI = .932). In comparison, the model with occasion specific variances had much better fit (RMSEA = .038, CFI = .991). Interestingly, this model did show a small, but statistically significant path from self-esteem to depression (effect size r = -.08). This raises the possibility that the vulnerability effect may exist for shorter time intervals of a few months, but not for longer time intervals of a year or more. However, Orth et al. do not consider this possibility. Rather, they try to justify the use of the CLPM to analyze panel data even though the model does not fit.


Orth et al. note “fit values were lowest for the CLPM” (p. 21) with a footnote that recognizes the problem of the CLPM, “As discussed in the Introduction, the CLPM underestimates the long-term stability of constructs, and this issue leads to misfit as the number of waves increases” (p. 63).

Orth et al. also note correctly that the cross-lagged effect of self-esteem on depression emerges more consistently with the CLPM model. By now it is clear why this is the case. It emerges consistently because it is a statistical artifact produced by the underestimation of stability in depression in the CLPM model. However, Orth et al.’s belief in the vulnerability effect is so strong that they are unable to come to a rational conclusion. Instead they propose that the CLPM model, despite its bad fit, shows something meaningful.

“We argue that precisely because the prospective effects tested in the CLPM are also based on between-person variance, it may answer questions that cannot be assessed with models that focus on within-person effects. For example, consider the possible effects of warm parenting on children’s self-esteem (Krauss, Orth, & Robins, 2019): A cross-lagged effect in the CLPM would indicate that children raised by warm parents would be more likely to develop high self-esteem than children raised by less warm parents. A cross-lagged effect in the RI-CLPM would indicate that children who experience more parental warmth than usual at a particular time point will show a subsequent increase in self-esteem at the next time point, whereas children who experience less parental warmth than usual at a particular time point will show a subsequent drop in self-esteem at the next time point.”

Orth et al. then point out correctly that the CLPM is nested in other models and makes more restrictive assumptions about the absence of occasion specific variance or trait variance, but they convince themselves that this is not a problem.

“As was evident also in the present analyses, the fit of the CLPM is typically not as good as the fit of the RI-CLPM (Hamaker et al., 2015; Masselink, Van Roekel, Hankin, et al., 2018). It is important to note that the CLPM is nested in the RI-CLPM (for further information about how the models examined in this research are nested, see Usami, Murayama, et al., 2019). That is, the CLPM is a special case of the RI-CLPM, where the variances of the two random intercept factors and the covariance between the random intercept factors are constrained to zero (thus, the CLPM has three additional degrees of freedom). Consequently, with increasing sample size, the RI-CLPM necessarily fits significantly better than the CLPM (MacCallum, Browne, & Cai, 2006). However, does this mean that the RI-CLPM should be preferred in model selection? Given that the two models differ in their conceptual meaning (see the discussion on between- and within-person effects above), we believe that the decision between the CLPM and RI-CLPM should not be based on model fit, but rather on theoretical considerations.”

As shown here, the bad fit of CLPM is not an unfair punishment of a parsimonious model. The bad fit reveals that the model fails to model stability correctly. To disregard bad fit and to favor the more parsimonious model even if it doesn’t fit makes no sense. By the same logic, a model without cross-lagged paths would be more parsimonious than a model with cross-lagged paths and we could reject the vulnerability model simply because it is not parsimonious. For example, when I fitted the model with occasion specific variances and without cross-lagged paths, model fit was better than model fit of the CLPM (RMSEA = .041 vs. RMSEA = .109) and only slightly worse than model fit of the model with occasion specific variance and cross-lagged paths (RMSEA = .040).

It is incomprehensible to methodologists that anybody would try to argue in favor of a model that does not fit the data. If a model consistently produces bad fit, it is not a proper model of the data and has to be rejected. To prefer a model because it produces a consistent artifact that fits theoretical preferences is not science.

Replication II

Although the first replication mostly confirmed the results of the first study, one notable difference was the presence of statistically significant cross-lagged effects in the second study. There are a variety of explanations for this inconsistency. The lack of an effect in the first study could be a type-II error. The presence of an effect in the first replication study could be a type-I error. Finally, the difference in time intervals could be a moderator.

I picked the Your Personality (YP) dataset because it was the only dataset that used the same measures as the previous two studies. The time interval was 6 months, which is in the middle of the other two intervals. This made it interesting to see whether results would be more consistent with the 2-month or the 1-year intervals.

For self-esteem, the autoregressive model with occasion specific variance had a good fit to the data (RMSEA = .016, CFI = .999). Constraining the occasion specific variance to zero reduced model fit considerably (RMSEA = .160, CFI = .912). Results for depression were unexpected. The model with occasion specific variance showed non-significant and slightly negative residuals for the state variances. This finding implies that there are no detectable changes in depression over time and that depression scores only have a stable trait and occasion specific variance. Thus, I fixed the autoregressive parameters to 1 and the residual state variances to zero. This model is equivalent to a model that specifies a trait factor. Even this model had barely acceptable fit (RMSEA = .062, CFI = .962). Fit could be increased by relaxing the constraints on the occasion specific variance (RMSEA = .060, CFI = .978). However, a simple trait model fit the data even better (RMSEA = .000, CFI = 1.000). The lack of an autoregressive structure makes it implausible that there are cross-lagged effects on depression. If there is no new state variance, self-esteem cannot be a predictor of new state variance.

The presence of a trait factor for depression suggests that there could also be a trait factor for self-esteem and that some of the correlations between self-esteem and depression are due to correlated traits. Therefore I added a trait factor to the measurement model of self-esteem. This model had good fit (RMSEA = .043, CFI = .993) and fit was superior to the CLPM (RMSEA = .123, CFI = .883). The model showed no significant cross-lagged effect from self-esteem to depression and the parameter estimate was positive rather than negative, .07. This finding is not surprising given the lack of decreasing correlations over time for depression.

Replication III

The last openly shared datasets are from the California Families Project (CFP). I first examined the children’s data (CFP-C) because Orth et al. (2020) reported a significant vulnerability effect with the RI-CLPM.

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .108, CFI = .908). Even the model with occasion-specific variance had poor fit (RMSEA = .091, CFI = .945). In contrast, a model with a trait factor and without occasion specific variance had good fit (RMSEA = .023, CFI = .997). This finding suggests that it is necessary to include a stable trait factor to model stability of self-esteem correctly in this dataset.

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .104, CFI = .878). Even the model with occasion-specific variance had poor fit (RMSEA = .103, CFI = .897). Adding a trait factor produced a model with acceptable fit (RMSEA = .051, CFI = .983).

The trait-state model fit the data well (RMSEA = .032, CFI = .989) and much better than the CLPM (RMSEA = .079, CFI = .914). The cross-lagged effect of self-esteem on depression was not significant, and only about half the size of the effect in the RI-CLPM (-.05 vs. -.09). The difference is due to the constraint on the trait factor. Relaxing these constraints improves model fit and the vulnerability effect becomes non-significant.

Replication IV

The last dataset is based on the mothers’ self-reports in the California Families Project (CFP-M).

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .139, CFI = .885). The model with occasion specific variance improved fit (RMSEA = .049, CFI = .988). However, the trait-state model had even better fit (RMSEA = .046, CFI = .993).

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .127, CFI = .880). The model with occasion-specific variance had excellent fit (RMSEA = .000, CFI = 1.000). The trait-state model also had excellent fit (RMSEA = .000, CFI = 1.000).

The CLPM had bad fit to the data (RMSEA = .092, CFI = .913). The Heise model improved fit (RMSEA = .038, CFI = .987). The trait-state model had even better fit (RMSEA = .031, CFI = .992). The cross-lagged effect of self-esteem on depression was negative, but small and not significant, -.05 (95%CI = -.13 to .02).

Simulation Study 1

The first simulation demonstrates that a cross-lagged effect emerges when the CLPM is fitted to data with a trait factor and one of the constructs has more trait variance which produces more stability over time.

I simulated 64% trait variance and 36% occasion-specific variance for self-esteem.

I simulated 36% trait variance and 64% occasion-specific variance for depression.

The correlation between the two trait factors was r = -.7. This produced manifest correlations of r = -.7*sqrt(.36)*sqrt(.64) = -.7 * .6 * .8 = -.34.

For self-esteem, the autoregressive model without occasion-specific variance had bad fit. For depression, the autoregressive model without occasion-specific variance also had bad fit. The CLPM model also had bad fit (RMSEA = .141, CFI = .820). Although the simulation did not include cross-lagged paths, the CLPM showed a significant cross-lagged effect from self-esteem to depression (-.25) and a weaker cross-lagged effect from depression to self-esteem (-.14).
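The origin of this artifact can be traced with simple two-wave arithmetic. The following Python sketch is not the simulation itself. It takes the population correlations implied by the trait variances above and computes the standardized weights of a regression of each time-2 measure on both time-1 measures, which is what the CLPM does at each step. It assumes that occasion-specific components are uncorrelated across constructs, so that same-occasion and lagged cross-construct correlations are equal. The helper `two_predictor_betas` is a made-up name.

```python
import math

def two_predictor_betas(r_y1, r_y2, r_12):
    """Standardized regression weights of y on predictors 1 and 2."""
    denom = 1.0 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom

# Simulation settings from the text.
TRAIT_SE, TRAIT_DE, TRAIT_R = 0.64, 0.36, -0.70

# With a pure trait model, retest correlations equal the trait variance and
# every self-esteem/depression correlation equals the same manifest value.
r_se_de = TRAIT_R * math.sqrt(TRAIT_SE) * math.sqrt(TRAIT_DE)

# de2 regressed on de1 (autoregressive) and se1 (cross-lagged).
de_auto, se_to_de = two_predictor_betas(TRAIT_DE, r_se_de, r_se_de)
# se2 regressed on se1 (autoregressive) and de1 (cross-lagged).
se_auto, de_to_se = two_predictor_betas(TRAIT_SE, r_se_de, r_se_de)

print(f"manifest correlation: {r_se_de:.3f}")
print(f"se1 -> de2: {se_to_de:.3f}")
print(f"de1 -> se2: {de_to_se:.3f}")
```

Under these assumptions the spurious paths are about -.24 and -.14, close to the estimates that the CLPM produced for the simulated data, with the larger artifact pointing from the more stable to the less stable construct.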

Needless to say, the trait-state model had perfect fit to the data and showed cross-lagged path coefficients of zero.

This simulation shows that CLPM produces artificial cross-lagged effects because it underestimates long-term stability. This problem is well-known, but Orth et al. (2020) deliberately ignore it when they interpret cross-lagged parameters in CLPM with bad fit.

Simulation Study 2

The second simulation shows that a model with a significant cross-lagged path can fit the data, if this path is actually present in the data. The cross-lagged effect was specified as a moderate effect with b = .3. Inspection of the correlation matrix shows the expected pattern that cross-lagged correlations from se to de (se1 with de2) are stronger than cross-lagged correlations from de to se (se2 with de1). The differences are strongest for lag-1.

The model with the cross-lagged paths had perfect fit (RMSEA = .000, CFI = 1.000). The model without cross-lagged paths had worse fit and RMSEA was above .06 (RMSEA = .073, CFI = .968).
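The expected asymmetry in the correlation matrix can also be derived without simulating data. The Python sketch below treats the two constructs as a stationary first-order process and computes the lag-1 correlations it implies. All values are assumptions for illustration: stabilities of .6, uncorrelated innovations with unit variance, and a cross-lagged path of .3 in absolute size, which is given a negative sign here to match the direction of the vulnerability model.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

# Rows and columns: 0 = self-esteem, 1 = depression (hypothetical values).
A = [[0.6, 0.0],     # self-esteem depends only on its own past
     [-0.3, 0.6]]    # depression depends on past self-esteem and its own past
Q = [[1.0, 0.0], [0.0, 1.0]]   # innovation covariance matrix

# Stationary covariance: iterate sigma = A sigma A' + Q until it settles.
sigma = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(500):
    prop = matmul(matmul(A, sigma), transpose(A))
    sigma = [[prop[i][j] + Q[i][j] for j in range(2)] for i in range(2)]

# lagged[i][j] is the covariance of variable i at t+1 with variable j at t.
lagged = matmul(A, sigma)
scale = math.sqrt(sigma[0][0] * sigma[1][1])
r_se1_de2 = lagged[1][0] / scale
r_de1_se2 = lagged[0][1] / scale

print(f"se1 with de2: {r_se1_de2:.3f}")
print(f"de1 with se2: {r_de1_se2:.3f}")
```

With a true effect of self-esteem on depression and equal stabilities, the self-esteem to depression correlation is clearly the stronger one, which is the pattern described for this simulation.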


The publication of Orth et al.’s (2020) article in JPSP is an embarrassment for the PPID section of JPSP. The authors did not make an innocent mistake. Their own analyses showed across 10 datasets that CLPM does not fit their data. One would expect that a team of researchers would be able to draw the correct conclusion from this finding. However, the power of motivated reasoning is strong. Rather than admitting that the vulnerability model of depression is based on a statistical artifact, the authors try to rationalize why the model with bad fit should not be rejected.

The authors write “the CLPM findings suggest that individual differences in self-esteem predict changes in individual differences in depression, consistent with the vulnerability model” (p. 39).

This conclusion is blatantly false. A finding in a model with bad fit should never be interpreted. After all, the purpose of fitting models to data and examining model fit is to falsify models that are inconsistent with the data. However, psychologists have been brainwashed into thinking that the purpose of data analysis is only to confirm theoretical predictions and to ignore evidence that is inconsistent with theoretical models. It is therefore not a surprise that psychology has a theory crisis. Theories are nothing more than hunches that guided first explorations and are never challenged. Every discovery in psychology is considered to be true. This does not stop psychologists from developing and supporting contradictory models, which results in an ever-growing number of theories and confusion. It is like evolution without a selection mechanism. No wonder psychology is making little progress.

Numerous critics of psychology have pointed out that nil-hypothesis testing can be blamed for the lack of development because null-results are ambiguous. However, this excuse cannot be used here. Structural equation modeling is different from null-hypothesis testing because significant results like a high Chi-square value and derived fit indices provide clear and unambiguous evidence that a model does not fit the data. To ignore this evidence and to interpret parameters in these models is unscientific. The fact that authors, reviewers, and editors were willing to publish these unscientific claims in the top journal of personality psychology shows how poorly methods and statistics are understood by applied researchers. To gain respect and credibility, personality psychologists need to respect the scientific method.

Personality Science: The Science of Human Diversity

I wrote a textbook about personality psychology. The textbook is an e-textbook with online engagement for students. I am going to pilot the textbook this fall with my students and revise it with some additional chapters in 2021.

The book also provides an up-to-date review of the empirical literature. The content is freely accessible through a demo version of the course.

Please provide comments, corrections, additional references, etc. in the comments section or email me directly at

A review of “Low self-esteem prospectively predicts depression in adolescence and young adulthood”

In 2007, I was asked to review a ms. about the relationship between self-esteem and depression. The authors used a cross-lagged panel model to examine “prospective prediction” which is a code word for causal claims in a non-experimental study. The problem is that the cross-lagged model is fundamentally flawed because it ignores stable traits and underestimates stability. To compensate for this flaw, it uses cross-lagged paths which leads to false and inflated cross-lagged effects, especially from the more stable to the less stable construct.
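A small worked example may make the artifact concrete. The sketch below assumes a trait-state model with no cross-lagged effects at all. The variance shares, the state autocorrelation, and the trait correlation are hypothetical values chosen for illustration, not estimates from any dataset. It computes the correlations implied by that model and then the standardized cross-lagged coefficients that a two-wave cross-lagged regression would recover from them.

```python
import math


def implied_cross_lags(trait_var_se, trait_var_de, state_ar, trait_corr):
    """Standardized cross-lagged paths a two-wave cross-lagged model recovers
    when the data come from a trait-state model WITHOUT cross-lagged effects.

    Assumptions of this sketch: all variables have unit variance, and the
    state components are uncorrelated across the two constructs.
    """
    # Retest correlations: stable trait variance plus carried-over state variance.
    r_se = trait_var_se + (1 - trait_var_se) * state_ar
    r_de = trait_var_de + (1 - trait_var_de) * state_ar
    # Only the traits are correlated, so the se-de correlation is identical
    # within a wave and across waves.
    r_x = trait_corr * math.sqrt(trait_var_se * trait_var_de)
    # Standardized partial regression weights, e.g. de2 on se1 controlling de1.
    se_to_de = (r_x - r_de * r_x) / (1 - r_x ** 2)
    de_to_se = (r_x - r_se * r_x) / (1 - r_x ** 2)
    return se_to_de, de_to_se


# Hypothetical values: self-esteem is more trait-like than depression.
se_to_de, de_to_se = implied_cross_lags(
    trait_var_se=0.6, trait_var_de=0.3, state_ar=0.5, trait_corr=-0.7
)
print(f"se -> de: {se_to_de:.3f}, de -> se: {de_to_se:.3f}")
```

With these made-up inputs, both cross-lagged coefficients come out negative even though the generating model contains none, and the path from the more stable construct (self-esteem) to the less stable one (depression) is the larger of the two.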

I wrote a long and detailed review that was ignored by the editor and the authors, and the flawed cross-lagged panel model was published (Orth, Robins, & Roberts, 2008). The article served as the basis for several follow-up articles (Orth, Robins, Meier, & Conger, 2016; Rieger, Göllner, Trautwein, & Roberts, 2016; Orth, Robins, Widaman, & Conger, 2014; Orth & Robins, 2013; Sowislo & Orth, 2013; Kuster, Orth, & Meier, 2012; Orth, Robins, Trzesniewski, Maes, & Schmitt, 2009; Orth, Robins, & Meier, 2009) and the main author continues to push the flawed cross-lagged panel model (Orth, Clark, Donnellan, & Robins, 2020), although he himself published a model with a trait factor to explain stability in self-esteem (Kuster & Orth, 2013). It is scientifically unjustified to omit this trait factor from bivariate models that relate self-esteem to depression, if ample evidence shows that a trait factor underlies stability of self-esteem (Kuster & Orth, 2013). So, an entire literature is based on a statistical artifact that has been well known for four decades (Rogosa, 1980).

I just found my old review while looking into a folder called “file drawer” and I thought I would share it here. It just shows how peer-review doesn’t serve the purpose of quality control and that ambition often trumps the search for truth.

Review – Dec / 3 / 2007

This article tackles an important question: What is the causal relation between depression and self-esteem? As always, at the most abstract level there are three answers to this question. Self-esteem causes (low) depression. Depression causes (low) self-esteem. The correlation is due to a third unobserved variable. To complicate matters, these causal models are not mutually exclusive. It is possible that all three causal models contribute to the observed correlations between self-esteem and depression.

The authors hope to test causal models by means of longitudinal studies, and their empirical data are better than data from many previous studies to examine this question. However, the statistical analyses have some shortcomings that may lead to false inferences about causality.

The first important question is the definition of depression and self-esteem. Depression and self-esteem can be measured in different ways. Self-esteem measures can measure state self-esteem or self-esteem in general. Similarly, depression measures can ask about depressive symptoms over a short time interval (a few weeks) or dispositional depression. The nature of the measure will have a strong impact on the observed retest correlations, even after taking random measurement error into account.

In the empirical studies, self-esteem was measured with a questionnaire that asks about general tendencies (Rosenberg’s self-esteem scale). In contrast, depression was assessed by asking about symptoms within the preceding seven days (CES-D).  Surprisingly, Study 1 shows no differences in the retest correlations of depression and self-esteem. Less surprising is the fact that in the absence of different stabilities, cross-lagged effects are small and hardly different from each other, whereas Study 2 shows differences in stability and asymmetrical patterns of cross-lagged coefficients. This pattern of results suggests that the cross-lagged coefficients are driven by the stability of the measures (see Rogosa, 1980, for an excellent discussion of cross-lagged panel studies).

The good news is that the authors’ data are suitable to test alternative models. One important alternative model would be a model that postulates two latent dispositions for depression and self-esteem (not a single common factor). The latent disposition would produce stability in depression and self-esteem over time. The lower retest correlations of depression would be due to more situational fluctuation of depressive symptoms. The model could allow for a correlation between the latent trait factors of depression and self-esteem. Based on Watson’s model, one would predict a very strong negative correlation between the two trait factors (but less than -1), while situational fluctuation of depression could be relatively weakly related to fluctuation in self-esteem.

The main difference between the cross-lagged model and the trait model concerns the pattern of correlations across different retest intervals. The cross-lagged model predicts a simplex structure (i.e., the magnitude of correlations decreases with increasing retest intervals). In contrast, the trait model predicts that retest correlations are unrelated to the length of the retest interval. With adequate statistical power, it is therefore possible to test these models against each other. With more complex statistical methods it is even possible to test a hybrid model that allows for all three causal effects (Kenny & Zautra, 1995).
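The contrast between the two patterns can be sketched numerically. The snippet below uses hypothetical values (a lag-1 stability of .7 for the simplex model and a trait variance share of .7 for the trait model) and assumes that the non-trait variance in the trait model is purely occasion-specific, so it contributes nothing to retest correlations.

```python
def simplex_correlations(lag1_stability, n_lags):
    """Retest correlations implied by a first-order autoregressive (simplex) model."""
    return [lag1_stability ** lag for lag in range(1, n_lags + 1)]


def trait_correlations(trait_variance_share, n_lags):
    """Retest correlations implied by a pure trait model: the same at every lag."""
    return [trait_variance_share for _ in range(n_lags)]


# Hypothetical values for a four-wave study (lags 1 to 3).
simplex = simplex_correlations(lag1_stability=0.7, n_lags=3)
trait = trait_correlations(trait_variance_share=0.7, n_lags=3)

for lag, (r_simplex, r_trait) in enumerate(zip(simplex, trait), start=1):
    print(f"lag {lag}: simplex r = {r_simplex:.2f}, trait r = {r_trait:.2f}")
```

Under the simplex model the correlation shrinks with every additional wave, while under the trait model it stays flat. That difference in the long-lag correlations is what allows the two models to be tested against each other.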

The present manuscript simply presents one model with adequate model fit. However, model fit is largely influenced by the measurement model. The measurement model fits the data well because it is based on parcels (i.e., parcels are made to be parallel indicators of a construct and are bound to fit well). Therefore, the fit indices are insensitive to the specification of the longitudinal pattern of correlations. To illustrate, global fit is based on the fit to a correlation matrix with 276 parameters (3 indicators * 2 constructs * 4 waves = 24 indicators; 24 * 23 / 2 = 276 correlations). At the latent level, there are only 28 parameters (2 constructs * 4 waves = 8 latent factors; 8 * 7 / 2 = 28 parameters). The cross-lagged model constrains only 12 of these parameters (12 / 276 < 5%). Thus, the fit of the causal model should be evaluated in terms of the relative fit of the measurement model to the structural model. Table 2 shows the relevant information. Surprisingly, it shows only a difference of 6 degrees of freedom between Model 2 and 3, where I would have expected 12 degrees of freedom difference (?). More important, with six degrees of freedom, the chi-square difference of 59 is quite large. Although the chi-square test may be overly sensitive, it would be important to know why the model fit is not better. My guess is that the model underestimates long-term stability due to the failure to include a trait component. The same test for Study 2 suggests a better fit of the cross-lagged model in Study 2. However, even a good fit does not indicate that the model is correct. A trait model may fit the data as well or even better.
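The counting argument in this paragraph can be checked with a few lines of arithmetic. The helper name below is made up for illustration. The figure of 12 constrained parameters is taken from the review as stated and is not derived here.

```python
def n_correlations(n_variables):
    """Number of distinct correlations among n variables: n * (n - 1) / 2."""
    return n_variables * (n_variables - 1) // 2


indicators = 3 * 2 * 4       # 3 parcels x 2 constructs x 4 waves
latent_factors = 2 * 4       # 2 constructs x 4 waves

observed = n_correlations(indicators)      # correlations among the indicators
latent = n_correlations(latent_factors)    # correlations among the latent factors
constrained = 12                           # taken from the review text

print(observed, latent, round(constrained / observed, 3))
```

The share of constrained parameters is what matters for the argument: the structural part of the model accounts for only a few percent of the correlations that drive global fit.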

Regarding Study 1, the authors commit the common fallacy of interpreting a non-significant result as evidence for the absence of an effect. Even if in Study 1, self-esteem was a significant (p < .05) lagged predictor of depression, and depression was not a significant (p > .05) lagged predictor of self-esteem, it is incorrect to conclude that self-esteem has an effect, but depression does not have an effect. Indeed, given the small magnitude of the two effects (-.04 vs -.10 in Figure 1) it is likely that these effects are not significantly different from each other (it is good practice in SEM studies to report confidence intervals, which would make it easier to interpret the results).

The limitation section does acknowledge that “the study designs do not allow for strong conclusions regarding the causal influence of self-esteem on depression.” However, without more detail and explicit discussion of alternative models, the importance of this disclaimer in the fine print is lost to most readers unfamiliar with structural equation modeling, and the statement seems to contradict the conclusions drawn in the abstract and causal interpretations of the results in the discussion (e.g., “Future research should seek to identify the mediating processes of the effect of self esteem on depression”).

I have no theoretical reasons to favor any causal model. I am simply trying to point out that alternative models are plausible and likely to fit the data as well as those presented in the manuscript. At a minimum a revision should acknowledge this, and present the actual empirical data (correlation tables) to allow other researchers to test alternative models.

Why do men report higher levels of self-esteem than women?

Self-esteem is one of the most popular constructs in personality/social psychology. The common approach to study self-esteem is to give participants a questionnaire with one or more questions (items). To study gender differences, the scores of multiple items are added up or averaged separately for men and women, and then subtracted from each other. If this difference score is not zero, the data show a gender difference. Of course, the difference will never be exactly zero. So, it is impossible to confirm the nil-hypothesis that men and women are exactly the same. A more interesting question is whether gender differences in self-esteem are fairly consistent across different samples and how large gender differences, on average, are. To answer this question, psychologists conduct meta-analyses. A meta-analysis combines findings from small samples into one large sample.

The first comprehensive meta-analysis of self-esteem reported a small difference between men and women, with men reporting slightly higher levels of self-esteem than women (Kling et al., 1999). What does a small difference look like? First, imagine that you have to predict whether 50 men and 50 women are above or below the average (median) in self-esteem, but the only information that you have is their gender. If there were no difference between men and women, gender would give you no reliable information, and you might just flip a coin and have a 50% chance of guessing correctly. However, given the information that men are slightly more likely to be above average in self-esteem, you guess above average for men and below average for women. This blatant stereotype helps you to be correct 54% of the time, but you are still incorrect in your guesses 46% of the time.
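The 54% figure can be reproduced with a short calculation. The sketch below rests on assumptions that are not stated in the text: equal numbers of men and women, normally distributed ratings with equal standard deviations, and a standardized difference of d = 0.21 for self-esteem ratings. Under those assumptions the pooled median sits halfway between the two group means.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def guess_accuracy(d):
    """Probability of a correct guess when predicting 'above the median' for
    every man and 'below the median' for every woman.

    Assumes men's mean is at +d/2 and women's at -d/2 (in SD units), so the
    pooled median is 0 and each group is on its predicted side with
    probability Phi(d / 2).
    """
    return normal_cdf(d / 2.0)


accuracy = guess_accuracy(0.21)
print(f"correct guesses: {accuracy:.0%}")
```

With d = 0 the function returns exactly 50%, which is the coin-flip baseline described above.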

Another way to get a sense of the magnitude of the effect size is to compare it to well-known, large gender differences. One of the largest gender differences that is also easy to measure is height. Men are 1.33 standard deviations taller than women, while the difference in self-esteem ratings is only 0.21 standard deviations. This means the difference in self-esteem is only about 16% of the difference in height (0.21 / 1.33).

A more recent meta-analysis found an even smaller difference of d = .11 (Zuckerman & Hall, 2016). A small difference increases the probability that gender differences in self-esteem ratings may be even smaller or even in the opposite direction in some populations. That is, while the difference in height is so large that it can be observed in all human populations, the difference in self-esteem is so small that it may not be universally observed.

Another problem with small effects is that they are more susceptible to the influence of systematic measurement error. Unfortunately, psychologists rarely examine the influence of measurement error on their measures. Thus, this possibility has not been explored.

Another problem is that psychologists tend to rely on convenience samples, which makes it difficult to generalize findings to the general population. For example, psychology undergraduate samples select for specific personality traits that may make male or female psychology students less representative of their respective gender.

It is therefore problematic to draw premature conclusions about gender differences in self-esteem on the basis of meta-analyses of self-esteem ratings in convenience samples.

What Explains Gender Differences in Self-Esteem Ratings?

The most common explanations for gender differences in self-esteem are gender roles (Zuckerman & Hall, 2016) or biological differences (Schmitt et al., 2016). However, there are few direct empirical tests of these hypotheses. Even biologically oriented researchers recognize that self-esteem is influenced by many different factors, including environmental ones. It is therefore unlikely that biological sex differences have a direct influence on self-esteem. A more plausible model would assume that gender differences in self-esteem are mediated by a trait that shows stronger gender differences and that predicts self-esteem. The same holds for social theories. It seems unlikely that women rely on gender stereotypes to evaluate their self-esteem. It is more plausible that they rely on attributes that show gender differences. For example, Western societies have different beauty standards for men and women, and women tend to have lower self-esteem in ratings of their attractiveness (Gentile et al., 2009). Thus, a logical next step is to test mediation models. Surprisingly, few studies have explored well-known predictors of self-esteem as potential mediators of gender differences in self-esteem.

Personality Traits and Self-Esteem

Since the 1980s, thousands of studies have measured personality from the perspective of the Five Factor Model. The Big Five capture variation in negative emotionality (Neuroticism), positive energy (Extraversion), curiosity and creativity (Openness), cooperation and empathy (Agreeableness), and goal-striving and impulse-control (Conscientiousness). Given the popularity of self-esteem and the Big Five in personality research, many studies have examined the relationship between the Big Five and self-esteem, while other studies have examined gender differences in the Big Five traits.

Studies of gender differences show the biggest and most consistent differences for neuroticism and agreeableness. Women tend to score higher on both dimensions than men. The results for the Big Five and self-esteem are more complicated. Simple correlations show that higher self-esteem is associated with lower Neuroticism and higher Extraversion, Openness, Agreeableness, and Conscientiousness (Robins et al., 2001). The problem is that Big Five measures have a denotative and an evaluative component. Being neurotic does not only mean responding more strongly with negative emotions; it is also undesirable. Using structural equation modeling, Anusic et al. (2009) separated the denotative and evaluative components and found that self-esteem was strongly related to the evaluative component of personality ratings. This evaluative factor in personality ratings was first discovered by Thorndike (1920) one hundred years ago. The finding that self-esteem is related to overly positive self-ratings of personality is also consistent with a large literature on self-enhancement. Individuals with high self-esteem tend to exaggerate their positive qualities.

Interestingly, there are very few published studies of gender differences in self-enhancement. One possible explanation for this is that there is only a weak relationship between gender and self-enhancement. The rationale is that gender is routinely measured and that many studies of self-enhancement could have examined gender differences. It is also well known that psychologists are biased against null-findings. Thus, ample data without publications suggest that there is no strong relationship. However, a few studies have found stronger self-enhancement for men than for women. For example, one study showed that men overestimate their intelligence more than women (von Stumm et al., 2011). There is also evidence that self-enhancement and halo predict biases in intelligence ratings (Anusic et al., 2009). However, it is not clear whether gender differences are related to halo or are specific to ratings of intelligence.

In short, a review of the literature on gender and personality and personality and self-esteem suggests three potential mediators of the gender differences in self-esteem. Men may report higher levels of self-esteem because they are lower in neuroticism, lower in agreeableness, or higher in self-enhancement.

Empirical Test of the Mediation Model

I used data from the Gosling–Potter Internet Personality Project (Gosling, Vazire, Srivastava, & John, 2004). Participants were visitors of a website who were interested in taking a personality test and receiving feedback about their personality. The advantage of this sampling approach is that it creates a very large dataset with millions of participants. The disadvantage is that men and women who visited this site might differ in personality traits or self-esteem. The questionnaire included a single-item measure of self-esteem. This item shows the typical gender difference in self-esteem (Bleidorn et al., 2016).

To separate descriptive factors of the Big Five from evaluative bias and acquiescence bias, I fitted a measurement model to the 44-item Big Five Inventory. I demonstrated earlier that this measurement model has good fit for Canadian participants (Schimmack, 2019). To test the mediation model, I added gender and self-esteem to the model. In this study, gender was measured with a simple dichotomous male vs. female question.

Gender was a predictor of all 7 factors (Big Five + Halo + Acquiescence). Exploratory analysis examined whether gender had unique relationships with specific BFI items. These relationships could be due to unique relationships of gender with specific personality traits called facets. However, few notable relationships were observed. Self-esteem was predicted by all seven personality traits and gender. However, openness to experience showed weak relationships with self-esteem. To stabilize the model, this path was fixed to zero.

I fitted the model to data from several nations. I selected nations with (a) a large number of complete data (N = 10,000) and (b) familiarity with English as a first or common second language (e.g., India = yes, Japan = no), while trying to sample a diverse range of cultures because gender differences in self-esteem tend to vary across cultures (Bleidorn et al., 2016; Zuckerman & Hall, 2016). I fitted the model to samples from four nations: US, Netherlands, India, and Philippines with N = 10,000 for each nation. Table 1 shows the results.

The first two rows show the fit of the Canadian model to the other four nations. Fit is a bit lower for Asian samples, but still acceptable.

The results for sex differences in the Big Five are a bit surprising. Although all four samples show the typical gender difference in neuroticism, the effect sizes are relatively small. For agreeableness, the gender differences in the two Asian samples are negligible. This raises some concerns about the conclusion that gender differences in personality traits are universal and possibly due to evolved genetic differences (Schmitt et al., 2016). The most interesting novel finding is that there are no notable gender differences in self-enhancement. This also implies that self-enhancement cannot mediate gender differences in self-esteem.

The strongest predictor of self-esteem is self-enhancement. Effect sizes range from d = .27 in the Netherlands to d = .45 in the Philippines. The second strongest predictor is neuroticism. As neuroticism also shows consistent gender differences, neuroticism partially mediates the effect of gender on self-esteem. Although weak, agreeableness is a consistent negative predictor of self-esteem. This replicates Anusic et al.’s (2009) finding that the sign of the relationship reverses when halo bias in agreeableness ratings is removed from measures of agreeableness.

The total effects show the gender differences in the four samples. Consistent with meta-analyses, the gender differences in self-esteem are weak, with effect sizes ranging from d = .05 to d = .15. Personality explains some of this relationship. The unexplained direct effect of gender is very small.


A large literature and several meta-analyses have documented small but consistent gender differences in self-ratings of self-esteem. Few studies have examined whether these differences are mere rating biases or tested causal models of these gender differences. This article addressed these questions by examining seven potential mediators: the Big Five traits as well as halo bias and acquiescence bias.

The results replicated previous findings that gender differences in self-esteem are small, d < .2. They also showed that neuroticism is a partial mediator of gender differences in self-esteem. Women tend to be more sensitive to negative information and this disposition predicts lower self-esteem. It makes sense that a general tendency to focus on negative information also extends to evaluations of the self. Women appear to be more self-critical than men. A second mediator was agreeableness. Women tend to be more agreeable and agreeable people tend to have lower self-esteem. However, this relationship was only observed in Western nations and not in Asian nations. This cultural difference explains why gender differences in self-esteem tend to be stronger in Western than in Asian cultures. Finally, a general evaluative bias in self-ratings of personality was the strongest predictor of self-esteem, but showed no notable gender differences. Gender also still had a very small relationship with self-esteem after accounting for personality mediators.

Overall, these results are more consistent with models that emphasize similarities between men and women (Men and Women are from Earth) than models that emphasize gender differences (Women are from Venus and Men are from Mars). Even if evolutionary theories of gender differences are valid, they explain only a small amount of the variance in personality traits and self-esteem. As one evolutionary psychologist put it, “it is undeniably true that men and women are more similar than different genetically, physically and psychologically” (p. 52). The results also undermine claims that women internalize negative stereotypes about them and have notably lower self-esteem as a result. Given the small effect sizes, it is surprising how much empirical and theoretical attention gender differences in self-esteem have received. One reason is that psychologists often ignore effect sizes and only care about the direction of an effect. Given the small effect size of gender on self-esteem, it seems more fruitful to examine factors that produce variation in self-esteem for men and women.

Lies, Damn Lies, and Experiments on Attitude Ratings

Ten years ago, social psychologists had a once-in-a-lifetime opportunity to realize that most of their research is bullshit. Their esteemed colleague Daryl Bem published a hoax article about extrasensory perception in their esteemed Journal of Personality and Social Psychology. The editors felt compelled to write a soul-searching editorial about research practices in their field that could produce such nonsense results. However, 10 years later social psychologists continue to use the same questionable practices to publish bullshit results in JPSP. Moreover, they are willfully ignorant of any criticism of their field, which is producing mostly pseudo-scientific garbage. Just now, Wegener and Petty, two social psychologists at Ohio State University, wrote an article that downplays the importance of replication failures in social psychology. At the same time, they published a JPSP article that shows they haven’t learned anything from 10 years of discussion about research practices in psychology. I treat the first author as an innocent victim who is being trained in the dark art of research practices that have given us social priming, ego-depletion, and time-reversed sexual arousal.

The authors report seven studies. We don’t know how many other studies were run. The seven studies are standard experiments with one or two (2 x 2) experimental manipulations between subjects. The studies are quick online studies with Mturk samples. The main goal was to show that some experimental manipulations influence some ratings that are supposed to measure attitudes. Any causal effect on these measures is interpreted as a change in attitudes.

The problem for the authors is that their experimental manipulations have small effects on the attitude measures. So, individually, studies 1-6 would not show any effects. At no point did they consider this a problem and increase sample sizes. However, they were able to fix the problem by combining studies that were similar enough into one dataset. This was also done by Bem to produce significant results for time-reversed causality. It is not a good practice, but that doesn’t bother editors and reviewers at the top journal of social psychology. After all, they all do not know how to do science.

So, let’s forget about the questionable studies 1-6 and focus on the preregistered replication study with 555 Mturk workers (Study 7). The authors analyze their data with a mediation model and find statistically significant indirect effects. The problem with this approach is that mediation no longer has the internal validity of an experiment. Spurious relationships between mediators and the DV can inflate these indirect effects. So, it is also important to demonstrate that there is an effect by showing that the manipulation changed the DV (Baron & Kenny, 1986). The authors do not report this analysis. The authors also do not provide information about standardized effect sizes to evaluate the practical significance of their manipulation. However, the authors did provide covariance matrices in a supplement and I was able to run the analyses to get this information.

Here are the results.

The main effect for the bias manipulation is d = -.04, p = .38, 95%CI = -.12, .05.

The main effect for the untrustworthiness manipulation is d = .01, p = .75, 95%CI = -.07, .10.

Both effects are not significant. Moreover, the effect size is so small and thanks to the large sample size the confidence intervals are so narrow that we can reject the hypothesis that the manipulations have at least a small effect, d = .2.
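The logic of this inference can be written out as a simple check. The sketch below is a simplification: it treats the 95% confidence intervals reported above as given and asks only whether each interval lies entirely inside the range from -.2 to .2. The function name and the symmetric bound are choices made for this illustration.

```python
def excludes_small_effect(ci_lower, ci_upper, smallest_effect=0.2):
    """True if the whole confidence interval lies strictly inside
    (-smallest_effect, +smallest_effect), so an effect of that size
    in either direction is outside the interval."""
    return -smallest_effect < ci_lower and ci_upper < smallest_effect


# 95% confidence intervals for the two main effects reported above.
intervals = {
    "bias manipulation": (-0.12, 0.05),
    "untrustworthiness manipulation": (-0.07, 0.10),
}

checks = {name: excludes_small_effect(lo, hi) for name, (lo, hi) in intervals.items()}
for name, ok in checks.items():
    print(f"{name}: small effect excluded = {ok}")
```

Both intervals stay well inside the bounds, which is why the data allow the stronger conclusion that the manipulations do not have even a small effect, rather than merely failing to show one.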

So, here we see the total failure of social psychology to understand what they are doing and their inability to make a real contribution to the understanding of attitudes and attitude change. This didn’t stop Rich Petty from co-authoring an article about psychology’s contribution to addressing the Covid-19 pandemic. Now, it would be unfair to blame 150,000 deaths on social psychology, but it is a fact that 40 years of trivial experiments have done little to help us change attitudes like attitudes towards wearing masks in the real world.

I can only warn young, idealistic students against considering social psychology as a career path. I speak from experience. I was a young idealistic student eager to learn about social psychology in the 1990s. If I could go back in time, I would have done something else with my life. In 2010, I thought social psychology might actually change for the better, but in 2020 it is clear that most psychologists want to continue with their trivial experiments that tell us nothing about social behaviour. If you just can’t help it and want to study social phenomena, I recommend personality psychology or other social sciences.

A Hierarchical Factor Analysis of Openness to Experience

In this blog post I report the results of a hierarchical factor analysis of 16 primary openness to experience factors. The data were obtained and made public by Christensen, Cotter, and Silvia (2019). The dataset contains correlations for 138 openness items taken from four different Big Five measures (NEO-PI3, HEXACO, BFAS, & Woo). The sample size was N = 802.

The authors used network analysis to examine the relationship among the items. In the network graph, the authors identified 10 clusters (communities) of items. Some of these clusters combine overlapping constructs in different questionnaires. For example, aesthetic appreciation is represented in all four questionnaires.

This is a good first step, but Figure 1 leaves many questions unanswered. Mainly, it does not provide quantitative information about the relationship of the clusters to each other. The main reason is that network analysis does not have a representation of the forces that bind items within a cluster together. This information was presented in a traditional correlation table based on sum scores of items. The problem with sum scores is that correlations between sum scores can be distorted by secondary loadings. Moreover, there is no formal test that 10 clusters provide an accurate representation of item-relationships. As a result, there is no test of this model against other plausible models. The advantage of structural equation modeling with latent variables is that it is possible to represent unobserved constructs like Openness and to test the fit of a model to the data.

Despite the advantages of structural equation modeling (SEM), many researchers are reluctant to use it for a number of unfortunate reasons. First, structural equation modeling has been called Confirmatory Factor Analysis (CFA). This has led to the misperception that SEM can only be used to test theoretical models. However, it is not clear how one would derive a theoretical model that perfectly fits data without exploration. I use SEM to explore the structure of openness without an a priori theoretical model. This is no more exploratory than visual inspection of a network representation of a correlation matrix. There is no good term for this use of SEM because the term exploratory factor analysis is used for a different mathematical model. So, I simply call it SEM.

Another reason why SEM may not be used is that model fit can show that a specified model does not fit the data. It can be time consuming and require thought to create a model that actually fits the data. In contrast, EFA and network models always provide a solution even if the solution is suboptimal. This makes SEM harder to use than other exploratory methods. However, with some openness to new ideas and persistence, it is also always possible to find a fitting model with SEM. This does not mean it is the correct model, but it is also possible to compare models to each other with fit indices.

SEM is a very flexible tool and its capabilities have often not been fully recognized. While higher-order or two-level models are fairly common, models with more than two levels are rare, but can be easily fit to data that have a hierarchical structure. This is a useful feature of SEM because theoretical models have postulated that personality is hierarchically structured with several levels: The global level, aspects, facets, and even more specific traits called nuances below facets. However, nobody has attempted to fit a hierarchical model to see whether Openness has an aspect, a facet, and a nuance level. Christensen et al.’s data seemed ideally suited to examine this question.

One limitation of SEM is that modeling becomes increasingly difficult as the number of items increases. On the other hand, three items per construct are sufficient to create a measurement model at the lowest level in the hierarchy. I therefore first conducted simple CFA analyses of items belonging to the same scale and retained items with high loadings on the primary factor and no notable residual correlations with other items. I did not use the 20 aspect items because they were not designed to measure clean facets of Openness. This way, I only needed to fit a total of 48 items for the 16 primary scales of Openness in the three questionnaires:

NEO: Artistic, Ideas, Fantasy, Feeling, Active, Values
HEXACO: Artistic, Inquisitive, Creative, Unconventional
Woo: Artistic, Culture, Tolerance, Creative, Depth, Intellect

Exploratory analysis showed that the creative scales in the HEXACO and Woo measures did not have unique variance and could be represented by a single primary factor. This was also the case for the artistic construct in the HEXACO and Woo measures. However, the NEO artistic items showed some unique variance and were modeled as a distinct construct, although this could just be some systematic method variance in the NEO items.

The final model (MPLUS syntax) had reasonably good fit to the data, RMSEA = .042, CFI = .903. This fit was obtained after exploratory analyses of the data and simply shows that it was possible to find a model that fits the data. A truly confirmatory test would require new data and fit is expected to decrease because the model may have overfitted the data. To obtain good model fit it was necessary to include secondary loadings of items. Cross-validation can be used to confirm that these secondary loadings are robust. All of this is not particularly important because the model is exploratory and provides a first attempt at fitting a hierarchical factor model to the Openness domain.

In Figure 2, the boxes represent primary factors that represent the shared variance among three items. The first noteworthy difference from the network model is that there are 14 primary constructs compared to 10 clusters in the network model. However, NEO-Artistic (N-Artistic) is strongly related to the W/H-Artistic factor and could be combined with it while allowing some systematic measurement error in the NEO items. So, conceptually, there are only 13 distinct constructs. This still leaves three more constructs than the network analysis identified. The reason for this discrepancy is that there is no strict criterion for deciding at which point a cluster may reflect two related sub-clusters.

Figure 2 shows a hierarchy with four levels. For example, creativity (W/H-Creative) is linked to Openness through an unmeasured facet (Facet-2) and artistic (W/H-Artistic). This also means that creative is only weakly linked to Openness as the indirect path is the product of the three links, .9 * .7 * .5 = .315, or about .3. This means that Openness explains only about 10% of the variance in the creativity factor.
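The arithmetic behind this indirect path can be checked with a short sketch. It only uses the three rounded loadings quoted in the text; the function name `indirect_effect` is a hypothetical helper for illustration and is not part of the original Mplus analysis.

```python
from math import prod

def indirect_effect(loadings):
    """Multiply standardized path coefficients along a chain of factors."""
    return prod(loadings)

# Rounded loadings quoted in the text for the path from Openness
# through Facet-2 and W/H-Artistic to W/H-Creative.
path = [0.9, 0.7, 0.5]

effect = indirect_effect(path)
# For a standardized effect, the square is the share of variance explained.
explained = effect ** 2

print(f"indirect effect: {effect:.3f}")
print(f"variance explained: {explained:.1%}")
```

The product of the three links is .315, so the squared effect is just under 10% of the variance. The same helper can be reused for the two-link path to Openness to Feelings.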

In factor analysis it is common to use loadings greater than .6 for markers that can be used to measure a construct and to interpret its meaning. I highlighted constructs that are related .6 or higher with the Openness factor. The most notable marker is the NEO-Ideas factor with a direct loading of .9. This suggests that the core feature of Openness is to be open to new ideas. Other markers are Woo’s curiosity factor and, mediated by the Facet-2 factor, the HEXACO inquisitive factor. So, core features of Openness are being open to new ideas, being curious, and being inquisitive. Although these labels sound very similar, the actual constructs are not redundant. The other indicators that meet the .6 threshold are artistic and unconventional.

Other primary factors differ greatly in their relatedness to the Openness factor. Openness to Feeling’s relationship is particularly weak, .4 * .4 = .16, and suggests that openness to feelings is not a feature of Openness or that the NEO-Feelings items are poor measures of this construct.

Finally, it is noteworthy that the model provides no support for the Big Five Aspects Model that postulates a level with two factors between Openness and Openness Factors. It is particularly troubling that the intellect aspect is most strongly related to Woo’s intellectual efficiency factor (W-Intellect, effect size r = .6), and only weakly related to the ideas factor (N-Ideas, r = .2), and the curiosity factor (W-Curious, r = .2). As Figure 2 shows, (self-rated) intellectual abilities are a distinct facet and not a broader aspect with several subordinate facets. The Openness facet is most strongly related to artistic (W/H artistic, r = .4), with weaker relationships to feelings, fantasy, and ideas (all r = .2). The problem with the development of the Big Five Aspects Model was that it relied on Exploratory Factor Analysis that is unable to test hierarchical structures in data. Future research on hierarchical structures of personality should use Hierarchical Factor Analysis.

In conclusion, SEM is capable of fitting hierarchical models to data. It is therefore ideally suited to test hierarchical models of personality. Why is nobody doing this? Orthodoxy has relegated SEM to confirmatory analysis of models that never fit the data because we need to explore before we can build theories. It requires high openness to new ideas, unconventionality, curiosity, and inquisitiveness to break with conventions and to use SEM as a flexible and powerful statistical tool for data exploration.

Open SOEP: Spousal Similarity in Personality

Abstract: I examined spousal similarity in personality using four waves of data over a 12-year period in the German Socio-Economic Panel. There is very little spousal similarity in actual personality traits like the Big Five. However, there is a high similarity in the halo rating bias between spouses.

Spousal similarity in personality is an interesting topic for several reasons. First, there are conflicting folk ideas about spousal similarity. One saying assumes that “birds of the same feather flock together;” another says that “opposites attract.” Second, there is large interest in the characteristics people find attractive in a mate. Do extraverts find other extraverts more attractive? Would assertive (low agreeableness) individuals prefer a mate who is as assertive as they are or rather somebody who is submissive (high agreeableness)? Third, we might wonder whether spouses become more similar to each other over time. Finally, twin studies of heritability make the assumption that mating is random; an assumption that can be questionable.

Given so many reasons to study spousal similarity in personality, it is surprising how little attention this topic has received. A literature search retrieved only a few articles with few citations: Watson, Beer, & McDade-Montez (2014) [20 citations], Humbad, Donnellan, Iacono, McGue, & Burt (2010) [30 citations], Rammstedt & Schupp (2008) [25 citations]. One possible explanation for this lack of interest could be that spouses are not similar in personality traits. It is well-known that psychology has a bias against null-results; that is, the lack of statistical relationships. Another possibility is that spousal similarity is small and difficult to detect in small convenience samples that are typical in psychology. In support of the latter explanation, two of the three studies had large samples and did report spousal similarity in personality.

Humbad et al. (2010) found rather small correlations between husbands’ and wives’ personality scores in a sample of 1,296 married couples. With the exception of traditionalism, r = .49, all correlations were below r = .2, and the median correlation was r = .11. They also found that spousal similarity did not change over time, suggesting that the little similarity there is can be attributed to assortative mating (marrying somebody with similar traits).

Rammstedt and Schupp (2008) used data from the German Socio-Economic Panel (SOEP), an annual survey of representative household samples. In 2005, the SOEP included for the first time a short 15-item measure of the Big Five personality traits. The sample included 6,909 couples. This study produced several correlations greater than r = .2, for agreeableness, r = .25, conscientiousness, r = .31, and openness, r = .33. The lowest correlation was obtained for extraversion, r = .10. A cross-sectional analysis with length of marriage showed that spousal similarity was higher for couples who were married longer. For example, spousal similarity for openness increased from r = .26 for newlyweds (less than 5 years of marriage) to r = .47 for couples married more than 40 years.

A decade later it is possible to build on Rammstedt and Schupp’s results because the SOEP has collected three more waves with personality assessments in 2009, 2013, and 2017. This makes it possible to examine spousal similarity over time and to separate spousal similarity in stable dispositions (traits) and in deviations from the typical level (states).

I start with simple correlations, separately for each of the four waves using all couples that were available at a specific wave. The most notable observation is that the correlations do not increase over time. In fact, they even show a slight trend to decrease. This provides strong evidence that spouses are not becoming more similar to each other over time. An introvert who marries an extravert does not become more extraverted as a result or vice versa.

Trait    W1 (N = 6,263)    W2 (N = 5,905)    W3 (N = 5,404)    W4 (N = 7,805)

I repeated the analysis using only couples who stayed together and participated in all four waves. The sample size for this analysis was N = 1,860.


The correlations were not stronger and did not increase over time.

The next analysis examined correlations over time. If spousal similarity is driven by assortment on some stable trait, husbands’ scores in 2005 should still be correlated with wives’ scores in 2017 and vice versa. To ensure comparability for different time lags, I only used couples who stayed in the survey for all four waves (N = 1,860).

Trait    2005 Trait    2009 Trait    2013 Trait    2017 Trait
2005 Neuroticism.
2005 Extraversion.040-.02-.02
2005 Openness.
2005 Agreeableness.
2005 Conscientiousness.

The results show more similarity on the same occasion (2005/2005) than across time. Across-time correlations are all below .2 and are decreasing. However, there are some small correlations of r = .1 for Openness, Agreeableness, and Conscientiousness, suggesting some spousal similarity in the stable trait variance. Another question is why spouses show similarity in the changing state variance.
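The pattern described here, more similarity on the same occasion than across time, is what a simple trait-state decomposition would produce. The following is a minimal simulation sketch, not an analysis of the SOEP data. All numbers (sample size, variance shares, and the spousal correlations for trait and state components) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000                      # hypothetical number of couples
r_trait, r_state = 0.15, 0.30   # assumed spousal similarity in trait and state
w_trait, w_state = 0.6, 0.4     # assumed variance shares of trait and state

def correlated_pair(r):
    """Return two standard-normal vectors that correlate r in expectation."""
    shared = rng.standard_normal(n)
    a = np.sqrt(r) * shared + np.sqrt(1 - r) * rng.standard_normal(n)
    b = np.sqrt(r) * shared + np.sqrt(1 - r) * rng.standard_normal(n)
    return a, b

trait_h, trait_w = correlated_pair(r_trait)      # stable across waves
state_h05, state_w05 = correlated_pair(r_state)  # wave-specific, 2005
state_h17, state_w17 = correlated_pair(r_state)  # wave-specific, 2017

def score(trait, state):
    """Observed score = weighted sum of a stable trait and an occasion state."""
    return np.sqrt(w_trait) * trait + np.sqrt(w_state) * state

h05 = score(trait_h, state_h05)   # husbands in 2005
w05 = score(trait_w, state_w05)   # wives in 2005
w17 = score(trait_w, state_w17)   # wives in 2017

# Population expectation: .6 * .15 + .4 * .30 = .21 on the same occasion,
# but only .6 * .15 = .09 across time, because states are wave-specific.
same_occasion = np.corrcoef(h05, w05)[0, 1]
across_time = np.corrcoef(h05, w17)[0, 1]
print(round(same_occasion, 2), round(across_time, 2))
```

Under these assumptions, the across-time correlation reflects only similarity in the stable trait component, whereas the same-occasion correlation also picks up similarity in whatever changes from wave to wave, be it real state change or correlated rating bias.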

There are two possible explanations for spousal similarity in personality state variance. One explanation is that spouses’ personality really changes in sync, just like their well-being changes in the same direction over time (Schimmack & Lucas, 2010). Another explanation is that spouses’ self-ratings are influenced by rating biases and that these rating biases are correlated (Anusic et al., 2009). To test these alternative hypotheses, I fitted a measurement model to the Big Five scales that distinguishes halo bias in personality ratings from actual variance in personality. I did this for the first and the last wave (2005, 2017) to separate similarity in the stable trait variance from similarity in state variance.

The key finding is that there is high spousal similarity in halo bias. Some couples are more likely to exaggerate their positive qualities than others. After removing this bias, there is relatively little spousal similarity for the actual trait variance.

Factor    Trait    State 2005    State 2017

In conclusion, spouses are not very similar in their personality traits. This may explain why this topic has received so little attention in the scientific literature. Null-results are often considered uninteresting. However, these findings do raise some questions. Why don’t extraverts marry extraverts, and why don’t conscientious people marry conscientious people? Wouldn’t they be happier with somebody who is similar in personality? Research with the SOEP data suggests that this is also not the case. Maybe the Big Five traits are not as important for marital satisfaction as we think. Maybe other traits are more important. Clearly, human mating is not random, but it is also not based on matching personality traits.

We don’t forget and until Bargh apologizes we will not forgive

John Bargh is a controversial social scientist with a knack for getting significant results when others cannot (Bargh in Bartlett, 2012). When somebody failed to replicate his most famous elderly-priming results (he published two exact replication studies, 2a and 2b, that were both successful, p < .05), he wrote a blog post. The blog post blew up in his face and he removed it. For a while, it looked as if this historic document was lost, but it has been shared online. Here is another link to it: Nothing in their heads

Personality x Situation Interactions: A Cautionary Note

Abstract: No robust and reliable interaction effects of the Big Five personality traits and unemployment on life-satisfaction in the German Socio-Economic Panel.

With the exception of the late Walter Mischel, Lee Ross, and Dick Nisbett, we are all interactionists (ok, maybe Costa & McCrae are guilty of dispositionism). As Lewin told everybody in 1934, behaviour is a function of the person and the situation, and the a priori probability that the interaction effect between the two is zero (that is, that the nil-hypothesis is true) is pretty much zero. So, our journals should be filled with examples of personality x situation interactions. Right? But they are not. Every once in a while, when I try to update my lecture notes and look for good examples of a personality x situation interaction, I can’t find any. One reason is of course the long history of studying situations and traits separately. However, experience sampling studies emerged in the 1980s and the data are ideally suited to look for interaction effects. Another problem is that interaction effects can be difficult to demonstrate because you need large samples to get significant results.

This time I had a solution to my problems. I have access to the German Socio-Economic Panel (SOEP) data. The SOEP has a large sample (N > 10,000), measured the Big Five four times over a 12-year period and many measures of situations like marriage, child birth, or unemployment. So, I could just run an analysis and find a personality x situation interaction. After all, in large samples, you always get p < .05. Right? If you think so, you might be interested to read on and find out what happened.

The Big Five were measured for the first time in 2005 (wave v). I picked unemployment and neuroticism as predictors because it is well-known that neuroticism is a personality predictor of life-satisfaction and unemployment is a situational predictor of life-satisfaction. It also made sense that neurotic people might respond more strongly to a negative life-event. However, contrary to these expectations, the interaction was far from significant (p = .5), while the main effects of unemployment (-1.5) and neuroticism (-.5) were highly significant. The effect of unemployment is equivalent to a change by three standard deviations in neuroticism.
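For readers who want to see what such a test looks like, here is a minimal sketch of a regression with a personality x situation product term. It uses simulated data, not the SOEP. The main effects (-1.5 and -.5) are taken from the text; the sample size, the 8% unemployment rate, the intercept, the residual standard deviation, and a true interaction of zero are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000                                    # assumed sample size
neuro = rng.standard_normal(n)                # standardized neuroticism
unemp = (rng.random(n) < 0.08).astype(float)  # 1 = unemployed (assumed 8% rate)

# Simulated outcome: main effects from the text, true interaction set to zero.
life_sat = 7.0 - 1.5 * unemp - 0.5 * neuro + rng.normal(0, 1.5, n)

# Design matrix: intercept, unemployment, neuroticism, and their product.
X = np.column_stack([np.ones(n), unemp, neuro, unemp * neuro])
beta, *_ = np.linalg.lstsq(X, life_sat, rcond=None)

# Conventional OLS standard errors based on the residual variance.
resid = life_sat - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

names = ["intercept", "unemployed", "neuroticism", "interaction"]
for name, b, s in zip(names, beta, se):
    print(f"{name:12s} b = {b:6.2f}  SE = {s:.2f}")
```

Note that the standard error of the product term is much larger than that of the neuroticism main effect because only the small unemployed group contributes information about the interaction. This is one reason why interaction effects need large samples.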

Undeterred, I looked for interactions with the other Big Five dimensions. Surely, I would find an explanation for the interaction when I found one. To make things simple, I added all five interactions to the model and, hooray, a significant interaction with conscientiousness popped up, p = .02.

Was I the first to discover this? I quickly checked for articles and of course somebody else had beaten me to the punch. There it was. In 2010, Boyce, Wood, and Brown had used the SOEP data to show that conscientious people respond more strongly to the loss of a job.

Five years later, a follow-up article came to the same conclusion.

A bit skeptical of a p-value of .02, I examined whether the interaction effect could be replicated. I ran the same analysis with the 2009 data.

The effect size was cut in half and the p-value was no longer significant, p = .25. However, the results did replicate that none of the other four Big Five dimensions moderated the effect of unemployment.

So, what about the 2013 wave? Again not significant, although the effect size is again negative.

And what happened in 2017? A significant effect, hooray again, but this time the effect is positive.

Maybe the analyses are just not powerful enough. To increase power, we can include prior life-satisfaction as a predictor variable to control for some of the stable trait variance in life-satisfaction judgments. We are now only trying to predict changes in life-satisfaction in response to unemployment. In addition, we can include prior unemployment to make sure that the effect of unemployment is not due to some stable third variable.

We see that it is current unemployment that has a negative effect on life-satisfaction. Prior unemployment actually has a positive effect, suggesting some adaptation to long-term unemployment. Most important, the interaction between conscientiousness and current unemployment is not significant, p = .68.

The interaction was also non-significant in 2013, p = .69.

And there was no significant interaction in 2017, p = .38.

I am sure that I am not the first to look at this, especially given two published articles that reported a significant interaction. However, I suspect that nobody thought about sharing these results because the norm in psychology is still to report significant results. However, the key finding here appears to be that the Big Five traits do not systematically interact with a situation in explaining an important outcome.

So, I am still looking for a good demonstration of a personality x situation interaction that I can use for my lecture in the fall. Meanwhile, I know better than to use the published studies as an example.

Open Letter about Systemic Racism to the Editor of SPPS

Dear Margo Monteith,

It is very disappointing that you are not willing to retract an openly racist article that was published in your journal Social Psychological and Personality Science (SPPS) when Simine Vazire was editor of the journal and Lee Jussim was the action editor of the article in question (Cesario, Johnson, & Terrill, 2019). I have repeatedly pleaded with you to retract the article, which draws conclusions on the basis of false assumptions. I am even more stunned by your decision because you rejected my commentary on this racist article with the justification that a better criticism was submitted. This criticism was just published (Ross et al., 2020). It makes the same observation that I made in my critique; that is, the conclusion that there is no racial bias in policing and the use of force rests entirely on an invalid assumption. The original authors simply assume that police officers only encounter violent criminals or that they only encounter violent criminals when they use deadly force.

Maybe you are not watching the news, but the Black Lives Matter movement started because police often use deadly force against non-violent African Americans. In some cases, this is even documented on video. Please watch the murder of Tamir Rice, George Floyd, Philando Castile, and Eric Garner and then tell their families and friends that police only kill violent criminals. That is what SPPS is telling everybody with the mantle of scientific truth, but it is a blatantly false claim based on racist assumptions. So, why are you not retracting this offensive article?

Philando Castile:

Tamir Rice:

Eric Garner:

George Floyd:

So, why are you not retracting an article that makes an obviously false and offensive assumption? Do you think that a retraction would reflect badly on the reputation of your journal? In that case, you are mistaken. Research shows that journals that retract articles with false conclusions have higher impact factors and are more prestigious than journals that try to maintain a flawless image by avoiding retractions of bad science (Nature). So, your actions are not only offensive, but also hurt the reputation of SPPS and ultimately our science.

Your justification for not retracting the article is unconvincing.

“Just how to analyze data such as this is debated, mostly in criminology journals. (One can wonder what psychology was present in Cesario et al.’s study that led to publication in SPPS, but that’s another matter.) Cesario et al. made the important point that benchmarking with population data is problematic. Their methodology was imperfect. Ross et al. made important improvements. If one is interested in this question of police bias with benchmarking, the papers bring successive advances.”

Your response implies that you did not fully understand Ross et al.’s criticism of the offensive article. The whole approach of “benchmarking” is flawed. So, publishing an article that introduces a flawed statistical approach from criminology to psychology is dangerous. What if we would start using this approach to study other disparities? Ross et al. show that this would be extremely harmful to psychological science. It is important to retract an article that introduces this flawed statistical approach to psychologists. As an editor it is your responsibility to ensure that this does not happen.

It is particularly shocking and beyond comprehension that you resist retraction at the very same time that many universities and academics are keenly aware of the systemic racism in academia. This article about an issue that affects every African American was based on research funding to White academics, reviewed by White academics, approved by White academics, and now defended and not retracted by a White academic. How does your action promote diversity and inclusion? It is even more surprising that you seem to be blind to this systemic racism in the publication of this racist article given your research on prejudice and the funding you received to study these issues (CV). Can you at least acknowledge that it is very offensive to Black people to attribute their losses of lives entirely to violent crime?

Ulrich Schimmack