
The Black Box of Meta-Analysis: Personality Change

Psychologists treat meta-analyses as the gold standard to answer empirical questions. The idea is that meta-analyses combine all of the relevant information into a single number that reveals the answer to an empirical question. The problem with this naive interpretation of meta-analyses is that meta-analyses cannot provide more information than the original studies contained. If original studies have major limitations, a meta-analytic integration does not make these limitations disappear. Meta-analyses can only reduce random sampling error, but they cannot fix problems of original studies. However, once a meta-analysis is published, the problems are often ignored and the preliminary conclusion is treated as an ultimate truth.

In this regard, meta-analyses are like the collateralized debt obligations (CDOs) that were popular until they helped trigger the financial crisis in 2008. A CDO pools cash-flow-generating assets and repackages this asset pool into discrete tranches that can be sold to investors. The problem arises when a CDO is considered less risky than the underlying debt actually is: investors believe they are getting high returns at low risk, while the pooled debt is far riskier than they realize.

In psychology, the review process and publication in a top journal give the appearance that the information is trustworthy and can be cited as solid evidence. However, a closer inspection of the original studies might reveal that the results of a meta-analysis rest on shaky foundations.

Roberts et al. (2006) published a highly-cited meta-analysis in the prestigious journal Psychological Bulletin. The key finding of this meta-analysis was that personality levels change with age in longitudinal studies of personality.

The strongest change was observed for conscientiousness. According to the key figure, conscientiousness changes little during adolescence, when the prefrontal cortex is still developing, but increases from d ~ .4 to d ~ .9 between age 30 and age 70, that is, by about half a standard deviation.

Like many other consumers, I bought the main finding and used the results in my Introduction to Personality lectures without carefully checking the meta-analysis. However, when I analyzed new data from longitudinal studies with large national representative samples, I could not find the predicted pattern (Schimmack, 2019a, 2019b, 2019c). Thus, I decided to take a closer look at the meta-analysis.

Roberts and colleagues list all the studies that were used with information about sample sizes, personality dimensions, and the ages that were studied. Thus, it is easy to find the studies that examined conscientiousness with participants who were 30 years or older at the start of the study.

Table 1. Longitudinal Studies of Conscientiousness with Start Age 30+

Study | N | Weight | Start | 1st Interval | Max. Interval | ES
Costa et al. (2000) | 2274 | 0.44 | 41 | 9 | 9 | 0.00
Costa et al. (1980) | 433 | 0.08 | 36 | 6 | 44 | 0.00
Costa & McCrae (1988) | 398 | 0.08 | 35 | 6 | 46 | NA
Labouvie-Vief & Jain (2002) | 300 | 0.06 | 39 | 6 | 39 | NA
Branje et al. (2004) | 285 | 0.06 | 42 | 2 | 4 | NA
Small et al. (2003) | 223 | 0.04 | 68 | 6 | 6 | NA
P. Martin (2002) | 179 | 0.03 | 65 | 5 | 46 | 0.10
Costa & McCrae (1992) | 175 | 0.03 | 53 | 7 | 7 | 0.06
Cramer (2003) | 155 | 0.03 | 33 | 14 | 14 | NA
Haan, Millsap, & Hartka (1986) | 118 | 0.02 | 33 | 10 | 10 | NA
Helson & Kwan (2000) | 106 | 0.02 | 33 | 42 | 47 | NA
Helson & Wink (1992) | 101 | 0.02 | 43 | 9 | 9 | 0.20
Grigoriadis & Fekken (1992) | 89 | 0.02 | 30 | 3 | 3 |
Roberts et al. (2002) | 78 | 0.02 | 43 | 9 | 9 |
Dudek & Hall (1991) | 70 | 0.01 | 49 | 25 | 25 |
Mclamed et al. (1974) | 62 | 0.01 | 36 | 3 | 3 |
Cartwright & Wink (1994) | 40 | 0.01 | 31 | 15 | 15 |
Weinryb et al. (1992) | 37 | 0.01 | 39 | 2 | 2 |
Wink & Helson (1993) | 21 | 0.00 | 31 | 25 | 25 |
Total N / Average | 5144 | 1.00 | 41 | 11 | 19 |

There are 19 studies with a total sample size of N = 5,144 participants. However, sample sizes vary dramatically across studies, from a low of N = 21 to a high of N = 2,274. Table 1 shows the weight each study would receive if effect sizes were weighted by sample size. By far the largest study found no significant increase in conscientiousness. I tried to find information about effect sizes for the other studies, but the published articles didn't contain means or the information came from an unpublished source. I did not bother to obtain information from samples with fewer than 100 participants because they contribute only 8% to the total sample size; even big effects would be washed out by the larger samples.

The main conclusion that can be drawn from this information is that there is no reliable information to make claims about personality change throughout adulthood. If we assume that conscientiousness changes by half a standard deviation over a 40-year period, the average effect size for a decade is d = .12. For studies with even shorter retest intervals, the predicted effect size is even weaker. It is therefore highly speculative to extrapolate from this patchwork of data and make claims about personality change during adulthood.
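The weighting logic and the extrapolation above can be checked with a few lines of Python (a sketch; the sample sizes are taken from Table 1):

```python
# Sample sizes of the 19 studies in Table 1 (conscientiousness, start age 30+)
ns = [2274, 433, 398, 300, 285, 223, 179, 175, 155, 118,
      106, 101, 89, 78, 70, 62, 40, 37, 21]

total_n = sum(ns)                                      # 5,144 participants
weights = [n / total_n for n in ns]                    # sample-size weights
largest_weight = max(weights)                          # Costa et al. (2000): ~.44
small_share = sum(n for n in ns if n < 100) / total_n  # samples with N < 100: ~8%

# Extrapolation: d = .5 over 40 years implies a per-decade effect of
d_per_decade = 0.5 / 40 * 10                           # d = .125
```

This makes the point concrete: a single study carries 44% of the weight, and the studies too small to bother with jointly carry about 8%.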

Fortunately, much better information is now available from longitudinal panels with over a thousand participants who have been followed for 12 (SOEP) or 20 (MIDUS) years with three or four retests. Theories of personality stability and change need to be revisited in the light of this new evidence. Updating theories in the face of new data is at the basis of science. Citing an outdated meta-analysis as if it provided a timeless answer to a question is not.

How Valid are Short Big-Five Scales?

The first measures of the Big Five used a large number of items to measure personality. This made it difficult to include personality measures in studies because the assessment of personality would take up much of the survey time. Over time, shorter scales became available. One important short Big Five measure is the BFI-S (Lang et al., 2011). This 15-item measure has been used in several nationally representative, longitudinal studies such as the German Socio-Economic Panel (Schimmack, 2019a). These results provide unique insights into the stability of personality (Schimmack, 2019b) and the relationship of personality with other constructs such as life-satisfaction (Schimmack, 2019c). Some of these results overturn textbook claims about personality. However, critics argue that these results cannot be trusted because the BFI-S is an invalid measure of personality.

Thus, it is of critical importance to evaluate the validity of the BFI-S. Here I use Gosling and colleagues' data to examine the validity of the BFI-S. Previously, I fitted a measurement model to the full 44-item BFI (Schimmack, 2019d). It is straightforward to evaluate the validity of the BFI-S by examining the correlations of the 3-item BFI-S scale scores with the latent factors based on all 44 BFI items. For comparison purposes, I also show the correlations for the BFI scale scores. The complete results for individual items are shown in the previous blog post (Schimmack, 2019d).

The measurement model for the BFI has seven independent factors. Five factors represent the Big Five, and two factors represent method factors: one represents acquiescence bias, and the other represents the evaluative bias that is present in all self-ratings of personality (Anusic et al., 2009). As all factors are independent, the squared coefficients can be interpreted as the amount of variance that a factor explains in a scale score.
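Because the factors are independent, this variance decomposition is simple arithmetic. A sketch using the loadings of the 3-item neuroticism scale (N-BFI-S) reported in Table 1:

```python
# Loadings of the N-BFI-S scale score on the seven independent factors
# (values from Table 1 of this post)
loadings = {"N": .77, "E": -.13, "O": -.05, "A": .07,
            "C": -.04, "EVB": -.29, "ACQ": .07}

# With independent factors, squared loadings are variance proportions
explained = {k: round(v ** 2, 3) for k, v in loadings.items()}
construct_variance = explained["N"]    # ~.59 of the scale variance is neuroticism
bias_variance = explained["EVB"]       # ~.08 is evaluative bias
random_error = 1 - sum(v ** 2 for v in loadings.values())  # remainder, ~.29
```

So roughly 59% of the scale-score variance reflects the intended construct, 8% reflects evaluative bias, and the rest is mostly random measurement error.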

The results show that the BFI-S scales are nearly as valid as the longer BFI scales (Table 1).

Scale | #Items | N | E | O | A | C | EVB | ACQ
N-BFI | 8 | 0.79 | -0.08 | -0.01 | -0.05 | -0.02 | -0.42 | 0.05
N-BFI-S | 3 | 0.77 | -0.13 | -0.05 | 0.07 | -0.04 | -0.29 | 0.07
E-BFI | 8 | -0.02 | 0.83 | 0.04 | -0.05 | 0.00 | 0.44 | 0.06
E-BFI-S | 3 | 0.05 | 0.82 | 0.00 | 0.04 | -0.07 | 0.32 | 0.07
O-BFI | 10 | 0.04 | -0.03 | 0.76 | -0.04 | -0.05 | 0.36 | 0.19
O-BFI-S | 3 | 0.09 | 0.00 | 0.66 | -0.04 | -0.10 | 0.32 | 0.25
A-BFI | 9 | -0.07 | 0.00 | -0.07 | 0.78 | 0.03 | 0.44 | 0.04
A-BFI-S | 3 | -0.03 | -0.06 | 0.00 | 0.75 | 0.00 | 0.33 | 0.09
C-BFI | 9 | -0.05 | 0.00 | -0.05 | 0.04 | 0.82 | 0.42 | 0.03
C-BFI-S | 3 | -0.09 | 0.00 | -0.02 | 0.00 | 0.75 | 0.44 | 0.06

For example, the factor-scale correlations for neuroticism, extraversion, and agreeableness are nearly identical. The biggest difference was observed for openness, with a correlation of r = .76 for the BFI scale and r = .66 for the BFI-S scale. The only other notable systematic variance in the scales is the evaluative bias influence, which tends to be stronger for the longer scales, with the exception of conscientiousness. In the future, measurement models with an evaluative bias factor can be used to select items with low loadings on this factor to reduce its influence on scale scores. Given these results, one would expect the BFI and BFI-S to produce similar results. The next analyses tested this prediction.

Gender Differences

I examined gender differences three ways. First, I examined standardized mean differences at the level of latent factors in a model with scalar invariance (Schimmack, 2019d). Second, I computed standardized mean differences with the BFI scales. Finally, I computed standardized mean differences with the BFI-S scales. Table 2 shows the results. Results for the BFI and BFI-S scales are very similar. The latent mean differences show somewhat larger differences for neuroticism and agreeableness because these mean differences are not attenuated by random measurement error. The latent means also show very small gender differences for the method factors. Thus, mean differences based on scale scores are not biased by method variance.

Table 2. Standardized Mean Differences between Men and Women

Measure | N | E | O | A | C | EVB | ACQ
Factor | 0.64 | 0.17 | -0.18 | 0.31 | 0.15 | 0.09 | 0.16
BFI | 0.45 | 0.14 | -0.10 | 0.20 | 0.14
BFI-S | 0.48 | 0.21 | -0.03 | 0.18 | 0.12

Note. Positive values indicate higher means for women than for men.

In short, there is no evidence that using 3-item scales invalidates the study of gender differences.
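The standardized mean differences in Table 2 are Cohen's d values. For scale scores they can be computed directly; a minimal sketch with made-up scores (not the actual data):

```python
import statistics

def cohens_d(x, y):
    """Standardized mean difference (Cohen's d) with a pooled SD."""
    nx, ny = len(x), len(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    pooled_sd = (((nx - 1) * var_x + (ny - 1) * var_y) / (nx + ny - 2)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / pooled_sd

# Hypothetical neuroticism scale scores for women and men
women = [3.2, 3.8, 4.1, 3.5, 3.9]
men = [2.9, 3.1, 3.6, 3.0, 3.4]
d = cohens_d(women, men)  # positive values: higher means for women
```

For the latent means, the same metric is obtained from the model-implied factor means and variances rather than from observed scores.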

Age Differences

I demonstrated measurement invariance for different age groups (Schimmack, 2019d). Thus, I used simple correlations to examine the relationship between age and the Big Five. I restricted the age range from 17 to 70. Analyses of the full dataset suggest that older respondents have higher levels of conscientiousness and agreeableness (Soto, John, Gosling, & Potter, 2011).

Table 3 shows the results. The BFI and the BFI-S both show the predicted positive relationship with conscientiousness, and the effect size is practically identical. The effect size for the latent variable model is stronger because the relationship is not attenuated by random measurement error. Other relationships are weaker and also consistent across measures, except for openness. The latent variable model reveals the reason for the discrepancy: three items (#15 ingenious, #35 like routine work, and #10 sophisticated in art) showed unique relationships with age. The latent factor does not include the unique content of these items and shows a positive relationship between openness and age; the scale scores include this content and show a weaker relationship. The positive relationship of openness with age for the latent factor is rather surprising, as it is not found in nationally representative samples (Schimmack, 2019b). One possible explanation is that older individuals who take an online personality test are more open.

Table 3. Correlations with Age

Measure | N | E | O | A | C | EVB | ACQ
Factor | -0.08 | -0.02 | 0.18 | 0.12 | 0.33 | 0.01 | -0.11
BFI | -0.08 | -0.01 | 0.08 | 0.09 | 0.26
BFI-S | -0.08 | -0.04 | -0.02 | 0.08 | 0.25

In sum, the most important finding is that the 3-item BFI-S conscientiousness scale shows the same relationship with age as the BFI-scale and the latent factor. Thus, the failure to find aging effects in the longitudinal SOEP data with the BFI-S cannot be attributed to the use of an invalid short measure of conscientiousness. The real scientific question is why the cross-sectional study by Soto et al. (2011) and my analysis of the longitudinal SOEP data show divergent results.
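The attenuation logic behind these comparisons can be made explicit. If the bias factors are unrelated to age (an assumption), the expected scale-level correlation is simply the latent correlation multiplied by the factor-scale correlation from Table 1:

```python
r_latent = 0.33       # age correlation of the latent C factor (Table 3)
validity_bfi = 0.82   # correlation of the C-BFI scale with the latent factor
validity_bfis = 0.75  # correlation of the C-BFI-S scale with the latent factor

# Expected observed correlations after attenuation by imperfect validity
r_bfi_expected = r_latent * validity_bfi    # ~.27; observed: .26
r_bfis_expected = r_latent * validity_bfis  # ~.25; observed: .25
```

The predicted values closely match the observed scale-level correlations, which is what one expects if the invalid variance in the scale scores is unrelated to age.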

Conclusion

Science has changed now that researchers are able to communicate and discuss research findings on social media. I strongly believe that open science outside of peer-controlled journals is beneficial for the advancement of science. However, the downside of open science on social media is that it becomes more difficult to evaluate the expertise of online commentators. True experts are able to back up their claims with scientific evidence. This is what I did here. I showed that Brenton Wiernik's comment has as much scientific validity as a Donald Trump tweet. Whatever the reason for the lack of personality change in the SOEP data may be, it is not the use of the BFI-S to measure the Big Five.

Personality Measurement with the Big Five Inventory

In one of the worst psychometric articles ever published (although the authors still have a chance to retract their in-press article before it is actually published), Hussey and Hughes argue that personality psychologists intentionally fail to test the validity of personality measures. They call this practice validity-hacking. They also conduct some psychometric tests of popular personality measures and claim that these measures fail to demonstrate structural validity.

I have demonstrated that this claim is blatantly false and that the authors failed to conduct a proper test of structural validity (Schimmack, 2019a). That is, the authors fitted a model to the data that is known to be false. Not surprisingly, they found that their model didn't meet standard criteria of model fit. This is exactly what should happen when a false model is subjected to a test of structural validity. Bad models should not fit the data. However, a real test of structural validity requires fitting a plausible model to the data. I already demonstrated with several Big Five measures that these measures have good structural validity and that scale scores can be used as reasonable measures of the latent constructs (Schimmack, 2019b). Here I examine the structural validity of the Big Five Inventory (Oliver John) that was used by Hussey and Hughes.

While I am still waiting to receive the actual data that were used by Hussey and Hughes, I obtained a much larger and better dataset from Sam Gosling that includes data from 1 million visitors to a website that provides personality feedback (https://www.outofservice.com/bigfive/).

For the present analyses I focused on the subgroup of Canadian visitors with complete data (N = 340,000). Subsequent analyses can examine measurement invariance with the US sample and samples from other nations. To examine the structure of the BFI, I fitted a structural equation model. The model has seven factors. Five factors represent the Big Five personality traits. The other two factors represent rating biases: one is an evaluative bias, the other an acquiescence bias. Initially, loadings on the method factors were fixed. This basic model was then modified in three ways. First, item loadings on the evaluative bias factor were relaxed to allow some items to show more or less evaluative bias. Second, secondary loadings were added to allow some items to be influenced by more than one factor. Finally, items of the same construct were allowed to covary to account for similar wording or shared meaning (e.g., the three art items of the openness factor were allowed to covary). The final model and the complete results can be found on OSF (https://osf.io/23k8v/).

Model fit was acceptable, CFI = .953, RMSEA = .030, SRMR = .032. In contrast, fitting a simple structure without method factors produced unacceptable fit for all three fit indices, CFI = .734, RMSEA = .068, SRMR = .110. This shows that the model specification by Hussey and Hughes accounted for the bad fit. It has been known for over 20 years that a simple structure does not fit Big Five data (McCrae et al., 1996). Thus, Hussey and Hughes's claim that the BFI lacks validity is based on an outdated and implausible measurement model.
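For readers who want to check such fit statistics themselves, CFI and RMSEA are simple functions of the model and null-model chi-square values. A sketch with illustrative numbers (not the values from this analysis):

```python
def cfi(chi2, df, chi2_null, df_null):
    """Comparative Fit Index: improvement over the null (independence) model."""
    num = max(chi2 - df, 0)
    denom = max(chi2_null - df_null, chi2 - df, 0)
    return 1 - num / denom if denom > 0 else 1.0

def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation."""
    return (max(chi2 - df, 0) / (df * (n - 1))) ** 0.5

# Illustrative values: a model with chi2 = 300 on df = 150 in a sample of
# N = 1,000, against a null model with chi2 = 5,000 on df = 190
fit_cfi = cfi(300, 150, 5000, 190)  # ~.969
fit_rmsea = rmsea(300, 150, 1000)   # ~.032
```

Both indices penalize misfit relative to model complexity, which is why a badly misspecified simple-structure model shows a large drop in CFI and a large rise in RMSEA.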

Table 1 shows the factor loading pattern for the 44 BFI items on the Big Five factors and the two method factors. It also shows the contribution of the seven factors to the scale scores that are used to provide visitors with personality feedback and in many research articles that use scale scores as proxies for the latent constructs.

Item (#): loadings on N, E, O, A, C, EVB, ACQ
Neuroticism
depressed/blue (4): 0.33, -0.15, 0.20, -0.48, 0.06
relaxed (9): -0.72, 0.23, 0.18
tense (14): 0.51, -0.25, 0.20
worry (19): 0.60, -0.08, 0.07, -0.21, 0.17
emotionally stable (24): -0.61, 0.27, 0.18
moody (29): 0.43, -0.33, 0.18
calm (34): -0.58, -0.04, -0.14, -0.12, 0.25, 0.20
nervous (39): 0.52, -0.25, 0.17
SUM: 0.79, -0.08, -0.01, -0.05, -0.02, 0.42, 0.05
Extraversion
talkative (1): 0.13, 0.70, -0.07, 0.23, 0.18
reserved (6): -0.58, 0.09, -0.21, 0.18
full of energy (11): 0.34, -0.11, 0.58, 0.20
generate enthusiasm (16): 0.07, 0.44, 0.11, 0.50, 0.20
quiet (21): -0.81, 0.04, -0.21, 0.17
assertive (26): -0.09, 0.40, 0.14, -0.24, 0.18, 0.24, 0.19
shy and inhibited (31): 0.18, 0.64, -0.22, 0.17
outgoing (36): 0.72, 0.09, 0.35, 0.18
SUM: -0.02, 0.83, 0.04, -0.05, 0.00, 0.44, 0.06
Openness
original (5): 0.53, -0.11, 0.38, 0.21
curious (10): 0.41, -0.07, 0.31, 0.24
ingenious (15): 0.57, 0.09, 0.21
active imagination (20): 0.13, 0.53, -0.17, 0.27, 0.21
inventive (25): -0.09, 0.54, -0.10, 0.34, 0.20
value art (30): 0.12, 0.46, 0.09, 0.16, 0.18
like routine work (35): -0.28, 0.10, 0.13, -0.21, 0.17
like reflecting (40): -0.08, 0.58, 0.27, 0.21
few artistic interests (41): -0.26, -0.09, 0.15
sophisticated in art (44): 0.07, 0.44, -0.06, 0.10, 0.16
SUM: 0.04, -0.03, 0.76, -0.04, -0.05, 0.36, 0.19
Agreeableness
find faults w. others (2): 0.15, -0.42, -0.24, 0.19
helpful / unselfish (7): 0.44, 0.10, 0.29, 0.23
start quarrels (12): 0.13, 0.20, -0.50, -0.09, -0.24, 0.19
forgiving (17): 0.47, -0.14, 0.24, 0.19
trusting (22): 0.15, 0.33, 0.26, 0.20
cold and aloof (27): -0.19, 0.14, -0.46, -0.35, 0.17
considerate and kind (32): 0.04, 0.62, 0.29, 0.23
rude (37): 0.09, 0.12, -0.63, -0.13, -0.23, 0.18
like to cooperate (42): 0.15, -0.10, 0.44, 0.28, 0.22
SUM: -0.07, 0.00, -0.07, 0.78, 0.03, 0.44, 0.04
Conscientiousness
thorough job (3): 0.59, 0.28, 0.22
careless (8): -0.17, -0.51, -0.23, 0.18
reliable worker (13): -0.09, 0.09, 0.55, 0.30, 0.24
disorganized (18): 0.15, -0.59, -0.20, 0.16
lazy (23): -0.52, -0.45, 0.17
persevere until finished (28): 0.56, 0.26, 0.20
efficient (33): -0.09, 0.56, 0.30, 0.23
follow plans (38): 0.10, -0.06, 0.46, 0.26, 0.20
easily distracted (43): 0.19, 0.09, -0.52, -0.22, 0.17
SUM: -0.05, 0.00, -0.05, 0.04, 0.82, 0.42, 0.03

Most of the secondary loadings are very small, although they are statistically highly significant in this large sample. Most items also have their highest loading on the primary factor. Exceptions are the items depressed/blue, full of energy, and generate enthusiasm, which have higher loadings on the evaluative bias factor. Except for two openness items, all items also have loadings greater than .3 on the primary factor. Thus, the loadings are consistent with the intended factor structure.

The most important results are the loadings of the scale scores on the latent factors. As the factors are all independent, squaring these coefficients shows the amount of variance explained by each factor. By far the largest variance component is the intended construct, with correlations ranging from .76 for openness to .83 for extraversion. Thus, the lion's share of the reliable variance in scale scores reflects the intended construct. The next biggest contributor is evaluative bias, with correlations ranging from .36 for openness to .44 for extraversion. Although this means only 15 to 20 percent of the total variance in scale scores reflects evaluative bias, this systematic variance can produce spurious correlations when scale scores are used to predict other self-report measures (e.g., life satisfaction, Schimmack, 2019c).

In sum, a careful psychometric evaluation of the BFI shows that the BFI has good structural validity. The key problem is the presence of evaluative bias in scale scores. Although this requires caution in the interpretation of results obtained with BFI scales, it doesn’t justify the conclusion that the BFI is invalid.

Measurement Invariance

Hussey and Hughes also examined measurement invariance across age groups and the two largest gender groups. They claimed that the BFI lacks measurement invariance, but this claim was based on a cunning misrepresentation of the results (Schimmack, 2019a). The claim rests on the fact that the simple-structure model does not fit in any group. However, fit did not decrease when measurement invariance was imposed across groups. Thus, all groups showed the same structure, but this fact was hidden in the supplementary results.

I replicated their analyses with the current dataset. First, I fitted the model for the whole sample separately to the male and female samples. Fit for the male sample was acceptable, CFI = .949, RMSEA = .029, SRMR = .033. So was fit for the female sample, CFI = .947, RMSEA = .030, SRMR = .037.

Table 2 shows the results side by side. There are no notable differences between the parameter estimates for males and females (m/f). This finding replicates results with other Big Five measures (Schimmack, 2019a).

Item (#): male/female loadings on N, E, O, A, C, EVB, ACQ
Neuroticism
depressed/blue (4): .33/.30, -.18/-.11, .19/.20, -.45/-.50, .07/.05
relaxed (9): -.71/-.72, .24/.23, .19/.18
tense (14): .52/.49, -.17/-.14, .11/.13, -.27/-.32, .20/.20
worry (19): .58/.57, -.10/-.08, .05/.07, -.22/-.22, .17/.17
emotionally stable (24): -.58/-.58, .10/.06, .25/.30, .19/.17
moody (29): .41/.38, -.26/-.25, -.30/-.38, .18/.18
calm (34): -.55/-.59, -.02/-.03, .14/.13, .12/.13, -.27/-.24, .21/.19
nervous (39): .51/.49, -.21/.26, -.10/-.10, .08/.08, -.11/-.11, -.27/-.25, .18/.17
SUM: .78/.77, -.09/-.08, -.01/-.01, -.07/-.05, -.02/-.02, -.42/-.46, .05/.04
Extraversion
talkative (1): .09/.11, .69/.70, -.10/-.08, .24/.24, .19/.18
reserved (6): -.55/-.60, .08/.10, .21/.22, .19/.18
full of energy (11): .33/.32, -.09/-.04, .56/.59, .21/.20
generate enthusiasm (16): .04/.03, .44/.43, .12/.13, .48/.50, .20/.20
quiet (21): -.79/-.82, .03/.04, -.22/-.21, .17/.16
assertive (26): -.08/-.10, .39/.40, .12/.14, -.23/-.25, .18/.17, .26/.24, .20/.18
shy and inhibited (31): .19/.15, .61/.66, .23/.22, .18/.17
outgoing (36): .71/.71, .10/.07, .35/.38, .18/.18
SUM: -.02/-.02, .82/.82, .04/.05, -.04/-.06, .00/.00, .45/.44, .07/.06
Openness
original (5): .50/.54, -.12/-.12, .40/.39, .22/.20
curious (10): .40/.42, -.05/-.08, .32/.30, .25/.23
ingenious (15): .00/.00, .60/.56, .18/.16, .10/.04, .22/.20
active imagination (20): .50/.55, -.07/-.06, -.17/-.18, .29/.26, .23/.21
inventive (25): -.07/-.08, .51/.55, -.12/-.10, .37/.34, .21/.19
value art (30): .10/.03, .43/.52, .08/.07, .17/.14, .18/.19
like routine work (35): -.27/-.27, .10/.10, .09/.15, -.22/-.21, .17/.16
like reflecting (40): -.09/-.08, .58/.58, .28/.26, .22/.20
few artistic interests (41): -.25/-.29, -.10/-.09, .16/.15
sophisticated in art (44): .03/.00, .42/.49, -.08/-.08, .09/.09, .16/.16
SUM: .01/-.01, -.01/-.01, .74/.78, -.05/-.05, -.03/-.06, .38/.34, .20/.19
Agreeableness
find faults w. others (2): .14/.17, -.42/-.42, -.24/-.24, .19/.19
helpful / unselfish (7): .45/.43, .09/.11, .29/.29, .23/.23
start quarrels (12): .12/.16, .23/.18, -.49/-.49, -.07/-.08, -.24/-.24, .19/.19
forgiving (17): .49/.46, -.14/-.13, .25/.24, .20/.19
trusting (22): -.14/-.16, .38/.32, .27/.25, .21/.19
cold and aloof (27): -.20/-.18, .14/.12, .44/.46, -.34/-.37, .18/.17
considerate and kind (32): .02/.01, .62/.61, .28/.30, .22/.23
rude (37): .10/.12, .12/.12, -.62/-.62, -.13/-.08, -.23/-.23, .19/.18
like to cooperate (42): .18/.11, -.09/-.10, .43/.45, .28/.29, .23/.22
SUM: -.07/-.08, .00/.00, -.07/-.07, .78/.77, .03/.03, .43/.44, .04/.04
Conscientiousness
thorough job (3): .58/.59, .29/.28, .23/.22
careless (8): -.16, -.49/-.51, .24/.23, .19/.18
reliable worker (13): -.10/-.09, .09/.10, .55/.55, .30/.31, .24/.24
disorganized (18): .13/.16, -.58/-.59, -.21/-.20, .17/.15
lazy (23): -.52/-.51, -.45/-.45, .18/.17
persevere until finished (28): .54/.58, .27/.25, .21/.19
efficient (33): -.11/-.07, .52/.58, .30/.29, .24/.23
follow plans (38): .00/.00, -.06/-.07, .45/.44, .27/.26, .21/.20
easily distracted (43): .17/.19, .07/.06, -.53/-.53, -.22/-.22, .18/.17
SUM: -.05/-.05, -.01/-.01, -.05/-.06, .04/.04, .81/.82, .43/.41, .03/.03

I then fitted a multi-group model with metric invariance. Despite the high similarity between the individual models, model fit decreased, CFI = .925, RMSEA = .033, SRMR = .062. Although RMSEA and SRMR were still good, the decrease in fit might be considered evidence that the invariance assumption is violated. However, Table 2 shows that it is insufficient to examine changes in global fit indices; what matters is whether the decrease in fit has any substantive meaning. Given the results in Table 2, this is not the case.

The next model imposed scalar invariance. Before presenting the results, it is helpful to consider what scalar invariance implies. Take extraversion as an example, and assume that there are no notable gender differences in extraversion. Extraversion has multiple facets that are represented by items in the BFI; one facet is assertiveness, and the BFI includes an assertiveness item. Scalar invariance implies that there cannot be gender differences in assertiveness if there are no gender differences in extraversion. This is an odd assumption because gender differences can occur at any level in the hierarchy of personality traits. Thus, evidence that scalar invariance is violated does not imply that we cannot examine gender differences in personality. Rather, it requires further examination of the pattern of mean differences at the level of the factors and the item residuals.

However, imposing scalar invariance produced only a trivial decrease in fit, CFI = .921, RMSEA = .034, SRMR = .063. Inspection of the modification indices showed the highest modification index for item O6 "valuing art" with an implied mean difference of 0.058. This implies that there are no notable gender differences at the item level. The pattern of mean differences at the factor level is consistent with previous studies, showing higher levels of neuroticism (d = .64) and agreeableness (d = .31) for women, although the difference in agreeableness is relatively small compared to some other studies.

In sum, the results show that the BFI can be used to examine gender differences in personality and that the pattern of gender differences observed with the BFI is not a measurement artifact.

Age Differences

Hussey and Hughes used a median split to examine invariance across age-groups. The problem with a median split is that online samples tend to be very young. Figure 1 shows the age distribution for the Canadian sample. The median age is 22.

To create two age-groups, I split the sample into a group of under 30 and 30+ participants. The unequal sample size is not a problem because both groups are large given the large overall sample size (young N = 221,801, old N = 88,713). A published article examined age differences in the full sample, but the article did not use SEM to test measurement invariance (Soto, John, Gosling, & Potter, 2011). Given the cross-sectional nature of the data, it is not clear whether age differences are cohort differences or aging effects. Longitudinal studies suggest that age differences may reflect generational changes rather than longitudinal changes over time (Schimmack, 2019d). In any case, the main point of the present analyses is to examine measurement invariance across different age groups.

Fit for the model with metric invariance was similar to the fit for the gender model, CFI = .927, RMSEA = .033, SRMR = .062. Fit for the model with scalar invariance was only slightly weaker for CFI and better for RMSEA. More important, inspection of the modification indices showed the largest difference for O10 “sophisticated in art” with a standardized mean difference of .068. Thus, there were no notable differences between the two age groups at the item level.

The results at the factor level reproduced the findings with scale scores by Soto et al. (2011). The older group had a higher level of conscientiousness (d = .61) than the younger group. Differences for the other personality dimensions were small. There were no notable differences in response styles.

In sum, the results show that the BFI shows reasonable measurement invariance across age groups. Contrary to the claims by Hussey and Hughes, this finding is consistent with the results reported in their own supplementary materials. These results suggest that BFI scale scores provide useful information about personality and that published articles that used scale scores produced meaningful results.

Conclusion

Hussey and Hughes accused personality researchers of validity hacking; that is, of not reporting the results of psychometric tests because these tests would show that personality measures are invalid. This is a strong claim that requires strong evidence. However, closer inspection of this claim shows that the authors used an outdated measurement model and misrepresented the results of their invariance analyses. Here I showed that the BFI has good structural validity and shows reasonable invariance across gender and age groups. Thus, Hussey and Hughes's claims are blatantly false.

So far, I have only examined the BFI, but I have little confidence in the authors' conclusions about other measures, such as Rosenberg's self-esteem scale. I am still waiting for the authors to share all of their data so that I can examine all of their claims. At present, there is no evidence of v-hacking. Of course, this does not mean that self-ratings of personality are perfectly valid. As I showed, self-ratings of the Big Five are contaminated with evaluative bias. I presented a measurement model that can test for the presence of these biases and that can be used to control for rating biases. Future validation studies might benefit from using this measurement model as a basis for developing better measures and better measurement models. Substantive articles might also benefit from using a measurement model rather than scale scores, especially when the BFI is used as a predictor of other self-report measures, to control for shared rating biases.

Open-SOEP: Personality and Wellbeing Revisited

[corrected 8/6/2019 5.29pm – there was a mistake in the model for worry]

After behaviorism banned emotions as scientific constructs and cognitivism viewed humans as computers, the 1980s witnessed the affective revolution. Finally, psychologists were again allowed to study feelings.

The 1980s also were a time where personality psychologists agreed on the Big Five as a unified model of personality traits. Accordingly, personality can be efficiently summarized by individuals’ standing on five dimensions: Neuroticism, Extraversion, Openness, Agreeableness, and Conscientiousness.

Not surprisingly, the 1980s also produced a model of personality, emotions (affect), and well-being that has survived until today. The model was first proposed by Costa and McCrae in 1980 (see Schimmack, 2019, for details). This model assumed that extraversion is a disposition to experience more positive affect, neuroticism is a disposition to experience more negative affect, and the balance of positive and negative affect is a major determinant of life-satisfaction. As extraversion and neuroticism are independent dimensions, the model also assumed that positive affect and negative affect are independent, which led to the creation of the widely used Positive Affect and Negative Affect Schedule (Watson et al., 1988) as a measure of well-being.

The model also assumed that general affective dispositions account for most of the stability in well-being over time, while environmental factors produce only momentary and short-lived fluctuations around dispositional levels of well-being (Diener, 1984; Lykken & Tellegen, 1996). This model dominated well-being research in psychology for 20 years (see Diener, Suh, Lucas, & Smith, 1999, for a review).

However, when Positive Psychology emerged at the beginning of the new millennium, psychologists' focus shifted from the influence of stable dispositions to factors that could be changed with interventions to boost individuals' wellbeing (Seligman & Csikszentmihalyi, 2000), and some articles even questioned the influence of dispositions on well-being (Diener, Lucas, & Scollon, 1996). As a result, the past 20 years have seen very little new research on dispositional influences on well-being. The last major article is a meta-analysis that showed positive correlations of extraversion and neuroticism with several well-being indicators (Steel, Schmidt, & Shultz, 2008).

Revisiting the Evidence

There is robust evidence for the influence of neuroticism on wellbeing. Most important, this relationship has been demonstrated in multi-method studies that control for the shared method variance that arises when self-ratings of personality are correlated with self-ratings of well-being (McCrae & Costa, 1991; Schimmack, Oishi, Funder, & Furr, 2004). However, the relationship between extraversion and well-being is not as strong or consistent as one would expect based on Costa and McCrae's (1980) model. For example, McCrae and Costa failed to find evidence for this relationship in a multi-method study, and other studies that controlled for response styles also failed to find the predicted effect (Schimmack, Schupp, & Wagner, 2008).

Taking a closer look at Costa and McCrae’s (1980) article, we see that they did not include life-satisfaction measures in their study. The key empirical finding supporting their model is that extraversion facets like sociability measured at time 1 predict positive affect and hedonic balance (positive affect minus negative affect) concurrently and longitudinally and that these correlations remain fairly stable over time. This suggests that personality is stable and contributes to the stable variance in the affect measures. However, the effect size is small (r = .22 to .24). This suggests that extraversion accounts for about 5% of the variance in affect. This finding hardly supports the claim that extraversion accounts for half of the stable variance in well-being.
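The arithmetic behind this point is simple: the proportion of variance a single predictor explains is the squared correlation. A quick check in Python, using the correlations cited above:

```python
# Variance explained equals the squared correlation.
# r values are the ones reported by Costa and McCrae (1980), as cited above.
for r in (0.22, 0.24):
    print(f"r = {r:.2f} -> r^2 = {r**2:.4f} ({r**2:.1%} of the variance)")
```

Squaring .22 to .24 yields roughly .05 to .06, i.e., about 5% of the variance.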

It is symptomatic of psychology that subsequent articles ran with the story while ignoring gaps in the actual empirical evidence. As longitudinal studies in psychology are rare, there have been few attempts to replicate Costa and McCrae’s findings.

Headey and Wearing (1989) replicated and extended Costa and McCrae’s study by including life-satisfaction measures as an indicator of well-being. They replicated the key findings and showed that personality also predicts future life-satisfaction. However, the effect size for extraversion was again fairly small, as was the effect of neuroticism, suggesting that most of the stable variance in life-satisfaction is not explained by extraversion and neuroticism.

A key limitation of both studies is that they do not take shared method variance into account. Although method variance may be transient, it is also possible that it is stable over time (Anusic et al., 2009). Thus, even the already modest effect sizes may still be inflated by shared method variance.

New Evidence

Data and Model

Fortunately, better data are now available to revisit the longitudinal relationships between personality and life-satisfaction. I used the data from the German Socio-Economic Panel (SOEP). The SOEP measured the Big Five personality traits on four occasions (waves) spanning a period of 12 years (2005, 2009, 2013, 2017). Personality was measured with the 15-item BFI-S. I created a measurement model for the BFI-S that shows measurement invariance across the four occasions (Schimmack, 2019a). I also related personality to the single-item life-satisfaction rating in the SOEP (Schimmack, 2019b). Here, I extend this analysis by taking advantage of the fourth measurement of personality in 2017, which makes it possible to separate trait and state variance in personality and well-being.

The SOEP measures life-satisfaction in two ways. First, it includes several domain-satisfaction items (health, finances, recreation, housing). Second, it includes a global life-satisfaction item. In a different post (Schimmack, 2019c), I examined the relationship between these items and found that global items are influenced by a general disposition factor and satisfaction with finances and health, while the other two domains are relatively unimportant. Based on this finding and related evidence (Zou, Schimmack, & Gere, 2013), I averaged the domain-satisfaction judgments and used the average as an indicator of life-satisfaction. This makes it possible to remove random measurement error from the measurement of life-satisfaction on a single occasion. I then fitted latent trait-state (LST) models to the personality factors and the well-being factor. These models separate the longitudinal correlations into two components: a stable trait component and a changing state component. A third parameter estimates how stable the state variance is over time.

There are several ways to relate personality to life-satisfaction in this model. I chose to regress life-satisfaction on each occasion on the personality factors measured on the same occasion. The model indirect function can then be used to examine how much of the variance is due to stable personality traits and how much is due to personality states.
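The logic of the trait-state decomposition can be made concrete with a small sketch. In the standard LST parameterization, the implied retest correlation of the reliable variance over a lag of k years is the trait proportion plus the state proportion discounted by the annual stability of the state. The Python function below is an illustration, not the Mplus model itself; the example values are close to the estimates for neuroticism reported in Table 1 below.

```python
def implied_retest(p_trait: float, phi: float, k: int) -> float:
    """Implied retest correlation of the reliable variance.

    p_trait: proportion of reliable variance that is stable trait variance
    phi:     annual (auto-regressive) stability of the state component
    k:       lag in years
    """
    p_state = 1.0 - p_trait
    return p_trait + p_state * phi**k

# Illustrative values, close to the neuroticism estimates below.
for k in (1, 4, 8, 12):
    print(k, round(implied_retest(0.69, 0.79, k), 3))
```

The retest correlation asymptotes at the trait proportion as the lag grows, because the state component decays geometrically while the trait component never changes.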

The availability of four waves of data also makes it possible to model stability of the residual variances in personality items. Typically, these residuals are allowed to correlate to allow for item-specific stability, but the use of correlated residuals makes it impossible to relate this variance to other constructs. With four waves, it is possible to fit an LST model to item-residuals. Exploration of the data showed that the neuroticism item “worry” showed consistent relationships with well-being. Thus, I fitted an LST model to this item and allowed for an influence of worry on life-satisfaction.

The syntax and the complete results are posted on OSF (SOEP.4W.B5.DSX.LS).

Results

Overall model fit was acceptable, CFI = .967, RMSEA = .019, SRMR = .030.

Trait Variance and Stability of State Variance

Table 1 shows the amount of trait variance and the stability of state variance in the personality predictor variables. A more detailed discussion of the implications of these results for personality research can be found elsewhere (Schimmack, 2019a). The results for the Big Five serve as a comparison for the trait variance in life-satisfaction.

                    Trait   Stability   1Y-Stability
Neuroticism         0.69    0.38        0.79           0.56
Extraversion        0.74    0.38        0.78           0.51
Openness            0.71    0.34        0.76           0.53
Agreeableness       0.68    0.20        0.67           0.57
Conscientiousness   0.60    0.29        0.73           0.64
Halo                0.60    0.36        0.78           0.63
Acquiescence        0.34    0.11        0.58           0.81
Worry               0.64    0.52        0.85           0.60

Table 2 shows how life-satisfaction at each time point is related to the personality predictors. For model identification purposes, it is necessary to fix one relationship to zero. I used openness because meta-analyses show that it is the weakest predictor of life-satisfaction (Steel et al., 2008). I did not impose constraints across the four waves.

                    LS-T1   LS-T2   LS-T3   LS-T4
Neuroticism         -0.29   -0.27   -0.27   -0.26
Extraversion         0.08    0.07    0.08    0.09
Openness            (fixed to 0)
Agreeableness        0.08    0.05    0.04    0.03
Conscientiousness    0.04    0.04    0.04    0.04
Halo                 0.19    0.28    0.24    0.26
Acquiescence         0.18    0.08    0.12    0.16
Worry               -0.35   -0.34   -0.35   -0.33

The results show that, out of the Big Five, neuroticism is the only notable predictor of life-satisfaction with a moderate effect size (r = -.26 to -.29). A notable finding is that extraversion is a weak predictor of life-satisfaction (r = .07 to .09). This finding is inconsistent with Costa and McCrae’s (1980) model. The results for agreeableness and conscientiousness are also weak. This finding is inconsistent with meta-analyses and with McCrae and Costa’s (1991) suggestion that high agreeableness and conscientiousness are also instrumental for higher life-satisfaction. Both halo and acquiescence bias are stronger predictors of life-satisfaction judgments than extraversion, agreeableness, and conscientiousness. Another notable finding is that the worry facet of neuroticism is the strongest personality predictor, even stronger than the neuroticism factor (rs = -.33 to -.35). This finding is consistent with previous studies showing that facets of neuroticism and extraversion are better predictors of life-satisfaction than the global factors (Schimmack, Oishi, Funder, & Furr, 2004).

Table 3 shows how much of the variance in life-satisfaction is explained by trait factors that remain stable over time.

                    LS-T1   LS-T2   LS-T3   LS-T4
Neuroticism          0.05    0.05    0.05    0.05
Extraversion         0.00    0.00    0.00    0.01
Openness            (fixed to 0)
Agreeableness        0.00    0.00    0.00    0.00
Conscientiousness    0.00    0.00    0.00    0.00
Halo                 0.02    0.04    0.03    0.04
Acquiescence         0.01    0.00    0.00    0.01
Worry                0.08    0.08    0.08    0.07
Unexplained          0.38    0.38    0.38    0.38
Total                0.55    0.56    0.56    0.55

Given the weak effects of extraversion, agreeableness, and conscientiousness, it is not surprising that these Big Five traits explain less than 1% of the variance in life-satisfaction judgments. The only notable predictor is neuroticism, which explains about 5% of the variance. In addition, the worry facet of neuroticism is an even stronger predictor of trait variance in life-satisfaction. This finding shows that more specific traits below the Big Five add to the prediction of life-satisfaction (Schimmack, Oishi, Funder, & Furr, 2004). Halo adds only 2% to 4%, and acquiescence at most 1%. By far the largest portion of the trait variance, 38%, remains unexplained. Combined, this implies that approximately half of the variance in life-satisfaction is trait variance. This finding is consistent with estimates in a meta-analysis and other analyses of the SOEP data (Anusic & Schimmack, 2016; Schimmack, Krupp, Wagner, & Schupp, 2010). The estimate of 55% trait variance is also smaller than the estimate of 70% trait variance in the Big Five personality traits. This finding is also consistent with meta-analytic comparisons of personality and well-being measures (Anusic & Schimmack, 2016).

Table 4 shows the results for the state predictors of life-satisfaction. Once more, extraversion, agreeableness, and conscientiousness predict less than 1% of the variance. This time, neuroticism and worry are also relatively weak predictors because most of the relationship for these traits stems from the stable component. Still, the results suggest that some changes in neuroticism and worry are related to changes in life-satisfaction. However, most of the state variance in life-satisfaction is not explained by the personality predictors (33% out of 44%).

                    LS-T1   LS-T2   LS-T3   LS-T4
Neuroticism          0.02    0.02    0.02    0.02
Extraversion         0.00    0.00    0.00    0.00
Openness            (fixed to 0)
Agreeableness        0.00    0.00    0.00    0.00
Conscientiousness    0.00    0.00    0.00    0.00
Halo                 0.01    0.03    0.02    0.03
Acquiescence         0.02    0.00    0.01    0.02
Worry                0.04    0.04    0.04    0.04
Unexplained          0.33    0.34    0.34    0.34
Total                0.44    0.44    0.44    0.45

Conclusion

These results challenge Costa and McCrae’s (1980) model of personality and well-being in several ways. First, extraversion is not a strong predictor of the stable variance in life-satisfaction. Second, even the influence of neuroticism accounts for only about 10% of the stable trait variance in life-satisfaction. Adding other Big Five predictors also does not help because they have negligible relationships with life-satisfaction. Thus, most of the trait variance in life-satisfaction remains unexplained. It is either explained by more specific personality traits than the Big Five (facets) or by stable environmental factors (e.g., income). The SOEP data provide ample opportunity to look for additional predictors of trait variance. Also, researchers should conduct studies with broader personality questionnaires to find additional predictors of life-satisfaction. Searching for these predictors is an important goal for a field that has stagnated over the past two decades.

Costa and McCrae’s model also underestimated the importance of state factors. State factors are highly stable over fairly long periods of time and account for about half of the reliable variance in life-satisfaction. As the Big Five mostly reflect stable traits, they cannot account for this important variance in life-satisfaction. Schimmack and Lucas (2010) argued that these factors are environmental because changes in life-satisfaction are shared between spouses. Thus, changes in actual life circumstances may contribute to state variance in life-satisfaction. Consistent with this model, spouses were more similar in domains that are shared (housing, income) than in domains that are less shared (health, recreation).

Admittedly, the conclusions are based on a single German sample. As impressive as these data are, it is important to compare results across samples from different populations. At least regarding the influence of extraversion, the present results are consistent with other studies that suggest the influence of extraversion on life-satisfaction is weak (Kim, Schimmack, & Tsutsui, 2019). The idea that extraverts are happier has been exaggerated by Costa and McCrae’s model, even though their own empirical results did not warrant this claim. The reason is that psychologists often ignore effect sizes.

Implications

The present results also have implications for developmental theories of personality. The idea of development implies a process with an ideal outcome. For humans, the outcome is an adult human being with optimal capabilities. A collective of personality psychologists suggested that optimal personality development results in a personality type with optimal personality characteristics. I criticized this idea and argued that there is no such thing as an optimal personality. Just like there is no optimal height as the end-goal of human growth, there is no optimal level of extraversion or conscientiousness. In clinical psychology, the key criterion of mental health is that an intervention is beneficial for a patient's well-being. Thus, we could argue that an optimal personality is a personality that maximizes an individual's well-being. Meta-analyses suggest that extraverted, agreeable, and conscientious people have higher well-being. Thus, it might be beneficial for individuals to become more extraverted, agreeable, and conscientious. However, the present results challenge this view. After removing the evaluative aspect of personality from the Big Five, only neuroticism remains a notable predictor of well-being. Thus, the key personality trait for self-improvement is neuroticism. Not surprisingly, this is also the key aspect that is targeted in self-help books and well-being programs. Until we have a better understanding of the relationship between personality and well-being, it seems premature to propose interventions that aim to change individuals' personality. Just like personality psychologists no longer endorse conversion therapy for sexual orientation, I urge caution in submitting individuals who are carefree and impulsive to a conscientiousness conversion program. You never know when acting on the spur of the moment is the best course of action.

Open-SOEP: No Significant Personality Change over 12 Years

Studying personality stability and change is easy and hard. It is easy because the method is straightforward. Administer a valid measure of personality to a group of participants and repeat the measurement several times. Describing the method takes a sentence or two compared to pages that describe an intricate laboratory experiment with an elaborate deception. It is hard because it requires time, and participants may drop out of a study. Meanwhile, there is nothing to publish while a researcher is waiting for the next retest. In our fast-paced world of academic publishing, where researchers are expected to publish 5 or more articles a year, there is no place for slow research. As a result, evidence on personality change is scarce. The best evidence so far comes from a meta-analysis that patched together many small studies with different measures and populations. Although this meta-analysis is the best evidence available, it should not be treated as conclusive because the underlying evidence is inconclusive.

Psychologists have to thank economists and sociologists, who are used to collaborating on big data collections. One of these collaborations is the German Socio-Economic Panel (SOEP). The SOEP is an ongoing longitudinal study with a representative sample that started in 1984. In 2005, the SOEP included the BFI-S, a 15-item personality measure that assesses the Big Five. Since then, the BFI-S has been administered in four-year intervals: 2009, 2013, and 2017. Thus, we now have longitudinal data spanning 12 years with four waves of data. This makes it possible to revisit the question of personality stability with much better data than a meta-analysis of heterogeneous studies can provide. Admittedly, the results are based on a German sample, but there is little evidence that personality development varies across cultures.

Method

One drawback of the SOEP is that each personality dimension is measured with just three items. This makes scale scores unreliable, and scale scores can be contaminated with method variance (e.g., evaluative bias, acquiescence bias). To avoid these problems and to examine measurement invariance, it is better to analyze the data with a measurement model that examines personality change at the level of latent variables corrected for measurement error. I developed a measurement model for the SOEP (Schimmack, 2019a) and already demonstrated invariance across the first three waves of the SOEP (Schimmack, 2019b). Here, I added the fourth wave of data from 2017 to the dataset to produce even better information about long-term changes in personality.

To analyze the data, I first fitted the measurement model for the BFI-S to the data from each wave and imposed equality constraints to ensure measurement invariance. The longitudinal stability of personality was examined using a latent trait-state (LTS) model that decomposes stability over time into two components: (a) a stable trait component that never changes and (b) a changing state component. The state component allows factors that influence personality to change over time and thereby change personality. These changing factors may produce changes that last a long time or changes that are more temporary. The time course of changes in personality is modeled with an autoregressive parameter that reflects how much of the change at time 1 is still present at time 2.

The LTS model is typically fitted without modeling mean-level changes. However, the model can also be used to model the mean structure in the data. In latent variable models, changes in personality are assumed to occur at the level of the latent traits, while item means (intercepts) are assumed to be constant over time. As the latent trait is stable, it cannot be used to model mean-level changes. One option is to free the means of the state factors. However, the influence of the state factors decreases over time, which is inconsistent with the idea of lasting changes in personality. Thus, a better option is to let the means of the occasion-specific factors vary freely, even if the occasion-specific variance is zero. Although this model may lack realism, it shows the pattern of mean-level changes in the data without imposing a model on the data (e.g., a linear trend).

The model specification and the complete results can be found on OSF (https://osf.io/vpcfd/). The overall model fit was acceptable, CFI = .971, RMSEA = .019, SRMR = .031.

Rank-order Stability and Change

A study of the first three waves in the SOEP replicated earlier findings of high retest stability in personality with stabilities over .9 over a one-year period (Conley, 1984; Schimmack, 2019c). However, three waves are insufficient to separate trait variance from state variance, and few studies with four waves of personality data are available. Anusic and Schimmack (2016) used a meta-analytic approach to do so on the basis of smaller studies. Their model suggested that about 70% of the reliable variance in personality is trait variance and that the remaining 30% state variance is rather unstable, with a low annual stability of .3. This would suggest that any changes in personality do not last long and that individuals quickly revert back to their trait level of personality.

Table 1 shows the results for the SOEP data.

                    Trait   Stability   1Y-Stability
Neuroticism         0.67    0.38        0.79
Extraversion        0.74    0.36        0.77
Openness            0.71    0.38        0.79
Agreeableness       0.69    0.18        0.65
Conscientiousness   0.64    0.24        0.70
Halo                0.64    0.31        0.75
Acquiescence        0.32    0.10        0.56

The results show a similar split between trait and state variance as the meta-analysis, with about two-thirds of the variance being trait variance and one-third being state variance. A new finding is that the halo factor, an evaluative bias in personality ratings, also has 60% trait variance. Thus, this response style can also be considered a stable trait. In contrast, acquiescence bias has less trait variance and seems to be more influenced by momentary factors that are inconsistent over time.

The results for the stability of the state variance differ from the meta-analysis. The SOEP data suggest that changes in personality are more persistent than the meta-analysis suggested. The annual stability estimates are around .7. Thus, any changes that are evident at time 2 would still be evident over the next years. The stability over 4 years is around .3. These results are more encouraging for researchers who are interested in personality change than the meta-analytic results in Anusic and Schimmack (2016). Nevertheless, the relatively small amount of state variance and the high stability of the state variance imply that it takes time to detect even small changes in personality. Not surprisingly, it has been difficult to uncover predictors of personality change even in large samples like the SOEP (Specht et al., 2011).
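The relationship between annual and 4-year state stability is multiplicative, which is easy to verify against the values in Table 1:

```python
# 4-year state stability equals the 1-year stability raised to the 4th power.
print(round(0.79**4, 2))   # SOEP neuroticism: close to the tabled .38
print(round(0.30**4, 3))   # meta-analytic annual stability of .3 implies
                           # almost no remaining state stability after 4 years
```

This is why an annual stability of .7 to .8 and one of .3 paint such different pictures: the former leaves about a third of a change intact after 4 years, the latter essentially none.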

In sum, the results confirm that personality ratings are highly stable over extended periods of time and that a large portion of this stability is caused by stable factors that ensure persistent individual differences in personality over the life span.

Mean Levels

Table 2 shows the results for the mean levels. Means in the first year, 2005, are used as the reference group. The results provide little evidence for personality change in adulthood. None of the Big Five dimensions shows a consistent trend over time. The results for conscientiousness are most important because a meta-analysis suggested that conscientiousness increases substantially throughout adulthood. There is no evidence for such a trend in the SOEP.

         N       E       O       A       C
2005     0.00    0.00    0.00    0.00    0.00
2009    -0.13   -0.12   -0.16   -0.18   -0.07
2013    -0.18   -0.03   -0.04   -0.08   -0.06
2017    -0.16   -0.06   -0.05   -0.11   -0.26

The general pattern of decreases for all five dimensions suggests that acquiescence bias might have changed over time. Thus, I also fitted a model with free means for acquiescence bias, but the results did not change, so acquiescence does not account for the small decrease in the Big Five. Adding free means for the halo factor, instead, reduces the changes for most scales but suggests a stronger decrease in neuroticism. However, the pattern is never a gradual change, but a drop from time 1 to time 2 with no major changes afterwards. This suggests that panel or period effects have small influences on personality ratings, but there is no evidence to support the claim that personality systematically changes throughout adulthood.

Conclusion

Personality research was attacked by situationists who claimed that personality is a mere social construction. In the 1980s, personality researchers had presented evidence that personality traits are real and stable, using twin studies, multi-rater studies, and longitudinal studies. However, two meta-analyses by Roberts and colleagues suggested that personality exists but is less stable than personality psychologists assumed. These meta-analyses had a strong influence on personality psychology in the 2000s. They are featured in personality textbooks and often cited as evidence that personality still develops throughout adulthood. However, more recent evidence is more consistent with the view of personality as mostly stable throughout adulthood. Costa and McCrae famously compared personality to plaster. While it can be shaped and molded early on, it finally sets into a shape that cannot be altered. Yes, there may be cracks here and there, but the overall shape is set. While this image may be too rigid, it is consistent with the evidence that even major life events that occur during adulthood seem to have very little influence on personality (Specht et al., 2011).

The idea of personality change is often coupled with the notion that personality develops and that there can be personal growth in adulthood. The problem with these notions is that they imply that there is a normative or desirable direction of personality change. For example, an increase in conscientiousness is seen as evidence of growing maturity. However, the measurement model that I used distinguishes between the denotative and connotative aspects of personality. Lazy is both descriptive and evaluative. However, evaluations are rooted in cultural norms and values. Why is it good to work as much as possible, to avoid mistakes at any cost? Should education and policies try to increase conscientiousness levels? Is there an optimal level? These are all very difficult questions that go well beyond the existing science of personality. Once we focus on the denotative aspect of personality, we see that some people work harder than others, that some people are more creative than others, and that these differences are fairly stable, without clear evidence about what causes this stability. Just like people differ in personality, they differ in other characteristics that have received more attention. Current culture aims toward greater acceptance of differences in sexual orientation, gender identity, body types, religion, etc. Maybe we should also include personality traits there and let introverts be proud introverts and disagreeable people be proud disagreeable people. Maybe personality differences only exist because they were not a problem during human evolution, or diversity is even an advantage that allows humans as a group to adapt to different circumstances. Thus, the strong evidence of personality stability is not necessarily a problem that needs to be solved, because there is no normal personality. There is only normal variation in personality.

Measuring Personality in the SOEP

The German Socio-Economic Panel (SOEP) is a longitudinal study of German households. The core questions address economic issues, work, health, and well-being. However, additional questions are sometimes added. In 2005, the SOEP included a 15-item measure of the Big Five, the so-called BFI-S (Lang et al., 2011). As each personality dimension is measured with only three items, scale scores are rather unreliable measures of the Big Five. A superior way to examine personality in the SOEP is to build a measurement model that relates observed item scores to latent factors that represent the Big Five.

Anusic et al. (2009) proposed a latent variable model for an English version of the BFI-S.

The most important feature of this model is the modeling of method factors in personality ratings. An acquiescence factor accounts for general response tendencies independent of item content. In addition, a halo factor accounts for evaluative bias that inflates correlations between two desirable or two undesirable items and attenuates correlations between a desirable and an undesirable item. The figure shows that the halo factor reflects bias because it correlates highly with evaluative bias in ratings of intelligence and attractiveness.
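A toy simulation (not the actual SOEP measurement model) can illustrate the halo mechanism: a shared evaluative factor pushes the correlation between two desirable items upward and the correlation between a desirable and an undesirable item downward, even when the underlying traits are uncorrelated. All loadings here are made-up illustrative values.

```python
import random

random.seed(1)
n = 10_000

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

# Two uncorrelated traits plus a shared halo factor that raises
# desirable items and lowers undesirable ones (loading 0.5 is arbitrary).
halo = [random.gauss(0, 1) for _ in range(n)]
t1   = [random.gauss(0, 1) for _ in range(n)]
t2   = [random.gauss(0, 1) for _ in range(n)]

desirable_a   = [t + 0.5 * h for t, h in zip(t1, halo)]
desirable_b   = [t + 0.5 * h for t, h in zip(t2, halo)]
undesirable_b = [-t - 0.5 * h for t, h in zip(t2, halo)]

print(round(corr(desirable_a, desirable_b), 2))    # spuriously positive
print(round(corr(desirable_a, undesirable_b), 2))  # spuriously negative
```

With these loadings, the expected spurious correlation is .25/1.25 = .20 in either direction, even though the traits themselves are independent.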

The model also includes a higher-order factor that accounts for a correlation between extraversion and openness.

Since the article was published, I have modified the model in two ways. First, the Big Five are conceptualized as fully independent, in accordance with the original theory. Rather than allowing for correlations among the Big Five factors, secondary loadings are used to allow for relationships between extraversion and openness items. Second, halo bias is modeled as a characteristic of individual items rather than of the Big Five. This approach is preferable because some items have low loadings on halo.

Figure 2 shows the new model.

I fitted this model to the 2005 data using MPLUS (syntax and output: https://osf.io/vpcfd/ ). The model had acceptable fit to the data, CFI = .962, RMSEA = .035, SRMR = .029.

Table 1 shows the factor loadings. It also shows the correlation of the sum scores with the latent factors.

                   #      N      E      O      A      C     EVB    ACQ
Neuroticism
worried            5    0.49                              -0.02   0.19
nervous           10    0.64                              -0.31   0.18
relaxed           15   -0.55                               0.35   0.21
SUM                     0.75   0.00   0.00   0.00   0.00  -0.30   0.09
Extraversion
talkative          2           0.60                 0.13   0.40   0.23
sociable           8           0.64                        0.37   0.22
reserved          12          -0.52          0.20         -0.11   0.19
SUM                     0.00   0.75   0.00  -0.10   0.05   0.36   0.09
Openness
original           4           0.26   0.41  -0.33          0.38   0.22
artistic           9           0.15   0.36                 0.29   0.17
imaginative       14           0.30   0.55                 0.22   0.21
SUM                     0.00   0.30   0.57  -0.13   0.00   0.39   0.26
Agreeableness
rude               3           0.12         -0.51         -0.32   0.19
forgiving          6                         0.23          0.32   0.24
considerate       13                         0.49          0.48   0.29
SUM                     0.00  -0.07   0.00   0.58   0.00   0.50   0.11
Conscientiousness
thorough           1                               0.71    0.35   0.30
lazy               7                        -0.16  -0.41  -0.35   0.20
efficient         11                               0.39    0.48   0.28
SUM                     0.00   0.00   0.00   0.09   0.64   0.51   0.11

The results show that all items load on their primary factor although some loadings are very small (e.g., forgiving). Secondary loadings tend to be small (< .2), although they are highly significant in the large sample. All items load on the evaluative bias factor, with some fairly large loadings for considerate, efficient, and talkative. Reserved is the most evaluatively neutral item. Acquiescence bias is rather weak.

The scale scores are most strongly related to the intended latent factor. The relationship is fairly strong for neuroticism and extraversion, suggesting that about 50% of the variance in scale scores reflects the intended construct. However, for the other three dimensions, correlations suggest that less than 50% of the variance reflects the intended construct. Moreover, the remaining variance is not just random measurement error. Evaluative bias contributes from 10% up to 25% of additional variance. Acquiescence bias plays a minor role because most scales have a reverse scored item. Openness is an exception and acquiescence bias contributes 10% of the variance in scores on the Openness scale.
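Because the factors in this model are independent, the variance shares in the previous paragraph follow directly from squaring the sum-score correlations in Table 1. A small sketch in Python, with the values hand-copied from the table:

```python
# Correlations of the BFI-S sum scores with their primary factor, the halo
# factor, and the acquiescence factor (values from Table 1 above).
# Squaring each correlation gives the proportion of scale-score variance
# attributable to that (independent) factor.
sum_score_rs = {
    "Neuroticism":       {"primary": 0.75, "halo": -0.30, "acq": 0.09},
    "Extraversion":      {"primary": 0.75, "halo":  0.36, "acq": 0.09},
    "Openness":          {"primary": 0.57, "halo":  0.39, "acq": 0.26},
    "Agreeableness":     {"primary": 0.58, "halo":  0.50, "acq": 0.11},
    "Conscientiousness": {"primary": 0.64, "halo":  0.51, "acq": 0.11},
}

for scale, rs in sum_score_rs.items():
    shares = {k: round(v**2, 2) for k, v in rs.items()}
    print(scale, shares)
```

The squared values reproduce the pattern described above: roughly 56% construct variance for neuroticism and extraversion, less than 50% for the other three scales, halo contributions between about 9% and 26%, and a non-trivial acquiescence share only for openness.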

Given the good fit of this model, I recommend it for studies that want to examine correlates of the Big Five or that want to compare groups. Using this model will produce better estimates of effect sizes and control for spurious relationships due to method factors.

Open-SOEP: Cohort vs. Age Effects on Personality

The German Socio-Economic Panel (SOEP) is a unique and amazing project. Since 1984, representative samples of German families have been surveyed annually. This project has produced a massive amount of data and hundreds of publications. The traditional journal publications make it difficult to keep track of developments and to find related articles. A better way to make use of these data may be open science, where researchers can quickly share information.

In 2005, the SOEP included a brief, 15-item measure of the Big Five personality traits. These data were used for cross-sectional studies that related personality to other variables measured in the SOEP, such as well-being (Rammstedt, 2007). In 2009, the SOEP repeated the measurement of the Big Five. This provided longitudinal data for analyses of stability and change of personality. Researchers rushed to analyze the data and to report their findings. JPSP published two independent articles based on the same data (Lucas & Donnellan, 2011; Specht, Egloff, & Schmukle, 2011). Both articles examined age differences across birth cohorts and over time. Ideally, age effects would show up in both analyses and produce similar trends in the data. Both articles also paid little attention to cohort differences in personality (i.e., Germans born in 1920 who grew up during the Nazi era might differ from Germans born in 1950 who grew up during the revolutionary 60s).

In 2017, the Big Five questions were administered again, which makes it easier to spot age trends and to distinguish age effects from cohort effects. Recently, the first article based on the three waves of data was published in JPSP (Wagner, Lüdtke, & Robitzsch, 2019). The article focused on retest correlations (consistency of individual differences over time) and did not examine mean levels of personality. The article does not mention cohort effects.

Cohort/Culture Effects

Like many Western countries, German culture has changed tremendously during the 20th century. In addition, German culture has been shaped by unique historical events such as the rise and fall of Hitler, the Second World War followed by the Wirtschaftswunder, the division of the country into a democratic and a socialist state, and the unification of Germany after the fall of the Berlin Wall. The SOEP data provide a unique opportunity to examine whether personality is shaped by culture.

So far, studies of cultural influences on personality have mostly relied on cross-cultural comparisons of Western cultures with non-Western cultures. The main finding of these studies is that citizens of modern, individualistic nations tend to be more extraverted and open to experiences than citizens in traditional, collectivistic cultures.

Based on these findings, one might expect higher levels of extraversion and openness in younger generations of Germans who grew up in a more individualistic culture than their parents and grandparents.

Method

The data are the Big Five ratings for the three waves of the SOEP (vp, zp, & bdp). Data were prepared and analyzed using R (see OSF for the R code). The three items for each of the Big Five scales were summed and analyzed as a function of 7 ten-year birth cohorts (from 1918-1928, aged 77-87 in 2005, to 1978-1988, aged 17-27 in 2005) and three waves (2005, 2009, 2013). The overall mean was subtracted from each of the 21 cell means, and the mean differences were divided by the pooled standard deviation. This way, mean differences in the figures are standardized mean differences, which eases the interpretation of effect sizes.
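The standardization step described above can be sketched in a few lines of code. The published analysis used R; the sketch below uses Python, and the data frame, column names, and simulated item scores are purely illustrative stand-ins for the SOEP person-level file. For simplicity, the total-sample standard deviation stands in for the pooled SD.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in for the SOEP person-level data; the real analysis
# uses the vp (2005), zp (2009), and bdp (2013) waves, and the column
# names here are hypothetical.
n = 3000
df = pd.DataFrame({
    "cohort": rng.integers(0, 7, n),           # 7 ten-year birth cohorts
    "wave": rng.choice([2005, 2009, 2013], n),  # 3 measurement waves
    # three items per scale (rated 1-7), summed into one scale score
    "o1": rng.integers(1, 8, n),
    "o2": rng.integers(1, 8, n),
    "o3": rng.integers(1, 8, n),
})
df["openness"] = df[["o1", "o2", "o3"]].sum(axis=1)

# Express each of the 21 cohort-by-wave cell means as a deviation from
# the overall mean in standard-deviation units (a Cohen's-d-style metric).
grand_mean = df["openness"].mean()
sd = df["openness"].std(ddof=1)  # total SD as a stand-in for the pooled SD
cell_d = (df.groupby(["cohort", "wave"])["openness"].mean() - grand_mean) / sd

print(cell_d.round(2))
```

With the real SOEP variables, the same group-and-standardize pattern yields the 21 effect-size coordinates plotted in the figures.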

Results

Openness to Experience

Openness to experience showed a clear cohort effect (Figure 1), with the lowest scores for the oldest cohort (1918-1928) and the highest scores for the youngest cohort (1978-1988). The difference between the youngest and oldest cohorts is d = .72, which is conventionally considered a large effect size. In comparison, there is no clear age trend in Figure 1. While scores decrease from t1 to t2, they increase from t2 to t3, and all differences between waves are small, |d| < .2.

Extraversion

Extraversion also shows a cohort effect in the predicted direction, but the effect size is smaller, d = .34.

In contrast, there are no age effects and the overall difference between 2005 and 2013 is d = -0.01.

Conscientiousness

I next examined conscientiousness because studies of age effects tend to show the largest effects for this Big Five dimension. Regarding cohort effects, one might expect a decrease across cohorts because older generations worked very hard to rebuild post-war Germany.

Consistent with the developmental literature, the youngest cohort shows an increase in conscientiousness from 2005 to 2013, although the effect size is small (d = .21). The other cohorts show very small decreases in conscientiousness, except for the oldest cohort, which shows a somewhat larger decrease, d = -.22. Regarding cohort effects, there is no general trend, but the youngest cohort shows very low levels of conscientiousness even in 2013, when its members are 25 to 35 years old.

Agreeableness

Developmental studies suggest that agreeableness increases as people get older. However, the SOEP data do not confirm this trend.

Within each cohort, agreeableness scores decrease, although the effect sizes are very small. The overall decrease from 2005 to 2013 is d = -.09. In contrast, there is a clear cohort effect, with agreeableness being highest in the oldest generation. The decrease tends to level off for the last three generations. The effect size is moderate, d = -.38.

Neuroticism

The main result for neuroticism is that there is neither a pronounced cohort effect, d = -.09, nor age effect, d = -.13.

Conclusion

Previous analyses of personality data in the SOEP have focused on age effects and interpreted cross-sectional differences between older and younger Germans as age effects. However, these analyses were based on only two waves of data, which makes it difficult to interpret changes in personality scores over time. The third wave shows that some of the trends did not continue and suggests that there are no notable effects of aging in the SOEP data. The only age effect consistent with the literature is an increase in conscientiousness in the youngest cohort of 17 to 27-year-olds.

However, the data are consistent with cohort effects that mirror the findings of cross-cultural studies: the more individualistic a culture becomes, the more open and extraverted individuals become. Deeper analyses might help to elucidate which factors contribute to these changes (e.g., education level). The results also suggest that agreeableness decreased, which might be another consequence of increasing individualism.

Overall, the results suggest that personality is influenced by cultural factors during adolescence and early adulthood, but that personality remains fairly stable throughout adulthood. This conclusion is also supported by other longitudinal studies (e.g., MIDUS) that show little change in Big Five scores over time. Maybe Costa and McCrae were not entirely wrong when they compared personality to plaster that can be shaped while it is setting, but remains stable once it has set.