
Psychologists are not immune to the Dunning-Kruger Effect

Background

Bar-Anan and Vianello (2018) published a structural equation model in support of a dual-attitude model that postulates distinct explicit and implicit attitudes towards racial groups, political parties, and the self. I used their data to argue against a dual-attitude model. Vianello and Bar-Anan (2020) wrote a commentary that challenged my conclusions. I was a reviewer of their commentary and pointed out several problems with their new model (Schimmack, 2020). They did not respond to my review, and their commentary was published without changes. I wrote a reply to their commentary in which I merely pointed to my criticism of their new model. Vianello and Bar-Anan wrote a review of my reply in which they continue to claim that my model is wrong. I invited them to discuss the differences between our models, but they declined. In this blog post, I show that Vianello and Bar-Anan lack insight into the shortcomings of their model, which is consistent with the Dunning-Kruger effect: incompetent individuals lack insight into their own incompetence. On top of this, Vianello and Bar-Anan show willful ignorance by resisting arguments that undermine their motivated belief in dual-attitude models. As I show below, Vianello and Bar-Anan’s model has several unexplained results (e.g., negative loadings on method factors), fits worse than my model, and produces false evidence of incremental predictive validity for the implicit attitude factors.

Introduction

The skill set of psychology researchers is fairly limited. In some areas, expertise is needed to design clever experimental setups. In other areas, some expertise in the use of measurement instruments (e.g., EEG) is required. However, for the most part, once data are collected, little expertise is needed. Data are analyzed with simple statistical tools like t-tests, ANOVAs, or multiple regression. These methods are implemented in simple commands, and no expertise is required to obtain results from statistics programs like SPSS or R.

Structural equation modeling is different because researchers have to specify a model that is fitted to the data. With complex data sets, the number of possible models increases exponentially, and it is not possible to specify all of them and simply pick the one with the best fit. Moreover, many models will have similar fit, and it requires expertise to pick plausible ones. Unfortunately, psychologists receive little formal training in structural equation modeling because graduate training relies heavily on supervision rather than coursework. As most supervisors never received training in structural equation modeling, they cannot teach their graduate students how to perform these analyses. This means that expertise in structural equation modeling varies widely.

An inevitable consequence of wide variation in expertise is that individuals with low expertise have little insight into their limited abilities. This is known as the Dunning-Kruger effect, which has been replicated in numerous studies. Even incentives to provide accurate performance estimates do not eliminate the overconfidence of individuals with low levels of expertise (Ehrlinger et al., 2008).

The Dunning-Kruger effect explains Vianello and Bar-Anan’s (2020) response to my article, a response that presents another ill-fitting model that makes little theoretical sense. This overconfidence may also explain why they are unwilling to engage in a discussion of their model with me. They may not realize that my model is superior because they were unable to compare the two models directly. As their commentary is published in the influential journal Perspectives on Psychological Science, and as many readers lack the expertise to evaluate the merits of their criticism, it is necessary to explain clearly why their criticism of my models is invalid and why their new alternative model is flawed.

Reproducing Vianello and Bar-Anan’s Model

I learned the hard way that the best way to fit a structural equation model is to start with small models of parts of the data and then add variables or other partial models to build a complex model. The reason is that bad fit in small models can be easily identified and can lead to important model modifications, whereas bad fit in a complex model can have thousands of reasons that are difficult to diagnose. In this particular case, I saw no reason to even fit a complex model for attitudes towards political parties, racial groups, and the self. Instead, I fitted separate models for each attitude domain. Vianello and Bar-Anan (2020) take issue with this decision.

As for estimating method variance across attitude domains, that is the very logic behind an MTMM design (Campbell & Fiske, 1959; Widaman, 1985): Method variance is shared across measures of different traits that use the same method (e.g., among indirect measures of automatic racial bias and political preferences). Trait variance is shared across measures of the same trait that use different methods (e.g., among direct and indirect measures of racial attitude). Separating the MTMM matrix into three separate submatrices (one for each trait), as Schimmack did in his article, misses a main advantage of an MTMM design.

This criticism is based on an outdated notion of validation by means of correlations in a multi-trait-multi-method (MTMM) matrix. In these MTMM matrices, every trait is measured with every method. For example, the Big Five traits are measured with students’ self-ratings, mothers’ ratings, and fathers’ ratings (5 traits x 3 methods). This is not possible in validation studies of explicit and implicit measures because it is assumed that explicit measures measure explicit constructs and implicit measures measure implicit constructs. Thus, it is not possible to fully cross traits and methods. This problem is evident in all models by Bar-Anan and Vianello as well as in my own. Bar-Anan and Vianello make the mistake of assuming that using implicit measures for several attitude domains solves this problem, but their assumption that correlations between implicit measures in one domain and implicit measures in another domain can solve it is wrong. In fact, it makes matters worse because they fail to model method variance within a single attitude domain properly.

To show this problem, I first constructed measurement models for each attitude domain and then show that combining well-fitting models of the three domains produces a better-fitting model than Vianello and Bar-Anan’s model.

Racial Bias

In their revised model, Vianello and Bar-Anan postulate three method factors: one for explicit measures, one for IAT-related measures, and one for the Affective Misattribution Paradigm (AMP) and the Evaluative Priming Task (EPT). It is not possible to estimate a separate method factor for all explicit measures, but it is possible to allow for a method factor that is unique to the IAT-related measures and one that is unique to the AMP and EPT. As a first step, I fitted this model to the measures of racial bias. The model appears to have good fit, RMSEA = .013, CFI = .973. In this model, the correlation between the explicit and implicit racial bias factors is r = .80.
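To make the model structure concrete, here is a minimal sketch of how such a measurement model could be specified in Python with the semopy package, which uses lavaan-style model syntax. The indicator names and the data file are hypothetical placeholders, not the actual variables in Bar-Anan and Vianello's data set, and the original analyses were not necessarily run with this software.

```python
import pandas as pd
import semopy

# Hypothetical sketch of the racial-bias measurement model described above.
# Indicator names (exp_pref, exp_thermo, mrs, iat, biat, sciat, amp, ept)
# and the file name are placeholders, not the actual variables in the data.
desc = """
exp_race =~ exp_pref + exp_thermo + mrs
imp_race =~ iat + biat + sciat + amp + ept
iat_meth =~ iat + biat + sciat
ampept_meth =~ amp + ept
"""
# A full specification would additionally constrain the method factors to be
# uncorrelated with the trait factors and with each other.

df = pd.read_csv("bar_anan_vianello_2018.csv")  # hypothetical file name
model = semopy.Model(desc)
model.fit(df)
print(semopy.calc_stats(model).T)  # chi2, df, CFI, RMSEA, etc.
print(model.inspect())             # loadings and the explicit-implicit correlation
```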

However, it would be premature to stop the analysis here because overall fit values in models with many missing values can be misleading (Zhang & Savalei, 2020). Even if fit were good, it is good practice to examine the modification indices to see whether some parameters are misspecified.

Inspection of the modification indices shows one very large value (MI = 146.04) for the residual correlation between the feeling thermometer and the preference ratings. There is a very plausible explanation for this finding. These two measures are very similar and can share method variance. For example, socially desirable responding could have the same effect on both ratings. This was the reason why I included only one of the two measures in my model. An alternative is to include both ratings and allow for a correlated residual to model the shared method variance.

As predicted by the MI, model fit improved, RMSEA = .006, CFI = .995. Vianello and Bar-Anan (2020) might object that this modification is post hoc, made after peeking at the data, while their model was specified theoretically. However, this argument is weak. If they really predicted on theoretical grounds that feeling-thermometer and direct preference ratings share no method variance, it is not clear what theory they have in mind. After all, shared rating biases are very common. Moreover, their model does assume shared method variance between these measures, but it also predicts that this method variance influences dissimilar measures like the Modern Racism Scale and even ratings of other attitude objects. In short, neither their model nor my models are based on theories, in part because psychologists have neglected to develop and validate measurement theories. Even if it were theoretically predicted that feeling-thermometer and preference ratings do not share method variance, the large MI for this parameter would indicate that this theory is wrong. Thus, the data falsify this prediction. In the modified model, the implicit-explicit correlation increases from .80 to .90, providing even less support for the dual-attitude model.
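Continuing the hypothetical sketch from above, the modification amounts to adding a single residual covariance between the two direct ratings and refitting; again, the variable names are placeholders.

```python
# Hypothetical continuation of the sketch above: add the residual covariance
# flagged by the large modification index and refit the model.
desc_mod = desc + "\nexp_pref ~~ exp_thermo\n"  # shared rating bias between the two direct ratings

model_mod = semopy.Model(desc_mod)
model_mod.fit(df)

print(semopy.calc_stats(model).T)      # fit of the original specification
print(semopy.calc_stats(model_mod).T)  # fit with the correlated residual added
```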

Further inspection of the modification indices showed no plausible further improvements of the model. One important finding in this partial model is that there is no evidence of shared method variance between the AMP and EPT, r = -.04. Thus, closer inspection of the correlations within the racial attitude domain reveals two problems for Vianello and Bar-Anan’s model: there is evidence of shared method variance between two explicit measures, and there is no evidence of shared method variance between two implicit measures, namely the AMP and EPT.

Next, I built a model for the political orientation domain, starting with the specification in Vianello and Bar-Anan’s model. Once more, overall fit appears to be good, RMSEA = .014, CFI = .989. In this model, the correlation between the implicit and explicit factors is r = .90. However, inspection of the modification indices replicates the residual correlation between feeling-thermometer and preference ratings (MI = 91.91). Allowing for this shared method variance improved model fit, RMSEA = .012, CFI = .993, but had little effect on the implicit-explicit correlation, r = .91. In this model, there was some evidence of shared method variance between the AMP and EPT, r = .13.

Next, I put these two well-fitting models together, leaving each model unchanged. The only new question is how measures of racial bias should be related to measures of political orientation. It is common to allow trait factors to correlate freely. This is also what Vianello and Bar-Anan did, and I followed this common practice. Thus, there is no theoretical structure imposed on the trait correlations. I did not specify any additional relations for the method factors. If such relationships exist, this should lead to poor fit. Model fit seemed to be good, RMSEA = .009, CFI = .982. The biggest MI was observed for the loading of the Modern Racism Scale (MRS) on the explicit political orientation factor, MI = 197.69. This is consistent with the item content of the MRS, which combines racism with conservative politics (e.g., opposition to affirmative action). For that reason, I included the MRS in my measurement model of political orientation (Schimmack, 2020).

Vianello and Bar-Anan (2020) criticize my use of the MRS: “For instance, Schimmack chose to omit one of the indirect measures—the SPF—from the models, to include the Modern Racism Scale (McConahay, 1983) as an indicator of political evaluation, and to omit the thermometer scales from two of his models. We assume that Schimmack had good practical or theoretical reasons for his modelling decisions; unfortunately, however, he did not include those reasons.” If they had inspected the modification indices, they would have seen that my decision to use the MRS as a measure of political orientation was justified by the data as well as by the item content of the scale.

After allowing for this theoretically expected relationship, model fit improves, chi2(df = 231) = 506.93, RMSEA = .007, CFI = .990. Next, I examined whether the IAT method factor for racial bias is related to the IAT method factor for political orientation. Adding this relationship did not improve fit, chi2(df = 230) = 506.65, RMSEA = .007, CFI = .990. More important, the correlation was not significant, r = -.06. This is a problem for Vianello and Bar-Anan’s model, which assumes the two method factors are identical. To test this hypothesis, I fitted a model with a single IAT method factor. This model had worse fit, chi2(df = 231) = 526.99, RMSEA = .007, CFI = .989. Thus, there is no evidence for a general IAT method factor.
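For readers who want to verify the comparison, the two specifications differ by one degree of freedom, so a conventional chi-square difference test can be computed from the values reported above (this assumes the single-factor model is nested in the two-correlated-factors model, which holds when the factor correlation is fixed to 1; the test is somewhat liberal at that boundary).

```python
from scipy.stats import chi2

# Chi-square difference test for the nested comparison reported above:
# two correlated IAT method factors vs. a single IAT method factor.
chi2_two_factors, df_two_factors = 506.65, 230
chi2_one_factor, df_one_factor = 526.99, 231

delta_chi2 = chi2_one_factor - chi2_two_factors  # 20.34
delta_df = df_one_factor - df_two_factors        # 1
p_value = chi2.sf(delta_chi2, delta_df)

print(f"delta chi2 = {delta_chi2:.2f}, delta df = {delta_df}, p = {p_value:.5f}")
# A significant result means that forcing a single IAT method factor worsens fit.
```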

I next explored the possibility of a method factor for the explicit measures. I had identified shared method variance between the feeling-thermometer and preference ratings for racial bias and for political orientation. I now modeled this shared method variance with method factors and let the two method factors correlate with each other. The addition of this correlation did not improve model fit, chi2(df = 230) = 506.93, RMSEA = .007, CFI = .990, and the correlation between the two explicit method factors was not significant, r = .00. Imposing a single method factor for both attitude domains reduced model fit, chi2(df = 229) = 568.27, RMSEA = .008, CFI = .987.

I also tried to fit a single method factor for the AMP and EPT. The model only converged after constraining two loadings. Model fit then improved slightly, chi2(df = 230) = 501.75, RMSEA = .007, CFI = .990. The problem for Vianello and Bar-Anan is that the better fit was achieved with a negative loading on the method factor. This is inconsistent with the idea that a general method factor inflates correlations across attitude domains.

In sum, there is no evidence that method factors are consistent across the two attitude domains. Therefore, I retained the basic model that specified method variance within attitude domains. I then added the three criterion variables to the model. As in Vianello and Bar-Anan’s model, contact was regressed on the explicit and implicit racial bias factors, and previous voting and intention to vote were regressed on the explicit and implicit political orientation factors. The residuals were allowed to correlate freely, as in Vianello and Bar-Anan’s model.

Overall model fit decreased slightly in terms of CFI, chi2(df = 297) = 668.61, RMSEA = .007, CFI = .988. The modification indices suggested an additional relationship between the explicit political orientation factor and racial contact. Modifying the model accordingly improved fit slightly, chi2(df = 296) = 660.59, RMSEA = .007, CFI = .988. There were no additional modification indices involving the two voting measures.

Results were different from Vianello and Bar-Anan’s results. They reported that the implicit factors had incremental predictive validity for all three criterion measures.

In contrast, the model I am developing here shows no incremental predictive validity for the implicit factors.

It is important to note that I created the measurement model before I examined predictive validity. After the measurement model was created, the criterion variables were added, and the data determined the pattern of results. It is unclear how Vianello and Bar-Anan developed a measurement model with non-existent method factors that produced the desired outcome of significant incremental validity.

To try to reproduce their full result, I also added the self-esteem measures to the model. To do so, I first created a measurement model for the self-esteem measures. The basic measurement model had poor fit, chi2(df = 58) = 434.49, RMSEA = .019, CFI = .885. Once more, the modification indices suggested that feeling-thermometer and preference ratings share method variance. Allowing for this residual correlation improved model fit, chi2(df = 57) = 165.77, RMSEA = .010, CFI = .967. Another MI suggested a loading of the speeded task on the implicit factor, MI = 54.59. Allowing for this loading further improved model fit, chi2(df = 56) = 110.01, RMSEA = .007, CFI = .983. The crucial correlation between the explicit and implicit factors was r = .36. The corresponding correlation in Vianello and Bar-Anan’s model was r = .30.

I then added the self-esteem model to the model with the other two attitude domains, chi2(df = 695) = 1309.59, RMSEA = .006, CFI = .982. Next, I added correlations of the IAT method factor for self-esteem with the two other IAT method factors. This improved model fit, chi2(df = 693) = 1274.59, RMSEA = .006, CFI = .983. The reason was a significant correlation between the IAT method factors for self-esteem and racial bias. I offered an explanation for this finding in my article: most White respondents associate the self with good and White with good; if some respondents are better able to control these automatic tendencies, they will show less pro-self and pro-White bias. In contrast, Vianello and Bar-Anan have no theoretical explanation for a method factor that is shared across attitude domains. There was no significant correlation between the IAT method factors for self-esteem and political orientation. The reason is that automatic tendencies for political orientation are more balanced, so that method variance does not favor one direction over the other.

My model had better fit with fewer free parameters than Vianello and Bar-Anan’s model, which yielded chi2(df = 679) = 1719.39, RMSEA = .008, CFI = .970. The critical results for predictive validity remained unchanged.

I also fitted Vianello and Bar-Anan’s model and added four parameters that I identified as missing from their model: (a) the loading of the MRS on the explicit political orientation factor and (b) the correlated residuals between feeling-thermometer and preference ratings for each domain. Making these adjustments improved model fit considerably, chi2(df = 675) = 1235.59, RMSEA = .006, CFI = .984. This modest adjustment altered the pattern of results for the prediction of the three criterion variables. Unlike in Vianello and Bar-Anan’s original model, the implicit factors no longer predicted any of the three criterion variables.

Conclusion

My interactions with Vianello and Bar-Anan are symptomatic of social psychologists’ misapplication of the scientific method. Rather than using data to test theories, data are being abused to confirm pre-existing beliefs. This confirmation bias goes against philosophies of science that have demonstrated the need to subject theories to strong tests and to allow data to falsify theories. Verificationism is so ingrained in social psychology that Vianello and Bar-Anan ended up with a model that showed significant incremental predictive validity for all three criterion measures, even though this model made several questionable assumptions. They may object that I am biased in the opposite direction, but I presented clear justifications for my modeling decisions, and my model fits better than their model. In my 2020 article, I showed that Bar-Anan also co-authored another article that exaggerated evidence of predictive validity, evidence that disappeared when I reanalyzed the data (Greenwald, Smith, Sriram, Bar-Anan, & Nosek, 2009). Ten years later, social psychologists claim that they have improved their research methods, but Vianello and Bar-Anan’s commentary in 2020 shows that social psychologists have a long way to go. If social psychologists want to (re)gain trust, they need to be willing to discard cherished theories that are not supported by data.

References

Bar-Anan, Y., & Vianello, M. (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General, 147(8), 1264–1272. https://doi.org/10.1037/xge0000383

Ehrlinger, J., Johnson, K., Banner, M., Dunning, D., & Kruger, J. (2008). Why the unskilled are unaware: Further explorations of (absent) self-insight among the incompetent. Organizational Behavior and Human Decision Processes, 105(1), 98–121. https://doi.org/10.1016/j.obhdp.2007.05.002

Greenwald, A. G., Smith, C. T., Sriram, N., Bar-Anan, Y., & Nosek, B. A. (2009). Implicit race attitudes predicted vote in the 2008 U.S. Presidential election. Analyses of Social Issues and Public Policy (ASAP), 9(1), 241–253. https://doi.org/10.1111/j.1530-2415.2009.01195.x

Schimmack, U. (2019). The Implicit Association Test: A method in search of a construct. Perspectives on Psychological Science. https://doi.org/10.1177/1745691619863798

Vianello, M., & Bar-Anan, Y. (2020). Can the Implicit Association Test measure automatic judgment? The validation continues. Perspectives on Psychological Science. https://doi.org/10.1177/1745691619897960

Zhang, X., & Savalei, V. (2020). Examining the effect of missing data on RMSEA and CFI under normal theory full-information maximum likelihood. Structural Equation Modeling: A Multidisciplinary Journal, 27(2), 219–239. https://doi.org/10.1080/10705511.2019.1642111

Statues are Falling, but Intelligence Researchers Cling to Their Racist Past

Psychology wants to be a science. Unfortunately, respect and reputations need to be earned. Just putting the word science in your department name or in the title of your journals doesn’t make you a science. A decade ago, social psychologists were shocked to find out that for years one of their colleagues had simply made up data and nobody had noticed it. Then another social psychologist claimed to have proven physics wrong with evidence of time-reversed causality in a study with erotic pictures and undergraduate students. This also turned out to be a hoax. Over the past decade, psychology has tried to gain respect by running more replication studies of classic findings (which often fail), starting to preregister studies (which medicine implemented years ago), and in general analyzing and reporting results more honestly. However, another crisis in psychology is that most measures are used without evidence that they measure what they are supposed to measure. Imagine a real science in which scientists first ensure that their measurement instruments work and then use them to study distant planets or microorganisms. Not so in psychology. Psychologists have found a way around proper measurement called operationalism. Rather than trying to find measures for constructs, constructs are defined by the measures. What is happiness? While philosophers have tried hard to answer this question, psychologists cannot be bothered to spend time thinking about it. Happiness is whatever your rating on a happiness self-report measure measures.

The same cheap trick has been used by intelligence researchers to make claims about human intelligence. They developed a series of tasks, and performance on these tasks is used to create a score. These scores could be given a name like “score that reflects performance on a series of tasks some White men (yes, I am a White male myself) find interesting,” but then nobody would care about these scores. So they decided to call it intelligence. If pressed to define intelligence, they usually do not have a good answer, but they also don’t feel the need to give one because intelligence is just a label for the test. However, the choice of the term is not an accident. It is supposed to sound as if the test measures something that corresponds to the everyday term intelligence, which makes the test more interesting. However, it is possible that the test is not the best measure of what we normally mean by intelligence. For example, performance on intelligence tests correlates only about r = .3 with self-ratings or ratings of intelligence by close friends and family members. While there can be measurement error in self-ratings, there can also be measurement error in intelligence tests. Although intelligence researchers are considered to be intelligent, they rarely consider this possibility. After all, their main objective is to use these tests and to see how they relate to other measures.

Confusing labels for tests are annoying, but hardly worth a long blog post. However, some racist intelligence researchers use the label to make claims about intelligence and skin color (Lynn & Meisenberg, 2010). Moreover, the authors even use their racist preconception that dark-skinned people are less intelligent to claim that intelligence tests measure intelligence BECAUSE performance on these tests correlates with skin color.

You don’t have to be a rocket scientist to realize that this is a circular argument: intelligence tests are valid because they confirm a racist stereotype. This is not how real science works, but this doesn’t bother intelligence researchers. The questionable article has been cited 80 times.

I only came across this nonsense because a recent article used national IQ scores to make an argument about intelligence and homicides. After concerns about the science were raised, the authors retracted their article, pointing to problems in the measurement of national differences in IQ. The editor of that journal, Psychological Science, wrote an editorial titled “A Call for Greater Sensitivity in the Wake of a Publication Controversy.”

Greater sensitivity also means cleaning the journals of unscientific and hurtful claims that serve no scientific purpose. In this spirit, I asked the current editor of Intelligence in an email on June 15th to retract Lynn and Meisenberg’s offensive article. Today, I received the response that the journal is not going to retract the article.

Richard Haier (Emeritus, Editor in Chief) Decision Letter

This decision just shows the unwillingness among psychologists to take responsibility for a lot of the bad science that is published in their journals. This is unfortunate because it shows the low motivation to change and improve psychology. It is often said that science is the most superior method of gaining knowledge because science is self-correcting. However, scientists often stand in the way of correction, and the process of self-correction is best measured in decades or centuries. Max Planck famously observed that scientific self-correction often requires the demise of the old guard. However, it is also important not to hire new scientists who continue to abuse the freedom and resources awarded to scientists to spread racist ideology. Meanwhile, it is best to be careful and to distrust any claims about group differences in intelligence because intelligence researchers are not willing to clean up their act.

The Pie Of Happiness

This blog post reports the results of an analysis that predicts variation in scores on the Satisfaction with Life Scale (SWLS; Diener et al., 1985) from variation in satisfaction with life domains. A bottom-up model predicts that evaluations of important life domains account for a substantial amount of the variance in global life-satisfaction judgments (Andrews & Withey, 1976). However, empirical tests have failed to confirm this prediction (Andrews & Withey, 1976).

Here I used data from the 2016 well-being supplement of the Panel Study of Income Dynamics (PSID). The analysis is based on 8,339 respondents. The sample is the largest nationally representative sample with the SWLS, although only respondents aged 30 or older are included in the survey.

The survey also included Cantril’s ladder, which was added to the model to identify method variance that is unique to the SWLS and not shared with other global well-being measures. Andrews and Withey found that about 10% of the variance is unique to a specific well-being scale.

The PSID-WB module included 10 questions about specific life domains: house, city, job, finances, hobby, romantic, family, friends, health, and faith. Out of these 10 domains, faith was not included because it is not clear how atheists answer a question about faith.

The problem with multiple regression is that shared variance among predictor variables contributes to the explained variance in the criterion variable, but the regression weights do not show this influence, and the nature of the shared variance remains unclear. A solution to this problem is to model the shared variance among predictor variables with structural equation modeling. I call this method Variance Decomposition Analysis (VDA).
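A small simulation (with made-up numbers, not the PSID data) illustrates the point: when two predictors share most of their variance, the regression weights split the credit for the shared component between them and do not reveal that a single shared source drives most of the prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Two domain ratings that share a common component plus unique parts.
general = rng.normal(size=n)
d1 = 0.8 * general + 0.6 * rng.normal(size=n)
d2 = 0.8 * general + 0.6 * rng.normal(size=n)

# A global judgment driven mostly by the shared component.
y = 0.7 * general + 0.5 * rng.normal(size=n)

# Ordinary multiple regression of y on d1 and d2.
X = np.column_stack([np.ones(n), d1, d2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = 1 - np.var(y - X @ b) / np.var(y)

print("regression weights for d1 and d2:", np.round(b[1:], 2))
print("R squared:", round(r2, 2))
# The weights are moderate and nearly equal; nothing in them shows that the
# explained variance comes almost entirely from the variance the two domains share.
```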

MODEL 1

Model 1 used a general satisfaction (GS) factor to model most of the shared variance among the nine domain satisfaction judgments. However, a single factor model did not fit the data, indicating that the structure is more complex. There are several ways to modify the model to achieve acceptable fit. Model 1 is just one of several plausible models. The fit of model 1 was acceptable, CFI = .994, RMSEA = .030.

Model 1 used two types of relationships among domains. For some pairs of domains, the model assumes a causal influence of one domain on another. For other pairs, it is assumed that judgments about the two domains rely on overlapping information. Rather than simply allowing for correlated residuals, this overlapping variance was modeled as unique factors with constrained loadings for model identification purposes (see the sketch following the Domain Overlap section below).

Causal Relationships

Financial satisfaction (D4) was assumed to have positive effects on housing (D1) and job (D3) satisfaction. The rationale is that income can buy a better house and that pay satisfaction is a component of job satisfaction. Financial satisfaction was also assumed to have negative effects on satisfaction with family (D7) and friends (D8). The reason is that higher income often comes at the cost of less time for family and friends (a work-life trade-off).

Health (D9) was assumed to have positive effects on hobbies (D5), family (D7), and friends (D8). The rationale was that good health is important to enjoy life.

Romantic (D6) was assumed to have a causal influence on friends (D8) because a romantic partner can fulfill many of the needs that a friend can fulfill, but not vice versa.

Finally, the model includes a path from job (D3) to city (D2) because dissatisfaction with a job may be attributed to few opportunities to change jobs.

Domain Overlap

Housing (D1) and city (D2) were assumed to have overlapping domain content. For example, high house prices can lead to less desirable housing and lower the attractiveness of a city.

Romantic (D6) was assumed to share content with family (D7) for respondents who are in romantic relationships.

Friendship (D8) and family (D7) were also assumed to have overlapping content because couples tend to socialize together.

Finally, hobby (D5) and friendship (D8) were assumed to share content because some hobbies are social activities.
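In model syntax, each overlap can be represented as a small "doublet" factor whose loadings are fixed for identification. The sketch below shows one such doublet for housing (D1) and city (D2) alongside the general satisfaction factor, using lavaan-style syntax as implemented in the Python package semopy; the variable names d1 to d9 are hypothetical placeholders for the nine PSID domain items, and the exact constraint syntax may differ from what was used in the original analysis.

```python
import semopy

# Hypothetical sketch: a general satisfaction factor (GS) plus one overlap
# ("doublet") factor for housing (d1) and city (d2). Fixing both doublet
# loadings to 1 identifies the factor; its variance captures the overlap.
desc = """
GS  =~ d1 + d2 + d3 + d4 + d5 + d6 + d7 + d8 + d9
D12 =~ 1*d1 + 1*d2
D12 ~~ 0*GS
"""

model = semopy.Model(desc)
# model.fit(psid_data)  # psid_data: a pandas DataFrame with the nine domain items
# print(semopy.calc_stats(model).T)
```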

Figure 2 shows the same model with parameter estimates.

The most important finding is that the loadings on the general satisfaction (GS) factor are all substantial (> .5), indicating that most of the shared variance stems from variance that is shared across all domain satisfaction judgments.

Most of the causal effects in the model are weak, indicating that they make a negligible contribution to the shared variance among domain satisfaction judgments. The strongest shared variances are observed for romantic (D6) and family (D7) (.60 x .47 = .28) and housing (D1) and city (D2) (.44 x .43 = .19).

Model 1 separates the variance of the nine domains into nine unique variances (the empty circles next to each square) and five variances that represent shared variance among domains (GS, D12, D67, D78, D58). This makes it possible to examine how much the unique variances and the shared variances contribute to variance in SWLS scores. To examine this question, I created a global well-being measurement model with a single latent factor (LS) and the SWLS items and the ladder measures as indicators. The LS factor was regressed on the nine domains. The model also included a method factor for the five SWLS items (swlsmeth).

The model may look a bit confusing, but the top part is equivalent to the model already discussed. The new part is that all nine domains have a causal arrow pointing at the LS factor. The unusual part is that all residual variances are named and that the model includes a latent variable SWLS, which represents the sum score of the five SWLS items. This makes it possible to use the model indirect function to estimate the path from each residual variance to the SWLS sum score. As all of the residual variances are independent, squaring the total path coefficients yields the amount of variance that is explained by each residual, and these variances add up to 1.

GS has many paths leading to SWLS. Squaring the standardized total path coefficient (b = .67) yields 45% explained variance. The four shared variances between pairs of domains (D12, D67, D78, D58) yield another 2% of explained variance, for a total of 47% explained variance from variance that is shared among domains. The residual variances of the nine domains add up to 9% of explained variance. The residual variance in LS that is not explained by the nine domains accounts for 23% of the total variance in SWLS scores. The SWLS method factor contributes 11% of the variance, and the residuals of the five SWLS items, which represent random measurement error, add up to another 11%.
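As a quick arithmetic check on this decomposition (using the rounded values reported above, so the parts only approximately sum to 100%):

```python
# Variance decomposition of SWLS sum scores, using the rounded values from the text.
components = {
    "general satisfaction factor (.67 squared)": 0.67 ** 2,  # about 45%
    "domain-pair overlap factors":               0.02,
    "unique domain variances":                   0.09,
    "LS residual not explained by domains":      0.23,
    "SWLS method factor":                        0.11,
    "random item error":                         0.11,
}

for name, share in components.items():
    print(f"{name:45s} {share:5.1%}")
print(f"{'total (rounding error expected)':45s} {sum(components.values()):5.1%}")
```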

These results show that only a small portion of the variance in SWLS scores can be attributed to evaluations of specific life domains. Most of the variance stems from the shared variance among domains and from unexplained variance. Thus, a crucial question concerns the nature of these variance sources. There are two options. First, the unexplained variance could be due to evaluations of specific domains, and the shared variance among domains may still reflect evaluations of domains. In this case, SWLS scores would have high validity as a global measure of subjective evaluations of life domains. The other possibility is that shared variance among domains and unexplained variance reflect systematic measurement error. In this case, SWLS scores would have only 6% valid variance if they are supposed to reflect global evaluations of life domains. The problem is that decades of subjective well-being research have failed to provide an empirical answer to this question.

Model 2: A bottom-up model of shared variance among domains

Model 1 assumed that shared variance among domains is mostly produced by a general factor. However, a general factor alone was not able to explain the pattern of correlations, and additional relationships had to be added to the model. Model 2 assumes that shared variance among domains is exclusively due to causal relationships among domains. Model fit was good, CFI = .994, RMSEA = .043.

Although the causal network is not completely arbitrary, it is possible to find alternative models. More important, the data do not distinguish between Model 1 and Model 2. Thus, the choice of a causal network or a general factor is arbitrary. The implication is that it is not clear whether 47% of the variance in SWLS scores reflect evaluations of domains or some alternative, top-down, influence.

This does not mean that it is impossible to examine this question. To test these models against each other, it would be necessary to include objective predictors of domain satisfaction (e.g., income, objective health, frequency of sex, etc.) in the model. The models make different predictions about the relationships of these objective indicators to the various domain satisfactions. In addition, it is possible to include measures of systematic method variance (e.g., halo bias) or predictors of top-down effects (e.g., neuroticism) in the model. Thus, the contribution of domain-specific evaluations to SWLS scores is an empirical question.

Conclusion

It is widely assumed that the SWLS is a valid measure of subjective well-being and that SWLS scores reflect a summary of evaluations of specific life domains. However, regression analyses show that only a small portion of the variance in global well-being judgments is explained by unique variance in domain satisfaction judgments (Andrews & Withey, 1976). In fact, most of the variance stems from the shared variance among domain satisfaction judgments (Model 1). Here I show that it is not clear what this shared variance represents. It could be mostly due to a general factor that reflects internal dispositions (e.g., neuroticism) or method variance (halo bias), but it could also result from relationships among domains in a complex network of interdependence. At present it is unclear how much top-down and bottom-up processes contribute to shared variance among domains. I believe that this is an important research question because it is essential for the validity of global life-satisfaction measures like the SWLS. If respondents are not reflecting on important life domains when they rate their overall well-being, these items are not measuring what they are supposed to measure; that is, they lack construct validity.

Construct Validity of the Satisfaction with Life Scale

With close to 10,000 citations in the Web of Science, Ed Diener’s article that introduced the Satisfaction with Life Scale (SWLS) is a citation classic in well-being science. While single-item measures are used in large nationally representative surveys (e.g., the General Social Survey, the German Socio-Economic Panel, the World Value Survey), psychologists prefer multi-item scales because they have higher reliability and thereby potentially higher validity.

Study 1 in Diener et al. (1985) demonstrated that the SWLS shows convergent validity with single-item measures like Cantril’s ladder (r = .62, .66) and Andrews and Withey’s Delighted-Terrible scale (r = .68, .62). Attesting to the higher reliability of the 5-item SWLS, internal consistency was .87 and retest reliability was r = .82. These results suggest that the SWLS and single-item measures measure a single construct with different amounts of random measurement error.

The important question for well-being scientists who use the SWLS and other global well-being measures is whether these items measure what they are intended to measure. To answer this question, we need to know what life-satisfaction measures are intended to measure.

Diener et al. (1985) drew on Andrews and Withey’s (1976) model of well-being perceptions. Accordingly, life-satisfaction judgments are based on subjective evaluations of important concerns.

Judgments of satisfaction are dependent upon a comparison of one’s circumstances with what is thought to be an appropriate standard. It is important to point out that the judgment of how satisfied people are with their present state of affairs is based on a comparison with a standard which each individual sets for him- or herself; it is not externally imposed. It is a hallmark of the subjective well-being area that it centers on the person’s own judgments, not upon some criterion which is judged to be important by the researcher (Diener, 1984).

This definition of life-satisfaction makes two important points. First, it is assumed that respondents are thinking about their circumstances when they judge their life-satisfaction. That is, we can think of life-satisfaction as an attitude with an individual’s life as the attitude object. Just as individuals are assumed to think about the important features of Coca-Cola when they are asked to report their attitudes towards Coca-Cola, respondents are assumed to think about the important features of their lives when they report their attitudes towards their lives.

The second part of the definition makes it clear that attitudes towards lives are based on subjectively chosen criteria for evaluating lives. Just as individuals may like or dislike the taste of Coke, the same life circumstance can be evaluated differently by different individuals. Some may be extremely satisfied with an income of $100,000, and some may be extremely dissatisfied with the same income. Among students, some may be happy with a GPA of 2.9, while others may be unhappy with the same GPA. The reason is that the evaluation criteria or standards can vary across individuals and that there is no objective criterion for evaluating life circumstances. This makes life-satisfaction judgments an indicator of subjective well-being.

The reliance on subjective evaluation criteria also implies that individuals can give different weights to different life domains. For some people, family life may be the most important domain, for others it may be work (Andrews & Withey, 1976). The same point is made by Diener et al. (1985).

For example, although health, energy, and so forth may be desirable, particular individuals may place different values on them. It is for this reason that we need to ask the person for their overall evaluation of their life, rather than summing across their satisfaction with specific domains, to obtain a measure of overall life-satisfaction (p. 71).

This point makes sense. If life-satisfaction judgments are based on evaluations of life circumstances and individuals place different emphasis on different life domains, more important domains should have a stronger influence on global life-satisfaction judgments (Schimmack, Diener, & Oishi, 2002). However, starting with Andrews and Withey (1976), empirical tests of this prediction have failed to confirm it. When individuals are asked to rate the importance of life domains, and these ratings are used to compute a weighted average of domain satisfactions, the weighted average is not a better predictor of global judgments than a simple unweighted average (Rohrer & Schmukle, 2018).
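The weighting test is straightforward to run on any data set that contains both domain satisfaction and domain importance ratings. Here is a minimal sketch; the column names and the compare_weighting helper are hypothetical, not part of any published analysis.

```python
import numpy as np
import pandas as pd

def compare_weighting(df: pd.DataFrame, domains: list) -> tuple:
    """Correlate a global life-satisfaction rating with an unweighted versus an
    importance-weighted average of domain satisfaction ratings.
    Assumes hypothetical columns '<domain>_sat', '<domain>_imp', and 'global_ls'."""
    sat = df[[f"{d}_sat" for d in domains]].to_numpy(dtype=float)
    imp = df[[f"{d}_imp" for d in domains]].to_numpy(dtype=float)

    unweighted = sat.mean(axis=1)
    weighted = (sat * imp).sum(axis=1) / imp.sum(axis=1)

    r_unweighted = np.corrcoef(unweighted, df["global_ls"])[0, 1]
    r_weighted = np.corrcoef(weighted, df["global_ls"])[0, 1]
    return r_unweighted, r_weighted

# Example usage with a hypothetical survey data frame:
# r_u, r_w = compare_weighting(survey, ["health", "family", "work", "finances"])
# print(r_u, r_w)  # the literature suggests r_w is rarely larger than r_u
```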

Although this fact has been known since 1974, its theoretical significance has been ignored. There are two possible interpretations of this finding. On the one hand, it could be that importance ratings are invalid. That is, people don’t really know what is important to them, and actual importance is best revealed by the regression weights when global life-satisfaction ratings are regressed on domain satisfaction, either across participants or within participants over time. The alternative explanation is more troubling. In this case, global life-satisfaction judgments are invalid. Maybe these judgments are not based on subjective evaluations of life circumstances at all.

Schwarz and Strack (1999) made the point that global life-satisfaction judgments are based on quick heuristics that produce invalid information. The problem with their criticism is that they focused on unstable sources, such as mood or temporarily accessible information, as the main sources of life-satisfaction judgments. This model fails to explain the high temporal stability of life-satisfaction judgments (Schimmack & Oishi, 2005).

However, it is possible that stable factors produce systematic method variance in life-satisfaction judgments. For example, Andrews and Withey (1976) suggested that halo bias could influence ratings of domain satisfaction and life-satisfaction. They used informant ratings to rule out this possibility, but their test of this hypothesis was statistically flawed (Schimmack, 2019). Thus, it is possible that a substantial portion of the reliable variance in SWLS scores is halo bias.

Diener et al. (1985) tried to address the problem of systematic measurement error in two ways. First, they included the Marlowe-Crowne Social Desirability (MCSD) scale to measure socially desirable responding and found no correlation with SWLS scores, r = .02. The problem is that the MCSD is not a valid measure of socially desirable responding or halo bias, but rather a measure of agreeableness and conscientiousness. Thus, the correlation is better interpreted as evidence that life-satisfaction is fairly independent of these personality traits. Second, Study 3, with 53 elderly residents of Urbana-Champaign, included an interview with two trained interviewers. Afterwards, the interviewers rated the interviewees’ well-being. The averaged interviewer ratings correlated r = .43 with self-ratings of well-being. The problem here is that individuals who are motivated to present a positive image in their SWLS ratings are also likely to present a positive image in an interview. Moreover, the conveyed sense of well-being could reflect individuals’ personality more than their life circumstances. Thus, it is not clear how much of the agreement between self-ratings and interviewer ratings reflects evaluations of actual life circumstances.

The most recent review article by Ed Diener was published last year; “Advances and Open Questions in the Science of Subjective Well-Being” (Diener, Lucas, & Oishi, 2018). The article makes it clear that the construct has not changed since 1985.

“Subjective well-being (SWB) reflects an overall evaluation of the quality of a person’s life from her or his own perspective” (p. 1).

“As the term implies, SWB refers to the extent to which a person believes or feels that his or her life is going well. The descriptor ‘subjective’ serves to define and limit the scope of the construct: SWB researchers are interested in evaluations of the quality of a person’s life from that person’s own perspective” (p. 2).

The authors also explicitly state that subjective well-being measures are subjective because individuals can focus on different aspects of their lives depending on their importance to them.

“it is the subjective nature of the construct that gives it its power. This is due to the fact that different people likely weight different objective circumstances differently depending on their goals, their values, and even their culture” (p. 3).

The fact that global measures allow individuals to assign different weights to different domains is seen as a strength.

Presumably, subjective evaluations of quality of life reflect these idiosyncratic reactions to objective life circumstances in ways that alternative approaches (such as the objective list approach) cannot. Thus, when evaluating the impact of events, interventions, or public-policy decisions on quality of life, subjective evaluations may provide a better mechanism for assessment than alternative, objective approaches (p. 3).

The problem is that this claim requires empirical evidence showing that global life-satisfaction judgments are indeed more valid measures of subjective well-being than simple averages because they weigh information in accordance with individuals’ subjective preferences, and since 1976 this evidence has been lacking.

Diener et al.’s (2018) review glosses over this glaring problem for the construct validity of the SWLS and other global well-being measures.

Because most measures are simple self-reports, considerable research addresses the psychometric properties of these types of assessments. This research consistently shows that existing self-report measures exhibit strong psychometric properties including high internal consistency when multiple-item measures are used; moderately strong test-retest reliability, especially over short periods of time; reasonable convergence with alternative measures (especially those that have also been shown to have high levels of reliability and validity); and theoretically meaningful patterns of associations with other constructs and criteria (see Diener et al., 2009, and Diener, Inglehart, & Tay, 2013, for reviews). There is little debate about the quality of SWB measures when evaluated using these traditional criteria.

While it is true that there is little debate, this does not mean that there is strong evidence for the construct validity of the SWLS. The open question is whether respondents really conduct a memory search for information about important life domains, evaluate these domains based on subjective criteria, and then report an overall summary of these evaluations. If so, subjective importance weights should improve predictions, but they often do not. Moreover, in regression models individual life domains often contribute only small amounts of unique variance (Andrews & Withey, 1976), and some important domains like health often account for close to zero percent of the variance in life-satisfaction judgments.

Convergent Validity

One key feature of construct validity is convergent validity between two independent methods that measure the same construct (Campbell & Fiske, 1959). Ideally, multiple methods are used and it is possible to examine whether the pattern of correlations matches theoretical predictions (Cronbach & Meehl, 1955; Schimmack, 2019). Diener et al. (2018) mention some evidence of convergent validity.

For example, Schneider and Schimmack (2009) conducted a meta-analysis of the correlation between self and informant reports, and they found that there is reasonable agreement (r = .42) between these two methods of assessing SWB.

The problem with this evidence is that a correlation between two measures only shows that both methods have some validity; it is not possible to quantify the amount of valid variance in self-ratings or informant ratings, which requires at least three methods (Andrews & Withey, 1976; Zou, Schimmack, & Gere, 2013). Theoretically, it would be possible that most of the variance in self-ratings is valid and that informant ratings are largely invalid. This is what Andrews and Withey (1976) claimed, with estimates of 65% valid variance in self-ratings and 15% valid variance in informant ratings, given a correlation of r = .32. However, their model was incorrect and allowed method variance in self-ratings to inflate the factor loading of self-ratings.

Zou et al. (2013) avoided this problem by using self-ratings and ratings by two informants as independent methods and found no evidence that self-ratings are more valid than informant ratings; a finding that is mirrored in ratings of personality traits (Anusic et al., 2009). Thus, a correlation of r = .3 implies that 30% of the variance in self-ratings is valid and 30% of the variance in informant ratings is valid.
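The logic behind that estimate can be written out explicitly: if the two methods are equally valid and share no method bias, the observed correlation equals the product of their validity coefficients, so each method's proportion of valid variance equals the observed correlation itself. A minimal check of the arithmetic:

```python
import math

# If self- and informant ratings are equally valid and share no bias:
#   r_observed = v_self * v_informant = v ** 2
r_observed = 0.30
v = math.sqrt(r_observed)   # validity coefficient of each method, about .55
valid_variance = v ** 2     # proportion of valid variance per method, back to .30

print(f"validity coefficient per method: {v:.2f}")
print(f"valid variance per method: {valid_variance:.0%}")
```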

While this evidence shows that self-ratings of life-satisfaction have convergent validity with informant ratings, it also shows that a substantial portion of the reliable variance in self-ratings is not shared with informants. Moreover, it is not clear what information produces agreement between self-ratings and informant ratings. This question has received surprisingly little attention, although it is critical for the construct validity of life-satisfaction judgments. Two articles have examined this question with opposite conclusions. Schneider and Schimmack (2010) found some evidence that satisfaction in important life domains contributed to self-informant agreement. This finding would support the bottom-up model of well-being judgments, according to which raters actually consider life circumstances when they make well-being judgments. In contrast, Dobewall, Realo, Allik, Esko, and Metspalu (2013) proposed that personality traits like depression and cheerfulness account for self-informant agreement. In this case, informants do not need to know anything about life circumstances. All they need to know is whether an individual has a positive or negative lens for evaluating their life. If informants are not using information about life circumstances, they cannot be used to validate the claim that self-ratings are based on evaluations of life circumstances.

Diener et al. (2018) cite a number of additional findings as evidence of convergent validity.

Physiological measures, including brain activity (Davidson, 2004) and hormones (Buchanan, al’Absi, & Lovallo, 1999), along with behavioral measures such as the amount of smiling (e.g., Oettingen & Seligman, 1990; Seder & Oishi, 2012) and patterns of online behaviors (Schwartz, Eichstaedt, Kern, Dziurzynski, Agrawal et al., 2013) have also been used to assess SWB. (p. 7).

This evidence has several limitations. First, hormones do not reflect evaluations and are at best indirectly related to life evaluations. Asymmetries in prefrontal brain activity (Davidson, 2004) have been shown to reflect approach and avoidance motivation more than pleasure and displeasure, and brain activity is a better measure of momentary states than of the evaluation of fairly stable life circumstances. Finally, these measures may also reflect individuals’ personality more than their life circumstances. The same is true for the behavioral measures. Most important, correlations with a single indicator do not provide information about the amount of valid variance in life-satisfaction judgments. To quantify validity, it is necessary to examine these findings within a causal network (Schimmack, 2019).

Diener et al. (2018) agree with my assessment in their final conclusions about the measurement of subjective well-being.

The first (and perhaps least controversial) is that many open questions remain regarding the associations among different SWB measures and the extent to which these measures map on to theoretical expectations; therefore, understanding how the measures relate and how they diverge will continue to be one of the most important goals of research in the area of SWB. Although different camps have emerged that advocate for one set of measures over others, we believe that such advocacy is premature. More research is needed about the strengths, weaknesses, and relative merits of the various approaches to measurement that we have documented in this review (p. 7).

The problem is that well-being scientists have made no progress on this front since Andrews and Withey (1976) conducted the first thorough construct validation studies. The reason is that social and personality psychology suffers from a validation crisis (Schimmack, 2019). Researchers simply assume that measures are valid rather than testing this assumption, or they rely on necessary but insufficient criteria such as internal consistency (alpha) and retest reliability as evidence. Moreover, there is a tendency to ignore inconvenient findings. As a result, more than 40 years after Andrews and Withey’s (1976) seminal work was published, it remains unclear (a) whether respondents aggregate information about important life domains to make global judgments, (b) how much of the variance in life-satisfaction judgments is valid, and (c) which factors produce systematic biases in life-satisfaction judgments that may lead to false conclusions about the causes of life-satisfaction and to false policy recommendations.

Health is probably the best example to illustrate the importance of valid measurement of subjective well-being. It makes intuitive sense that health has an influence on well-being. Illness often prevents individuals from pursuing their goals and enjoying life, as everybody who has had the flu knows. Diener et al. (2018) agree.

“One life circumstance that might play a prominent role in subjective well-being is a person’s health” (p. 15).

It is also difficult to see how there could be dramatic individual differences in the criteria that are used to evaluate health. Sure, fitness levels may be a matter of personal preference, but nobody enjoys a stroke, a heart attack, cancer, or even the flu.

Thus, it was a surprising finding that health seemed to have only a small influence on global well-being judgments.

“Initial research on the topic of health conditions often concluded that health played only a minor role in wellbeing judgments (Diener et al., 1999; Okun, Stock, Haring, & Witter, 1984).”

More problematic was the finding that subjective evaluations of health seemed to play no role in these judgments in multivariate analyses that controlled for shared variance among ratings of several life domains. For example, in Andrews and Withey’s (1976) studies, satisfaction with health contributed only 1% unique variance to the global measure.

In contrast, direct importance ratings show that health is rated as the second most important domain (Rohrer & Schmukle, 2018).

Thus, we either have to conclude that health doesn’t seem to matter for people’s subjective well-being, or we have to conclude that global measures are (partially) invalid because respondents do not weigh life domains in accordance with their importance. This question clearly has policy relevance, as health care costs are a large part of wealthy nations’ GDP and financing health care is a controversial political issue, especially in the United States. Why would this be the case if health were actually not important for well-being? We could argue that health is important for life expectancy (Veenhoven’s happy life-years) or that it matters for objective well-being but not for subjective well-being, but clearly the question of why health satisfaction plays such a small role in global measures of subjective well-being is an important one. The problem is that 40 years of well-being science have passed without addressing this important question. But as they say, better late than never. So, let’s get on with it and figure out how responses to global well-being questions are made and whether these cognitive processes are in line with the theoretical model of subjective well-being.

Frank M. Andrews and Stephen B. Withey’s Social Indicators of Well-Being

In 1976, Andrews and Withey published a groundbreaking book on the measurement of well-being. Although their book has been cited over 2,000 times, including in influential articles such as Diener’s 1984 and 1999 Psychological Bulletin articles on subjective well-being, it is likely that many people are not familiar with it because books are not as accessible as online articles. The aim of this blog post is to review and comment on the main points made by Andrews and Withey.

CHAPTER 1: Introduction

A&W (wink) believed that well-being indicators are useful because they reflect major societal forces that influence individuals’ well-being.

“In these days of growing interdependence and social complexity we need more adequate cues and indicators of the nature, meaning, pace, and course of social change” (p. 1).

Presumably, A&W would be pleasantly surprised about the widespread use of well-being surveys for this purpose. Well-being questions are included in the General Social Survey, The German Socio-Economic Panel Study, the World Value Survey, and Gallup’s World Poll and the daily survey of Americans’ well-being and health.

A&W saw themselves as part of a broader movement towards evidence-based public policy.

The social indicator “movement” is gaining adherents all over the world. … Several facets of these definitions reflect the basic perspectives of the social indicator effort. The quest is for a limited yet comprehensive set of coherent and significant indicators, which can be monitored over time, and which can be disaggregated to the level of the relevant social unit (p. 4).

Objective and Subjective Indicators

A&W criticize the common distinction between objective and subjective indicators of well-being. Factors such as hunger, pollution, or unemployment that are universally considered bad for individuals are typically called objective indicators.

A&W propose to distinguish three features of indicators.

Thus, it may be more helpful and meaningful to consider the individualistic or consensual aspects of phenomena, the private or public accessibility of evidence, and the different forms and patterns of behavior needed to change something rather than to cling to the more simplistic notions of objective and subjective.

They propose to use “perceptions of well-being” as a social indicator. This indicator is individualistic, private, and may require personalized interventions to change it.

The work of engineers, industrialists, construction workers, technological innovators, foresters, and farmers who alter the physical and biological environment is matched by educators, therapists, advertisers, lovers, friends, ministers, politicians, and issue advocates who are all interested and active in constructing, tearing down, and remodeling subjective appreciations and experiences. (p. 6)

A&W argue that measuring “perceptions of well-being” is important because citizens of modern societies share the belief that societies should maximize well-being.

The promotion of individual well-being is a central goal of virtually all modern societies, and of many units within them. While there are real and important differences of opinion-both within societies and between them-about how individual well-being is to be maximized, there is nearly universal agreement that the goal itself is a worthy one and is to be actively pursued. (p. 7).

Research Goals

A&W’s goal was to develop a set of indicators (not just one) that fulfill several criteria that can be considered validation criteria.

1. Content validity. Their coverage should be sufficiently broad to include all the most important concerns of the population whose well-being is to be monitored. If the relevant population includes demographic or cultural subgroups that might be the targets of separate social policies, or that might be affected differentially by social policies, the indicators should have relevance for each of the subgroups as well as for the whole population.

2. Construct Validity. The validity (i.e., accuracy) with which the indicators are measured should be high, and known.

3. Parsimony and Efficiency: It should be possible to measure the indicators with a high degree of statistical and economic efficiency so that it is feasible to monitor them on a regular basis at reasonable cost.

4. Flexibility: The instrument used to measure the indicators should be flexible so that it can accommodate different trade-offs between resource input, accuracy of output, and degree of detail or specificity.

In short, the indicators should be measured with breadth, relevance, efficiency, validity, and flexibility. (p. 8).

A&W then list several specific research questions that they aimed to answer.

1. What are the more significant general concerns of the American people?

2. Which of these concerns are relevant to Americans’ sense of general wellbeing?

3. What is the relative potency of each concern vis-à-vis well-being?

4. How do the relevant concerns relate to one another?

5. How do Americans arrive at their general sense of well-being?

6. To what extent can Americans easily identify and report their feelings about well-being?

7. To what extent will they bias their answers?

8. How stable are Americans’ evaluations of particular concerns?

9. How comparable are various subgroups within the American population with respect to each of the questions above?

Although some of these questions have been examined in great detail, others have been neglected in the following decades of well-being research. In particular, very little attention has been paid to questions about the potency (strength of influence) of different concerns for global perceptions of well-being, and to the question of how different concerns are related to each other. In contrast, the stability of well-being perceptions has been examined in numerous longitudinal studies (see Anusic & Schimmack, 2016, for the most recent meta-analysis).

Usefulness

A&W “propose six products of value to social scientists, to policymakers and implementers of policy, and to people who want to influence the course of society” (p. 9).

1. Repeated measurement of well-being perceptions can be used to see whether (humans’) lives are getting better or worse.

2. Comparison of groups (e.g., men vs. women, White vs. Black Americans) can be used to examine equity and inequity in well-being.

3. Positive or negative correlations among domains can be informative. For example, marital satisfaction and job satisfaction may be positively or negatively correlated with each other, and such evidence has been used to study work-family or work-life balance.

4. It is possible to see how much well-being perceptions are based on more objective aspects of life (job, housing) versus more abstract aspects such as values or meaning.

5. It is informative to see which domains have a stronger influence on well-being perceptions, which reveals people’s values and priorities.

6. It is important to know whether people appreciate actual improvement. For example, a drop in crime rates is more desirable if citizens also feel safer. “The appreciation of life’s conditions would often seem to be as important as what those conditions actually are” (p. 10).

One may justifiably claim, then, that people’s evaluations are terribly important: to those who would like to raise satisfactions by trying to meet people’s needs, to those who would like to raise dissatisfactions and stimulate new challenges, to those who would suppress or reduce feelings and public expressions of discontent, and above all, to the individuals themselves. It is their perceptions of their own well-being, or lack of well-being, that ultimately define the quality of their lives (p. 10).

BASIC CONCEPTS AND A CONCEPTUAL MODEL

The most important contribution of A&W is their conception of well-being as a broad evaluation of important life domains. We might think about life as a pizza with several slices that have different toppings. Some are appealing (say ham and pineapple) and some are less appealing (say sardines and olives). Well-being is conceptualized as the sum or average of evaluations of the different slices. This view of well-being is now called the bottom-up model after Diener (1984).

We conceive of well-being indicators as occurring at several levels of specificity. The most global indicators are those that refer to life as a whole; they are not specific to any one particular aspect of life (p. 11).

Mostly forgotten is A&W’s distinction between life domains and criteria.

Domains and Criteria

Domains are essentially different slices of the pizza of life such as work, family, health, recreation.

Criteria are values, standards, aspirations, goals, and-in general-ways of judging what the domains of life afford. In modern research, they are best represented by models of human values or motives, such as Schwartz’s model of human values. Thus, life domains or aspects can be desirable or undesirable because they foster or block fulfillment of universal needs for safety, freedom, pleasure, connectedness, and achievement, to name a few.

The quality of life is not just a matter of the conditions of one’s physical, interpersonal and social setting but also a matter of how these are judged and evaluated by oneself and others. The values that one brings to bear on life are in themselves determinants of one’s assessed quality of life. Leave the situations of life stable and simply alter the standards of judgment and one’s assessed quality of life could go up or down according to the value framework. (p. 13).

A Conceptual Model

A&W’s Exhibit 1.1 shows a grid of life domains and evaluation criteria (values). According to their bottom-up model, perceptions of well-being are an integrated summary of these lower-order evaluations of specific life domains.

“The diagram is also intended to imply that global evaluations-i.e., how a person feels about life as a whole-may be the result of combining the domain evaluations or the criterion evaluations” (p. 14)

METHODS AND DATA

The Measurement of Affective Evaluations

A&W proposed that perceptions of well-being are based on two modes of evaluation.

The basic entries in the model, just described, are what we designate as “affective evaluations.” The phrase suggests our hypothesis that a person’s assessment of life quality involves both a cognitive evaluation and some degree of positive and/or negative feeling, i.e., “affect.”

One mode is cognitive and could be performed by a computer. Once objective circumstances are known and there are clear criteria for evaluation, it is possible to compute the discrepancy. For example, if a person needs $40,000 a year to afford housing, food, and basic necessities, an income of $20,000 is clearly inadequate, whereas an income of $70,000 is more than adequate. However, A&W also propose that evaluations have a feeling or affective component. That is, the individual who earns only $20,000 may feel worse about their income, while the individual with a $70,000 income may feel good about their income.
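To make the cognitive mode of evaluation concrete, here is a minimal sketch in R. The $40,000 need threshold comes from the example above; the linear mapping onto a 1-7 (terrible to delighted) scale is my own illustrative assumption, not part of A&W’s model.

```r
# Minimal sketch of a purely "cognitive" evaluation of income adequacy.
# The need threshold (40000) and the linear mapping onto a 1-7 scale are
# illustrative assumptions, not values proposed by Andrews and Withey.
evaluate_income <- function(income, need = 40000) {
  discrepancy <- (income - need) / need              # relative shortfall or surplus
  4 + 3 * pmax(pmin(discrepancy, 1), -1)             # map to 1 (terrible) .. 7 (delighted)
}

evaluate_income(c(20000, 40000, 70000))
# 20000 falls well below the need -> low rating; 70000 exceeds it -> high rating
```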

Not much progress has been made in distinguishing affective from cognitive evaluations, especially when it comes to evaluations of specific life domains. One problem is that it is difficult to measure affective reactions and that self-reports of feelings may simply be cognitive judgments. It is therefore easier to think about well-being “perceptions” as evaluations, without trying to distinguish between cognitive and affective evaluations.

Both global and more specific evaluations are measured with rating scales. A&W favored the delighted-terrible scale, but it didn’t catch on. Much more commonly used is Cantril’s Ladder or life-satisfaction or happiness questions.

In the next section of this interview/questionnaire we want to find out how you feel about various parts of your life, and life in this country as you see it. Please tell me the feelings you have now-taking into account what has happened in the last year and what you expect in the near future.

A&W were concerned that a large proportion of respondents report high levels of satisfaction because they are merely satisfied, but not really happy or delighted. They also wanted a 7-point scale and suggested that more categories would not produce more sensitive responses, while a 7-point scale is clearly preferable to the 3-point happiness measure that is still used in the General Social Survey. They also wanted a scale where each response option is clearly labelled, while some scales like Cantril’s ladder only label the most extreme options (best possible life, worst possible life).

Data Sources

A&W conducted several cross-sectional surveys.

CHAPTER 2: Identifying and Mapping Concerns

Research Strategy

The basic strategy of our approach was first to assemble a very large number of possible life concerns and to write questionnaire items to tap people’s feelings, if any, about them. Then, having administered these items to broad samples of Americans, we used the resulting data to empirically explore how people’s feelings about these items are organized.

IDENTIFYING CONCERNS

The task of identifying concerns involved examining four different types of sources.

One source was previous surveys that had included open questions about people’s concerns. Two examples of such items are:

All of us want certain things out of life. When you think about what really matters in your own life, what are your wishes and hopes for the future? In other words, if you imagine your future in the best possible light, what would your life look like then, if you are to be happy? (Cantril, 1965)

In this study we are interested in people’s views about many different things. What things going on in the United States these days worry or concern you? (Blumenthal et al., 1972)

In our search for expression of life concerns, we examined data from these very general unstructured questions in eight different surveys.

A second type of source was structured interviews, typically lasting an hour or two with about a dozen people of heterogeneous background.

A third type of source, particularly useful for expanding our list of criterion-type concerns, was previously published lists of values.

This information was used to create items that were administered in some of the surveys.

MAPPING THE CONCERNS

Given the list of 123 concern items, the next step was to explore how they fit together in people’s thinking.

Maps and the Mapping Process

Selecting and Clustering Concern-Level Measures

A&W’s work identified clusters of concerns that are often included in surveys of domain satisfaction such as work (green), recreation (orange), standard of living (purple), housing (light blue), health (red), and family (dark blue).

The map for criteria shows that the most central values are hedonism (having fun), achievement, acceptance and affiliation (being accepted by others), and freedom.

These findings are consistent with modern conceptions of well-being as the freedom to seek pleasure and to avoid pain (Bentham).

CHAPTER 3: Measuring Global Well-Being

A&W compiled 68 items that had been used to measure global well-being.

Formal Structure of the Typology

A&W provided a taxonomy of the various global measures.

Accordingly, measures can differ in the perspective of the evaluation, the generality of the evaluation, and the range of the evaluation. For the measurement of global well-being, general measures that cover the full range from an absolute perspective are most widely used.

We find that the Type A measures, involving a general evaluation of the respondent’s life-as-a-whole from an absolute perspective, tend to cluster together into what we shall call the “core cluster” (p. 76).

A study of the retest stability for the same item in the same survey showed a retest correlation of r = .68. This estimate of the reliability of a single global well-being rating has been replicated in numerous studies (see Anusic & Schimmack, 2016; Schimmack & Oishi, 2005, for meta-analyses).

A&W also provided some evidence about measurement invariance across subgroups (gender, racial groups, age groups) and found very similar results.

“The results (not shown) indicated very substantial stabilities across the subgroups. In nearly all cases the correlations within the subgroups were within 0.1 of the correlations within the total population.” (p. 83).

The next results show that different global well-being measures tend to be highly correlated with each other. Exceptions are the 3-point happiness scale in the GSS, which lacks sensitivity, and the affect measure, because affect measures show some discriminant validity from life evaluations (Zou, Schimmack, & Gere, 2013). That is, an individual’s perception of well-being is not fully determined by their perception of how much pleasure versus displeasure they experienced.

A principal component analysis showed items with high loadings that best capture the shared variance among global well-being measures.

The results show that the 7-point delighted-terrible (Life 1, Life 2) or a 7-point happiness scale capture this variance well.

These results lead A&W to conclude that these measures are valid and useful indicators of well-being.

“We believe the Type A measures clearly deserve our primary attention. Thus, it is reassuring to find that the Type A measures provide a statistically defensible set of general evaluations of the level of current wellbeing” (p. 106).

CHAPTER 4: Predicting Global Well-Being: I

A&W argue that statistical predictors of global well-being ratings provide useful information about the cognitive processes (what is going on in the minds of respondents) underlying well-being ratings.

Finding a statistical model that fits the data has real substantive interest, as well as methodological, because in these data the statistical model can also be considered as a psychological model. Not only is the model that method of combining feelings that provides the best predictions, it is also our best indication of what may go on in the minds of the respondents when they themselves combine feelings about specific life concerns to arrive at global evaluations. Thus, our statistical model can also be considered as a simulation of psychological processes (p. 109).

This assumption is reasonable as ratings are clearly influenced by some information in memory that is activated during the formation of a response. However, the actual causal mechanism can be more complicated. For example, job satisfaction may be correlated with global well-being only because respondents think about income and income satisfaction is related to job satisfaction. Moreover, Diener (1984) pointed out that causality may flow from global well-being to domain satisfaction, which is now called a top-down process. Thus, rather than job satisfaction being used to make a global well-being judgment, respondents’ affective disposition may influence their job satisfaction.

A&W’s next finding has been replicated and emphasized in many review articles on well-being.

The prediction of global well-being from the demographic characteristics of the respondents produced straightforward results that have proved surprising to some observers: The demographic variables, either singly or jointly, account for very little of the variance in perceptions of global well-being (less than 10 percent), and they add nothing to what can be predicted (more accurately) from the concern measures (p. 109).

This finding has also been misinterpreted as evidence that objective life circumstances have a small influence on well-being. The problem with this interpretation is that demographic variables do not represent all environmental influences, and many of them are not even environmental factors (e.g., sex, age, race). It is true, however, that there are relatively small differences in well-being perceptions across different groups. The main exception is a persistent gap in well-being between White and Black Americans (Iceland & Ludwig-Dehm, 2019).

A&W conducted numerous tests to look for non-linear relationships. For example, only very low income satisfaction or health satisfaction may be related to global well-being if moderate levels of income or health are sufficient to be satisfied with life. However, they found no notable non-linear relationships.

However, after examining many associations between feelings about specific life concerns and life-as-a-whole, we conclude that substantial curvilinearities do not occur when affective evaluations are assessed using the Delighted-Terrible Scale (p. 110).

Exhibit 4.1 shows simple linear correlations of various life concerns with the averaged repeated ratings on the Delighted-Terrible scale (Life 3).

The main finding is that all correlations are positive, most are moderate, and some are substantial (r > .5), such as the correlations for fun/enjoyment, self-efficacy, income, and family/marriage.

It is important to interpret differences in the strength of correlations with caution because several factors influence how strong these correlations are. One factor is the amount of variability in a predictor variable. For example, while incomes can vary dramatically, the national government is the same for everybody. Thus, there is no variability in government that can produce variability in well-being across respondents, although perceptions of government can vary and could influence well-being perceptions. Keeping this caveat in mind, the results suggest that concerns about standard of living and family life seem to matter most. Interestingly, health is not a major factor, but once again, this might simply reflect relatively small variability in actual health, while health may become more of a concern later in life.

Nevertheless, while the causal processes that produce these correlations are unclear, any theory of well-being has to account for this robust pattern of correlations between global well-being perceptions and concerns.

MULTIVARIATE PREDICTION OF LIFE 3

Regression analysis aims to identify variables that make a unique contribution to the prediction of an outcome. That is, they share variance with the outcome that is not shared by other predictor variables.

As mentioned before, there was no evidence of marked non-linearity, so all variables were entered as measured without transformation or quadratic terms. A&W also examined potential interaction effects, but did not find evidence for these either.
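Before turning to the weighting schemes, here is a minimal sketch of what a “unique contribution” means in practice: the drop in R^2 when one predictor is removed from the full regression (i.e., the squared semi-partial correlation). The data frame d and the variable names are hypothetical placeholders, not A&W’s actual data or analysis script.

```r
# Minimal sketch: unique contribution of each concern to the global rating,
# quantified as the loss in R^2 when that concern is dropped (squared
# semi-partial correlation). `d` and its variables (life3, money, family,
# health, fun, housing) are hypothetical placeholders.
predictors <- c("money", "family", "health", "fun", "housing")
full <- lm(reformulate(predictors, response = "life3"), data = d)
r2_full <- summary(full)$r.squared

unique_r2 <- sapply(predictors, function(p) {
  reduced <- lm(reformulate(setdiff(predictors, p), response = "life3"), data = d)
  r2_full - summary(reduced)$r.squared    # drop in R^2 = unique contribution
})
round(unique_r2, 3)
```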

Weighting Schemes

One of the most important analyses was the exploration of different weighting schemes. Intuitively, it makes sense that some domains are more important than others (e.g., standard of living vs. weather). If this is the case, a predictor that weights standard of living more should be a better predictor of well-being than a predictor that weights all concerns equally.

A&W found that a simple average of 12 concerns was highly correlated with the global measure.

Several explorations provide consistent and clear answers to these questions. A simple summing of answers to any of certain alternative sets of concern items (coded from 1 to 7 on the Delighted-Terrible Scale) provides a prediction of feelings about life-as-a-whole that correlates rather well with the respondent’s actual scores on Life 3. Using twelve selected concerns (mainly domains) and data from the May respondents, this correlation was .67 (based on 1,278 cases); using eight selected concerns (all criteria) and data from the April respondents, the correlation was .77 (based on 1,070 cases). These relatively high values obtain in subgroups of the population as well as in the population as a whole: When the sum of answers to eight of the April concern items was correlated with Life 3 in twenty-one different subgroups of the national adult population, the correlation was never lower than .70 nor higher than .82. (p. 118).

However, a more important question is whether other weighting schemes produce higher correlations, and the important finding is that optimal weights (using regression coefficients) produced only a small improvement in the multiple correlation.

What is extremely interesting is that the optimally weighted combination of concern measures provides a prediction of Life 3 that is, at most, only modestly better than that provided by the simple sum. In the May data, the previous correlation of .67 could be increased to .71 by optimally weighting the twelve concern measures.
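The logic of this comparison can be sketched in a few lines of R. The data frame d, with a global rating (life3) and twelve concern ratings, is a hypothetical placeholder; the comments note the approximate values A&W report.

```r
# Minimal sketch: compare a unit-weighted sum of concern ratings with an
# optimally weighted (regression-based) prediction of the global rating.
# `d` with columns life3 and concern1..concern12 is a hypothetical placeholder.
concerns <- paste0("concern", 1:12)

unit_sum  <- rowSums(d[, concerns])                 # equal (unit) weights
r_unit    <- cor(unit_sum, d$life3)                 # A&W report about .67

fit       <- lm(reformulate(concerns, response = "life3"), data = d)
r_optimal <- sqrt(summary(fit)$r.squared)           # multiple R; A&W report about .71

round(c(unit_weights = r_unit, optimal_weights = r_optimal), 2)
```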

A&W conclude that a model with equal weights is a parsimonious and good model of well-being.

Our conclusion is that the introduction of weights when summing answers to the concern measures is likely to produce a modest improvement, but that even a simple sum of the answers provides a prediction that is remarkably close to the best that can be statistically derived.

However, the use of a fixed regression weight implies that all respondents attach the same importance to different domains. It is plausible that this is not the case. For example, some people live to work and others work to live, so work would have different importance for different respondents. A&W tested this by asking respondents about the importance of several domains and used this information to weight concerns on a person-by-person basis. They found that this did not improve prediction; a sketch of this idiographic weighting follows the quote below.

The significant finding that emerged is that there was no possible use of these importance data that produced the slightest increase in the accuracy with which feelings about life-as-a-whole could be predicted over what could be achieved using an optimally weighted combination of answers to the concern measures alone. Although a number of questions remain with respect to the nature and meaning of the importance measures (some are explored in chap. 7), we have an unambiguous answer to our original question: Data about the importance people assign to concerns did not increase the accuracy with which feelings about life-as-a-whole could be predicted (p. 119).
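Here is a minimal sketch of the idiographic weighting that A&W tested, assuming hypothetical matrices of concern ratings and person-specific importance ratings; it only illustrates the logic, not their actual procedure.

```r
# Minimal sketch of idiographic (person-specific) importance weighting:
# each respondent's concern ratings are weighted by that respondent's own
# importance ratings before averaging. `ratings` and `importance` are
# hypothetical N x k matrices (same respondents, same concerns), and
# `life3` is a hypothetical vector of global ratings.
weighted_mean   <- rowSums(ratings * importance) / rowSums(importance)
unweighted_mean <- rowMeans(ratings)

# A&W's finding: the importance-weighted composite does not predict the
# global rating any better than the simple unweighted average.
cor(weighted_mean, life3)
cor(unweighted_mean, life3)
```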

This surprising result has been replicated numerous times (Rohrer & Schmukle, 2018). However, nobody has attempted to explain why importance weights do not improve prediction. After all, it is theoretically nonsensical within A&W’s theoretical framework to say that work, family, and health are very important, to be extremely dissatisfied in these domains, and then to report high global well-being. If global well-being judgments are, indeed, based on information about concerns and life domains, then important life domains should have a stronger relationship with global well-being ratings than unimportant domains (cf. Schimmack, Diener, & Oishi, 2002).

While I share A&W’s surprise, I am much less pleased by this finding.

Our results point to a simple linear additive one, in which an optimal set of weights is only modestly better than no weights (i.e. equal weights). We confess to both surprise and pleasure at these conclusions (p. 120).

A&W are pleased because a simple additive, linear model is parsimonious, and science favors parsimonious models when they fit the data reasonably well, which seems to be the case here.

However, A&W are too quick to interpret their results as support for their bottom-up model of well-being, where everybody weights concerns equally and well-being perceptions depend only on the relative standing in life domains (good job, happy marriage, etc.).

Interpreted in this light, the linear additive model suggests that somehow individuals themselves “add up” their joys and sorrows about specific concerns to arrive at a feeling about general well-being. It appears that joys in one area of life may be able to compensate for sorrows in other areas; that multiple joys accumulate to raise the level of felt well-being; and that multiple sorrows also accumulate to lower it.

In discussing these findings with various colleagues and commentators, the question has sometimes been raised as to whether the model implies a policy of “give them bread and circuses.” The model does suggest that bread and circuses are likely to increase a population’s sense of general well-being. However, the model does not suggest that bread and circuses alone will ensure a high level of well-being. On the contrary, it is quite specific in noting that concerns that are evaluated more negatively than average (e.g., poor housing, poor government, poor health facilities, etc.) would be expected to pull down the general sense of well-being, and that multiple negative feelings about life would be expected to have a cumulative impact on general well-being.

At least at this stage of the investigation (Chapter 4), other, more troubling interpretations of the results are possible. Maybe most of the variance in these evaluative judgments is due to response biases and socially desirable responding. This alone could produce strong correlations, and these correlations would be independent of the actual concerns that are being rated. Respondents could rate the weather on Mars and we would still see that those who are more satisfied with the weather on Mars have higher global well-being.

However, A&W’s subsequent analyses are inconsistent with their conclusions. Exhibit 4.2 shows the regression weights for various concerns, sorted by the amount of unique contribution to the global impressions. It is clear that averaging the top 12 concerns would give a measure that is more strongly related to global well-being than averaging the last 12 concerns. Thus, domains do differ in importance. The results in the last column (E) are interesting because here 12 domains explain 51% of the total variance, but the squared regression coefficients account for only 17% of the variance, which implies that most of the explained variance stems from variance that is shared among the predictor variables. Thus, it is important to examine the nature of this shared variance more closely. In this model, the first five domains account for 15 of the 17 percentage points. These domains are efficacy, family, money, fun, and housing. Thus, there is some support for the bottom-up model for some domains, but most of the explained variance may stem from shared method variance between concern ratings and well-being ratings.

Exhibit 4.3 confirms this with a stepwise regression analysis in which concerns are entered according to their unique importance. Self-efficacy alone accounts for 30% of the explained variance. Then family adds 9%, money adds 5%, fun adds 3%, and housing adds 1%. The remaining variables add less than 1% individually.
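The stepwise logic can be sketched as a sequence of regressions with the increment in R^2 recorded at each step. The data frame d and the variable names are hypothetical placeholders, not A&W’s actual data.

```r
# Minimal sketch of the stepwise logic: enter concerns one at a time, in order
# of their unique importance, and record the increase in R^2 at each step.
# `d` and the variables (life3, efficacy, family, money, fun, housing) are
# hypothetical placeholders.
entry_order <- c("efficacy", "family", "money", "fun", "housing")

r2 <- numeric(length(entry_order))
for (k in seq_along(entry_order)) {
  f <- reformulate(entry_order[1:k], response = "life3")
  r2[k] <- summary(lm(f, data = d))$r.squared
}
data.frame(step = entry_order, delta_r2 = round(c(r2[1], diff(r2)), 3))
```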

Exhibit 4.4 shows that demographic variables are weak predictors of well-being ratings and that these relationships are weakened further when concerns are added as predictors (column 3).

This suggests that effects of demographic variables are at least partially mediated by concerns (Schimmack, 2008). For example, the influence of income on well-being ratings could be explained by the effect of income on satisfaction with money, which is a unique predictor of well-being ratings (income -> money satisfaction -> life satisfaction). The limitation of regression analysis is that it does not show which of the concerns mediates the influence of income. A better way to examine mediation is to test mediation models with structural equation modeling (Baron & Kenny, 1986).
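For illustration, here is a minimal lavaan sketch of the mediation model named above; the data frame d and the variable names (income, money_sat, life_sat) are hypothetical placeholders rather than A&W’s variables.

```r
# Minimal lavaan sketch of the mediation model:
# income -> money satisfaction -> life satisfaction.
# `d` and the variable names are hypothetical placeholders.
library(lavaan)

model <- '
  money_sat ~ a * income
  life_sat  ~ b * money_sat + c * income   # c is the direct effect of income
  indirect := a * b                        # mediated (indirect) effect
  total    := c + a * b
'
fit <- sem(model, data = d)
summary(fit, standardized = TRUE)
```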

A&W draw the conclusion that there are no major differences in perceived well-being between social groups. As mentioned before, this is only partially correct. The GSS shows consistent racial differences in well-being.

“The conclusion seems inescapable that there is no strong and direct relationship between membership in these social subgroups and feelings about life-as-a-whole” (p. 142).

CHAPTER 5: Predicting Global Well-Being: II

Chapter 5 does not make a major novel contribution. It mainly explores how concerns are related to the broader set of global measures. The results show that the findings in Chapter 4 generalize to other global measures.

CHAPTER 6: Evaluating the Measures of Well-Being

Chapter 6 tackles the important question of construct validity. Do global measures measure individuals’ true evaluations of their lives?

How good are the measures of perceived well-being reported in previous chapters of this book? More specifically, to what extent do the data produced by the various measurement methods indicate a person’s true feelings about his life? (p. 175).

A&W note several reasons why global ratings may have low validity.

Unfortunately, evaluating measures of perceived well-being presents formidable problems. Feelings about one’s life are internal, subjective matters. While very real and important to the person concerned, these feelings are not necessarily manifested in any direct way. If people are asked about these feelings, most can and will speak about them, but a few may lie outright, others may shade their answers to some degree, and probably most are influenced to some extent by the framework in which the questions are put and the format in which the answers are expected. Thus, there is no assurance that the answers people give fully represent their true feelings. (p. 176)

ESTIMATION OF THE VALIDITY AND ERROR COMPONENTS OF THE MEASURES

Measurement Theory and Models

A&W are explicit in their ambition. They want to estimate the proportion of variance in global well-being measures that reflects respondents’ true evaluations of their lives. They want to separate this variance from random measurement error, which is relatively easy, and systematic measurement error, which is hard (Campbell & Fiske, 1959).

The analyses to be reported in this major section of the chapter begin from the fact that the variance of any measure can be partitioned into three parts: a valid component, a correlated (i.e., systematic) error component, and a random error (or “residual”) component. Our general analysis goal is to estimate, for measures of different types, from different surveys, and derived from different methods, how the total variance can be divided among these three components (p. 178).

A “validity coefficient,” as this term is commonly used by social scientists, is the correlation (Pearson’s product-moment r) between the true conditions and the obtained measure of those conditions. The square of the validity coefficient gives the proportion of observed variance that is true variance; e.g., a measure that has a validity coefficient of .8 contains 64 percent valid variance; similarly, a measure that contains 49 percent valid variance has a validity of .7. (p. 179; cf. Schimmack, 2010).
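Restated in code, the arithmetic looks like this; the proportions are placeholders chosen for illustration (the 65% figure anticipates A&W’s own estimate reported below), not estimates from their data.

```r
# Minimal sketch: partitioning the variance of a measure into valid,
# correlated (systematic) error, and random error components.
# The proportions are illustrative placeholders.
valid      <- .65   # proportion of valid variance
systematic <- .10   # proportion of correlated (systematic) error variance
random     <- 1 - valid - systematic

validity_coefficient <- sqrt(valid)   # correlation between true score and measure
round(c(valid = valid, systematic = systematic,
        random = random, validity = validity_coefficient), 2)
```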

One source of systematic measurement error is response sets such as acquiescence bias. Another one is halo bias.

A special form of bias is what is sometimes known as “halo.” One would hope that a respondent, when answering a series of questions about different aspects of something-e.g., his own life, or someone else’s life-would distinguish clearly among those aspects. Sometimes, however, the answers are substantially affected by the respondent’s general impression and are not as distinct from one another as an external observer might think they should be. This is particularly likely to happen when the respondent is not well acquainted with the details of what is being investigated or when the questions and/or answer categories are themselves unclear. Of course, “halo,” which produces an undesired source of correlation among the measures, must be distinguished from the sources of true correlation among the measures. (p. 179).

Exhibit 6.1 uses the graphical language of structural equation modelling to illustrate the measurement model. Here the oval on the left represents the true variation in well-being perceptions in a sample. The boxes in the middle represent two measures of well-being (e.g., two ratings on a delighted-terrible scale). The oval on the right represents sources that produce systematic measurement error (e.g., halo bias). In this model, the observed correlation is the retest reliability of a single global measure, and it is a function of the strength of the causal effects of the true variance (paths a and a') and of the systematic measurement error (paths b and b') on the two measures.

The problem with this model is that there is only one observed correlation and two possible causal effects (assuming equal strength for a and a’, and b and b’). Thus, it is unclear how much of the reliable variance reflects actual variation in true well-being.
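A minimal numerical sketch makes the identification problem concrete. It assumes equal loadings (a = a', b = b') and uncorrelated true and systematic-error factors, so that the retest correlation equals a^2 + b^2; the .68 value is the retest estimate quoted earlier, and everything else is illustrative.

```r
# Minimal sketch of the identification problem: with equal loadings (a = a',
# b = b') and uncorrelated true and error factors, the retest correlation is
# a^2 + b^2. Very different mixtures of valid and systematic-error variance
# reproduce the same observed correlation (.68, the retest estimate above).
r_retest <- .68

a <- seq(0, sqrt(r_retest), by = .1)   # loading on the true well-being factor
b <- sqrt(r_retest - a^2)              # loading on the systematic-error factor
round(data.frame(valid_loading = a, bias_loading = b, implied_r = a^2 + b^2), 2)
# Every row implies the same retest correlation; the data alone cannot
# distinguish valid variance from systematic measurement error.
```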

To make empirical claims about the validity of global well-being ratings, it is necessary to embed them in a network of variables that shows theoretically predicted relationships. Within a larger set of variables, the path from the construct to the observed measures may be identifiable (Cronbach & Meehl, 1955; Schimmack, 2019).

Estimates Derived from the July Data

Scattered throughout the July questionnaire was a set of thirty-seven items that, when brought together for the purposes of the present analysis, forms a nearly complete six-by-six multimethod-multitrait matrix; i.e., a matrix in which six different “traits” (aspects of well-being) are assessed by each of six different methods. (p. 183)

The concerns were chosen to be well spread in the perceptual structure (as shown in Exhibit 2.2) and to include both domains and criteria. The following six aspects of well-being are represented: Life-as-a-whole, House or apartment, Spare-time activities, National government, Standard of living, and Independence or Freedom. The six measurement methods involve: self-ratings on the Delighted-Terrible, Faces, Circles, and Ladder Scales, the Social Comparison technique, and Ratings by others. (The exact wording of each of the concern-level items appears in Exhibit 2.1; see items 30, 44, 85, 87, and 105. For descriptions of the six methods used to assess life-as-a-whole, see Exhibit 3.1, measures G1, G5, G6, G7, G13, and G54; these same methods were also used to assess the concern-level aspects.) (p. 184).

Exhibit 6.2 shows the partial structural equation model for the global ratings and two domains. The key finding is that the correlations between residuals of the same rating scale tend to be rather small, while the validity coefficients are high. This seems to suggest that most of the reliable variance in global and domain measures is valid variance rather than systematic measurement error.
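To give a sense of what such a model looks like in modern software, here is a minimal lavaan sketch of a correlated-trait, uncorrelated-method model, reduced to three aspects and three rating scales for brevity. The data frame d and the variable names are hypothetical placeholders; this is not A&W’s exact specification.

```r
# Minimal sketch of a correlated-trait, uncorrelated-method model for a
# reduced MTMM design: 3 aspects (life, housing, government) rated on 3 scales
# (Delighted-Terrible, Faces, Circles). `d` and the variable names are
# hypothetical placeholders.
library(lavaan)

model <- '
  # trait (aspect) factors
  life    =~ life_dt + life_faces + life_circles
  housing =~ housing_dt + housing_faces + housing_circles
  gov     =~ gov_dt + gov_faces + gov_circles

  # method factors, one per rating scale
  m_dt      =~ life_dt + housing_dt + gov_dt
  m_faces   =~ life_faces + housing_faces + gov_faces
  m_circles =~ life_circles + housing_circles + gov_circles

  # method factors are uncorrelated with the traits and with each other
  m_dt      ~~ 0*m_faces + 0*m_circles
  m_faces   ~~ 0*m_circles
  m_dt      ~~ 0*life + 0*housing + 0*gov
  m_faces   ~~ 0*life + 0*housing + 0*gov
  m_circles ~~ 0*life + 0*housing + 0*gov
'
fit <- cfa(model, data = d, std.lv = TRUE)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```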

As much as I like Andrews and Withey’s work and recognize their contribution to well-being science in its infancy, I am disappointed by their discussion of the model.

Because the model shown in Exhibit 6.2 incorporates our theoretical expectations about how various phenomena influenced the validity and error components of the observed measures, because serious alternative theories have not come to our attention, and because the model in fact fits the data rather well (as will be described shortly), it seems reasonable to use it to estimate the validity and error components of the measures (p. 187)

On the basis of these results we infer that single item measures using the D-T, Faces, or Circles Scales to assess any of a wide range of different aspects of perceived well-being contain approximately 65 percent valid variance (p. 189).

Their own discussion of halo bias suggests that their model fails to account for systematic measurement error that is shared by different rating formats (Schimmack, Böckenholt, & Reisenzein, 2002). It is well known that response sets have a negligible influence on ratings, but halo bias has a stronger influence.

It is important to note that the model actually includes measures that are based on the aggregated ratings of three informants who knew the respondent well (others’ ratings). This makes the study a multi-method study that varies not only the response format but also the rater. Other research has shown that halo bias is largely unique to a single rater (Anusic et al., 2009). Thus, halo bias cannot inflate correlations between respondents’ self-ratings and ratings by others. The problem is that a model with only two methods is unstable. It is only identified here because there are multiple self-ratings. In this case, halo bias can be hidden in higher loadings of the self-ratings than of the others’ ratings on the true well-being factor. This is clearly the case. The loading of the informant ratings (as ratings by others are typically called) for global well-being is only .40, despite the fact that this measure is an average of three ratings and averaging increases validity. Based on the factor loadings, we can infer that the self-informant correlations are about .4 * .8 = .32, which is in line with meta-analytic results from other studies (Schneider & Schimmack, 2009). A&W’s model gives the false impression that self-ratings are much more valid than informant ratings, but models that can test this assumption by using each informant as a separate method show that this is not the case (Zou et al., 2013). Thus, A&W’s work may have given a false impression about the validity of global well-being ratings. While they claimed that two-thirds of the variance is valid variance, other studies suggest it is only about one-third, after taking halo bias into account.

A&W’s model shows high residual correlations among the three others’ ratings of life, housing, and freedom. They interpret this finding as evidence that halo bias has a strong influence on informant ratings.

The relatively high method effects in measures obtained from Others’ ratings is notable, but not terribly surprising. Since other people have less direct access to the respondents’ feelings than do the respondents themselves, one would expect substantially more “halo” in the Others’ ratings than in the respondents’ own ratings. This would be a reasonable explanation for the large amount of correlated error in these scores. (p. 189).

However, if these concerns are related to global well-being, but informant ratings have low loadings on the true factors, the model has to find another way to account for the correlations among informant ratings of related domains. The model is unable to test whether these residual correlations reflect bias or valid relationships. In contrast, other studies find no evidence that halo bias is considerably weaker in self-ratings than in informant ratings (Anusic et al., 2009; Kim, Schimmack, & Oishi, 2012).

It is unfortunate that A&W and many researchers after them aggregated informant ratings instead of treating informants as separate methods. As a result, 30 years of research failed to provide information about the amount of valid variance in self-ratings and informant ratings, leaving A&W’s estimates of 65% and 15% unquestioned. It was only in 2013 that Zou et al. showed that family members are as valid as the self in ratings of global well-being and that the proportions of valid variance are more similar, around 30-40% for both. Self-ratings are only more valid when informant ratings are obtained from recent friends with less than two years of acquaintance (Schneider et al., 2010).

The low validity of informant ratings in A&W’s model led them to suggest that people are rather private about their true feelings about life.

What is of interest is that even people who the respondents felt knew them pretty well were in fact relatively poor judges of the respondents’ perceptions. This suggests that perceptions of well-being may be rather private matters. While people can-and did-give reasonably reliable answers (and, we estimate reasonably valid answers) regarding their affective evaluations of a wide range of life concerns, it would seem that they do not communicate their perceptions even to their friends and neighbors with much precision (p. 191).

While this may be true for neighbors, it is not true for family members. Moreover, other studies have found stronger correlations when informant ratings were aggregated across more informants and when informants were family members rather than neighbors (Schneider & Schimmack, 2009), and these stronger correlations have been used as evidence for the validity of self-ratings (Diener, Lucas, Schimmack, & Helliwell, 2009). By the same logic, A&W’s results would undermine the use of informant ratings as evidence of convergent validity of self-ratings.

There are several reasons why the modest validity of global well-being ratings has been ignored. First, it seems plausible that self-ratings are more valid because individuals have access to all of the relevant information. They know how things are going in their lives and they know what is important to them. In contrast, it is virtually certain that informants do not have access to all of the relevant information. However, these differences in the accessibility of relevant information do not automatically ensure that self-ratings are more valid. This would only be the case if respondents are motivated to engage in an exhaustive search that retrieves all of the relevant information. This assumption has been questioned (Schwarz & Strack, 1999). Thus, we cannot simply assume that self-ratings are valid. The aim of validation research is to test this assumption. A&W’s model was unable to test it because they had several self-ratings and only one aggregated other-rating as indicators.

The second reason may be self-serving interest. The assumption that a single-item happiness rating can be used to measure something as complex and important as an individual’s well-being makes these ratings very appealing for social scientists. If the assessment of well-being required a complex set of questions about 20 life domains with 10 criteria, it would be impossible to survey the well-being of nations and populations. The reason well-being is one of the most widely studied social constructs across several disciplines is that a single happiness item was easy to add to a survey.

DISTRIBUTIONS PRODUCED BY THE MORE VALID METHODS

A&W also examine and care about the distribution of responses. They developed the delighted-terrible scale because they observed that even on a 7-point satisfaction scale responses clustered at the top.

Our data clearly show that the Delighted-Terrible Scale produces greater differentiation at the positive end of the scale than the seven-point Satisfaction Scale (p. 207).

However, the differences that they mention are rather small and both formats produce similar means.

RELATIONSHIPS BETWEEN MEASURES OF PERCEIVED WELL-BEING AND OTHER TYPES OF VARIABLES

a reasonably consistent and not very surprising pattern emerged. Nearly always, relationships were in the “expected” direction, and most were rather weak (p. 214).

However, these weak correlations were systematic and stronger when researchers expected stronger correlations.

Where concerns had been judged relevant to the other items, the average correlation was .31; where staff members had been uncertain as to the concerns’ relevance, the average correlation was .25; and where concerns had been judged irrelevant the average correlation was .15 (p. 214).

The problem is that A&W interpreted these results as evidence that perceived well-being is relatively independent of life conditions or actual behaviors.

Our general conclusion is that one will not usually find strong and direct relationships between measures of perceived well-being and reports of most life conditions or behaviors (p. 214).

This conclusion is partially based on the false inference that most of the variance in well-being ratings is valid variance. Another problem is that well-being is a broad construct and that a single behavior (e.g., sexual frequency; cf. Muise, Schimmack, & Impett, 2016) will only influence a small slice of the pizza of life. Other designs like twin studies or studies of spouses who are exposed to similar life circumstances are better suited to make claims about the importance of life conditions for well-being (Schimmack & Lucas, 2010). If differences in life circumstances do not explain variation in perceptions of well-being, what else could produce these differences? A&W do not address this question.

It would be naive to think that a person’s feelings about various aspects of life could be perfectly predicted by knowing only the characteristics of the person’s present environment. Developing adequate explanations for why people feel as they do about various life concerns would be a challenging undertaking in its own right. While we believe this could prove scientifically fruitful, such an investigation is not part of the work we are presently reporting.

This question became the focus of personality theories of well-being (Costa & McCrae, 1980; Diener, 1984), and it is now well-established that stable dispositions to experience more pleasure (positive affect) and less displeasure (negative affect) contribute to perceptions of well-being (Schimmack, Oishi, & Diener, 2002).

CHAPTER 7: Exploring the Dynamics of Evaluation

Explorations 1 and 2 examine how response categories of different formats correspond to each other.

EXPLORATION 3: HYPOTHETICAL FAMILY INCOMES AND AFFECTIVE EVALUATIONS ON THE D-T SCALE

This exploration examined response options on the delighted-terrible scale in relation to hypothetical income levels.

The dollar amounts would need to be translated into current dollar amounts to be meaningful. Nevertheless, it is surprising how small the gaps are even for the highest, delighted, category.

EXPLORATION 6: AN IMPLEMENTATION OF THE DOMAINS-BY-CRITERIA MODEL

Design of the Analysis and Measures Employed

A&W wrote items for each of the 48 cells in the 6 domains x 8 criteria matrix. They found that all items had small to moderate correlations with the global measure.

“The forty-eight correlations involved range from .13 to .41 with a mean of .20.”

Domain-criterion items were also more strongly correlated with judgments of the same domain than with other domains.

If the model is right, each concern-level variable should tend to have higher relationships with the cell variables that are assumed to influence it than with other cell variables. This expectation also proves to be supported by the data. For the domains, the average of the forty-eight correlations with “relevant” cell variables is .48 (as noted previously) while the average of the 240 correlations with “irrelevant” cell variables is .20 (p. 236).

For the criteria, a similar but somewhat smaller difference exists: The forty-eight correlations with “relevant” cell variables average .37, while the 320 correlations with “irrelevant” cell variables average .27. Furthermore, these differences are not reversed for any of the fourteen concern measures considered individually (p. 236).

Exhibit 7.5 shows regression weights from a multiple regression with criteria as predictors (top) and with domains as predictors (bottom). At the top, the key criteria are fun and accomplishments. Standard of living matters only for evaluations of housing and national government; beauty matters for housing and neighbourhood. At the bottom, housing, family, and free time contribute to fun, and job and free time contribute to accomplishments. Free time probably does so by means of hobbies or volunteering. For life in general, fun and accomplishment (top) and housing, family, free time, and job (bottom) are the key predictors.

EXPLORATION 7: COMPARISONS BETWEEN ONE’S OWN WELL-BEING AND THAT OF OTHERS

When Life-as-a-whole is being assessed, the consistent finding is that most people think they are better off than either other people in general (“all the adults in the U.S.”) or their nearest same-sexed neighbor. (p. 240).

This finding is probably just the typical better-than-average effect that is obtained for desirable traits. One explanation for it is that people do not overestimate themselves, but rather underestimate others, and do not sufficiently adjust the comparison. After all, if A&W are right we do not know much about the well-being of others, especially when we do not know them well.

Interestingly, the results switch for national government, which receives low ratings. So, here the adjustment problem works in the opposite direction and respondents underestimate how dissatisfied others are with the government.

EXPLORATION 8: JUDGMENTS OF THE “IMPORTANCE” OF CONCERNS

“One of the hypotheses with which we started was that the relative importance a person assigned to various life concerns should be taken into account when combining concern-level evaluations to predict feelings about Life-as-a-whole. The hypothesis is based on the expectation that when forming evaluations of overall well-being people would give greater “weight” to those concerns they felt were important, and less weight to those they regarded as less significant. As described in chapter 4, a careful examination of this hypothesis showed it to be untrue.”

In one analysis we looked to see whether the mean importance assigned to a given concern bore any relationship to its association with feelings about Life-as-a-whole. If our original hypothesis had been correct, one would have expected a high relationship here; feelings about Life-as-a-whole would have had more to do with feelings about the important concerns than with feelings about the others. Using the data from our colleagues’ survey the answer was essentially “no.” Over ten concerns, the rank correlation between mean importance and the size of the simple bivariate relationship (measured by the eta statistic) was -.39: There was a modest tendency for the concerns that had higher relationships to Life-as-a-whole to be judged less important. When we performed the same analysis using a more complex multivariate relationship derived by holding constant the effects of all other nine concerns (measured by the beta statistic from Multiple Classification Analysis), the rank correlation was +.15. A similar analysis in the July data produced a rank correlation of +.30 between the importance of concerns and the size of their (bivariate) relationships to the Life 3 measure. It seems clear that the mean importance assigned to a concern has little to do with the relationship between that concern and feelings about Life-as-a-whole (p. 243).

This is a puzzling finding and seems to undermine A&W’s bottom-up model of well-being perceptions. One problem is that Pearson correlations are sensitive to the amount of variance and the distribution of variables. For example, health could be important, but because it is important it is at high levels for most respondents. As a result, the Pearson correlation with perceived well-being would be low, which it actually is. A different kind of correlation coefficient or analysis would be needed for a better test of the hypothesis that more important domains are stronger predictors of well-being perceptions.
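A small simulation sketch illustrates the restriction-of-range point: the same underlying relation yields a smaller Pearson correlation when the predictor’s variance is restricted. All numbers are arbitrary and only serve to illustrate the attenuation.

```r
# Minimal simulation: restricting the variance of a domain (e.g., health,
# which is high for most respondents) attenuates its Pearson correlation with
# global well-being, even if the underlying relation is unchanged.
set.seed(123)
n <- 10000
health    <- rnorm(n)
wellbeing <- 0.5 * health + rnorm(n, sd = sqrt(1 - 0.25))

cor(health, wellbeing)                           # full-range correlation, about .50

restricted <- health > 0                         # keep only the healthier half
cor(health[restricted], wellbeing[restricted])   # noticeably smaller correlation
```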

Further insight about the meaning of importance judgments emerged when we checked to see whether the importance assigned to a concern has anything to do with the position of the concern in the psychological structure. We compared the importance data for the ten concerns as assessed in our colleagues’ survey with the position of those concerns in the structural maps derived from our own national sample of May respondents (see Exhibit 2.4). There was a distinct tendency for concerns that were closer to Self and Family to receive higher importance ratings than those that were more remote (rho = .52). When the analysis was repeated using the importance of the concerns as judged by our July respondents, a parallel result emerged (rho = .43). Still a third version of the analysis took the importance of the concerns as judged by the July respondents and checked the location of the concerns in the plot derived from these same respondents (Exhibit 2.2). Here the relationship was somewhat higher (rho = .59). We conclude that importance ratings are substantially linked to the position of the concern in the perceptual structure, and that concerns that are seen as being closely associated with oneself and one’s family tend to be ranked as more important than others. (p. 244)

This is an interesting observation, but A&W do not elaborate further on it. Thus, it remains a mystery why respondents rate some domains as more important, even though variation in these domains is not a stronger predictor of well-being evaluations.

END OF PART I

Conclusion

A&W provided a groundbreaking and exemplary examination of the construct validity of global well-being ratings. They presented a coherent theory that assumes global well-being judgments are integrative, mostly additive evaluations of several life domains based on several criteria. Surprisingly, they found that weighting domains by importance did not improve predictions. They also tested a multi-trait-multi-method model to separate valid variance from method variance in well-being ratings. They concluded that two-thirds of the variance in self-ratings is valid, but that only 15% of the variance in informant ratings is valid. Based on these results, they concluded that even a single global well-being rating is a valid measure of individuals’ true feelings about their lives, which nowadays we would rather call attitudes towards their lives.

It is unfortunate that few well-being researchers have tried to build on A&W’s seminal work. To my knowledge, I am the only one who has fitted MTMM models to self-ratings and informant ratings of well-being to separate valid variance from systematic measurement error. Ironically, A&W’s impressive results may be the reason why further validation research has been neglected. However, A&W made some mistakes in their MTMM model and never explained the inconsistency between the bottom-up theory and their finding that importance weights do not improve prediction. Unfortunately, it is possible that their model was wrong, that a much larger portion of the variance in single-item well-being measures is method variance, and that bottom-up effects on these measures are relatively weak and do not reflect the true importance of life circumstances for individuals’ well-being. Health is a particularly relevant domain. According to A&W’s results, variation in health satisfaction has relatively little unique effect on global well-being ratings. Does this really mean that health is unimportant? Does this mean that the massive increase in health spending over the past years is a waste of money? Or does it mean that global life evaluations are not as valid as we think they are and that they fail to capture the relative importance of life domains for individuals’ well-being?

Brian Nosek explains the IAT

I spent 20 minutes, actually more than 20 minutes because I had to rewind to transcribe, listening to a recent podcast in which Brian Nosek was asked some questions about the IAT and implicit bias training (The Psychology Podcast, August 1, 2019).

Scott Barry Kaufman: How do you see the IAT now and how did you see it when you started work on Project Implicit? How discrepant are these states of mind?

Brian Nosek: I hope I have learned a lot from all the research that we have done on it over the years. In the big picture I have the same view that I have had since we did the first set of studies. It is a great tool for research purposes and we have been able to learn a lot about the tool itself and about human behavior and interaction with the tool and a lot about the psychology of things that are [gap] occur with less control AND less awareness than just asking people how they feel about topics. So that has been and continues to be a very productive research area for trying to understand better how humans work.

And then the main concern that we had at onset and that is actually a lot of the discussion of even creating the website is the same anticipated some of the concerns and overuses that happened with the IAT in the present and that is the natural – I don’t know if natural is the right word – the common desire that people have for simple solutions and thinking well a measure is a direct indicator of something that we care about and it shouldn’t have any error in measurement and it should be applicable to lots and lots of situations.  And thus lots of potential of misuse of the IAT despite it being a very productive research tool and education too.  I like the experience of doing it and delivering to an audience and the discussion it provokes; what is it that it means, what does it mean about me, what does it mean about the world; those are really productive intellectual discussions and debates.  But the risk part the overapplication of the IAT for selection processes. We should use this. We should [?] use this for deciding who gets a job or not; we should [?] use this who is on a jury or not. Those are the kind of real-world applications of it as a measure that go far beyond its validity.  And so this isn‘t exact answering your question because even at the very beginning when we launched the website we said explicitly it should not be used for these purposes and I still believe this to be true. What has changed over time is the refinement of where it is we understand the evidence base against some of the major questions. And what is amazing about it is that there has been so much research and we still don’t have a great handle on really big questions relating to the IAT and measures like it.  So this is just part of [unclear]  how hard it is to actually make progress in the study of human behavior.   

Scott Barry Kaufman:  Let’s talk shop for a second [my translation; enough with the BS]. My dissertation at Yale a couple of year after years was looking at the question are there individual differences in implicit cognition.  And the idea was to ask this question because from a trait perspective I felt that was a huge gap in the literature. There was so much research on the reliability and validity of IQ tests for instance, but I wanted to ask the question if we adapt some of these implicit cognition measures from the social psychological experimental literature for an individual differences paradigm you know are they reliable and stable differences. And I have a whole appendix of failed experiments – by the way, you should tell how to publish that some day but we’ll get to that in a second, but so much of my dissertation, I am putting failed in quotes because you know I mean that was useful information … it was virtually impossible to capture reliable individual differences that cohered over time but I did find one that did and I published that as a serial reaction time task, but anyway, before we completely lose my audience which is a general audience I just want to say that I am trying to link this because for me one of the things that I am most wary about with the IAT is like – and this might be more of a feature than a bug – but it may be capturing at this given moment in time when a person is taking the test it is capturing a lot of the societal norms and influences are on that person’s associations but not capturing so much an intrinsic sort of stable individual differences variable. So I just wanted to throw that out and see what your current thoughts on that are.

Brian Nosek:   Yeah, it is clear that it is not trait like in the same way that a measure like the Big Five for personality is trait-like.  It does show stability over time, but much more weakly than that.  Across a variety of topics you might see a test-retest correlation for the IAT measuring the same construct of about .5  The curiosity for this is;  I guess it is a few curiosities. One is does that mean we have have some degree of trait variance because there is some stability over time and what is the rest? Is the rest error or is it state variance in some way, right. Some variation that is meaningful variation that is sensitive to the context of measurement. Surely it is some of both, but we don’t know how much. And there isn’t yet a real good insight on where the prediction components of the IAT are and how it anticipates behavior, right.  If we could separate in a real reliable way the trait part, the state part, and the error part, than we should be able to uniquely predict different type of things between the trait, the state, and the trait components. Another twist which is very interesting that is totally understudied in my view is the variations in which it is state or trait like seems to vary by the topic you are investigating. When you do a Democrat – Republican IAT, to what extent do people favor one over the other, the correlation with self-report is very strong and the stability over time is stronger than when you measure Black-White or some of the other types of topics. So there is also something about the attitude construct itself that you are assessing that is not as much measurement based but that is interacting with the measure that is anticipating the extent to which it is trait or state like. So these are all interesting things that if I had time to study them would be the problems I would be studying, but I had to leave that aside

Scott Barry Kaufman: You touch on a really interesting point about this. How would you measure the outcome of this two-day or week- training thing? It seems that would not be a very good thing to then go back to the IAT and see a difference between the IAT, IAT pre and IAT-post, doesn’t seem like the best outcome you know you’d want, I mean you ….

Brian Nosek: I mean you could just change the IAT and that would be the end of it. But, of course, if that doesn’t actually shift behavior then what was the point?

Scott Barry Kaufman:  to what extent are we making advances in demonstrating that there are these implicit influences on explicit behavior that are outside of our value system? Where are we at right now? 

[Uli, coughs, Bargh, elderly priming]

Brian Nosek: Yeah, that is a good question. I cannot really comment on the micro-aggression literature. I don’t follow that as a distinct literature, but on the general point I think it is the big picture story is pretty clear with evidence which is we do things with automaticity, we do things that are counterproductive to our interests all the time, and sometimes we recognize we are doing it, sometimes we don’t, but a lot of time it is not controllable.  But that is a very big picture, very global, very non-specific point.

If you want to find out what 21 years of research on the IAT have shown, you can read my paper (Schimmack, in press, PoPS). In short,

  • Most of the variance in the race IAT (Black-White) is random and systematic measurement error.
  • Up to a quarter of the variance reflects racial attitudes that are also reflected in self-report measures of racial attitudes, most clearly in direct ratings of feelings towards Blacks and Whites.
  • There is little evidence that any of the variance in IAT scores reflects implicit attitudes that are outside of people’s awareness.
  • There is no reliable evidence that IAT scores predict discriminatory behavior in the real world.
  • Visitors of Project Implicit are given invalid feedback that they may hold unconscious biases and are not properly informed about the poor psychometric properties of the test.
  • Founders of Project Implicit have not disclosed how much money they make from speaking engagements related to Project Implicit or from royalties for the book “Blindspot,” and they do not declare conflicts of interest in IAT-related publications.
  • It is not without irony that educators on implicit bias may fail to realize their own implicit bias in reading the literature and dismissing criticism.

How Valid are Short Big-Five Scales?

The first measures of the Big Five used a large number of items to measure personality. This made it difficult to include personality measures in studies because the assessment of personality would take up all of the survey time. Over time, shorter scales became available. One important short Big Five measure is the BFI-S (Lang et al., 2011). This 15-item measure has been used in several nationally representative, longitudinal studies such as the German Socio-Economic Panel (Schimmack, 2019a). These results provide unique insights into the stability of personality (Schimmack, 2019b) and the relationship of personality with other constructs such as life-satisfaction (Schimmack, 2019c). Some of these results overturn textbook claims about personality. However, critics argue that these results cannot be trusted because the BFI-S is an invalid measure of personality.

Thus, it is of critical importance to evaluate the validity of the BFI-S. Here I use Gosling and colleagues’ data to examine the validity of the BFI-S. Previously, I fitted a measurement model to the full 44-item BFI (Schimmack, 2019d). It is straightforward to evaluate the validity of the BFI-S by examining the correlations of the 3-item BFI-S scale scores with the latent factors based on all 44 BFI items. For comparison purposes, I also show the correlations for the BFI scale scores. The complete results for individual items are shown in the previous blog post (Schimmack, 2019d).

The measurement model for the BFI has seven independent factors. Five factors represent the Big Five and two factors represent method factors. One factor represents acquiescence bias. The other factor represents evaluative bias that is present in all self-ratings of personality (Anusic et al., 2009). As all factors are independent, the squared coefficients can be interpreted as the amount of variance that a factor explains in a scale score.

The results show that the BFI-S scales are nearly as valid as the longer BFI scales (Table 1).

Table 1. Correlations of the BFI and BFI-S Scale Scores with the Latent Factors

Scale     #Items    N      E      O      A      C     EVB    ACQ
N-BFI        8    0.79  -0.08  -0.01  -0.05  -0.02  -0.42   0.05
N-BFI-S      3    0.77  -0.13  -0.05   0.07  -0.04  -0.29   0.07
E-BFI        8   -0.02   0.83   0.04  -0.05   0.00   0.44   0.06
E-BFI-S      3    0.05   0.82   0.00   0.04  -0.07   0.32   0.07
O-BFI       10    0.04  -0.03   0.76  -0.04  -0.05   0.36   0.19
O-BFI-S      3    0.09   0.00   0.66  -0.04  -0.10   0.32   0.25
A-BFI        9   -0.07   0.00  -0.07   0.78   0.03   0.44   0.04
A-BFI-S      3   -0.03  -0.06   0.00   0.75   0.00   0.33   0.09
C-BFI        9   -0.05   0.00  -0.05   0.04   0.82   0.42   0.03
C-BFI-S      3   -0.09   0.00  -0.02   0.00   0.75   0.44   0.06

For example, the factor-scale correlations for neuroticism, extraversion, and agreeableness are nearly identical. The biggest difference was observed for openness with a correlation of r = .76 for the BFI-scale and r = .66 for the BFI-S scale. The only other notable systematic variance in scales is the evaluative bias influence which tends to be stronger for the longer scales with the exception of conscientiousness. In the future, measurement models with an evaluative bias factor can be used to select items with low loadings on the evaluative bias factor to reduce the influence of this bias on scale scores. Given these results, one would expect that the BFI and BFI-S produce similar results. The next analyses tested this prediction.
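Because the factors are independent, these variance shares follow directly from squaring the tabled coefficients. Below is a minimal sketch in plain Python, using only the openness rows of Table 1 and ignoring the tiny secondary correlations for brevity, to make the decomposition explicit.

```python
# Squaring a factor-scale correlation gives the share of scale-score variance
# explained by that factor, because the seven factors are independent.
# Coefficients copied from Table 1 (openness rows only; small secondary
# correlations with the other Big Five factors are ignored here).
factor_scale_r = {
    "O-BFI":   {"O": 0.76, "EVB": 0.36, "ACQ": 0.19},
    "O-BFI-S": {"O": 0.66, "EVB": 0.32, "ACQ": 0.25},
}

for scale, rs in factor_scale_r.items():
    shares = {factor: round(r ** 2, 2) for factor, r in rs.items()}
    residual = round(1 - sum(shares.values()), 2)
    print(scale, shares, "remaining (secondary factors + random error):", residual)

# O-BFI:   {'O': 0.58, 'EVB': 0.13, 'ACQ': 0.04}  remaining: 0.25
# O-BFI-S: {'O': 0.44, 'EVB': 0.10, 'ACQ': 0.06}  remaining: 0.40
```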

Gender Differences

I examined gender differences three ways. First, I examined standardized mean differences at the level of latent factors in a model with scalar invariance (Schimmack, 2019d). Second, I computed standardized mean differences with the BFI scales. Finally, I computed standardized mean differences with the BFI-S scales. Table 2 shows the results. Results for the BFI and BFI-S scales are very similar. The latent mean differences show somewhat larger differences for neuroticism and agreeableness because these mean differences are not attenuated by random measurement error. The latent means also show very small gender differences for the method factors. Thus, mean differences based on scale scores are not biased by method variance.

Table 2. Standardized Mean Differences between Men and Women

          N      E      O      A      C     EVB    ACQ
Factor   0.64   0.17  -0.18   0.31   0.15   0.09   0.16
BFI      0.45   0.14  -0.10   0.20   0.14
BFI-S    0.48   0.21  -0.03   0.18   0.12

Note. Positive values indicate higher means for women than for men.

In short, there is no evidence that using 3-item scales invalidates the study of gender differences.
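For readers who want to reproduce the scale-score comparisons with their own data, the values reported for the BFI and BFI-S rows are standardized mean differences (Cohen’s d). Here is a small sketch of that computation in plain Python; the column names in the commented usage are hypothetical, and the latent-factor differences in the first row of Table 2 come from the SEM, not from this formula.

```python
import numpy as np

def standardized_mean_difference(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical usage with a data frame that has 'gender' and 'N_BFI_S' columns:
# d_neuroticism = standardized_mean_difference(
#     df.loc[df["gender"] == "female", "N_BFI_S"],
#     df.loc[df["gender"] == "male", "N_BFI_S"])
```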

Age Differences

I demonstrated measurement invariance for different age groups (Schimmack, 2019d). Thus, I used simple correlations to examine the relationship between age and the Big Five. I restricted the age range from 17 to 70. Analyses of the full dataset suggest that older respondents have higher levels of conscientiousness and agreeableness (Soto, John, Gosling, & Potter, 2011).

Table 3 shows the results. The BFI and the BFI-S both show the predicted positive relationship with conscientiousness, and the effect size is practically identical. The effect size for the latent variable model is stronger because the relationship is not attenuated by random measurement error. Other relationships are weaker and also consistent across measures, except for Openness. The latent variable model reveals the reason for the discrepancy. Three items (#15 ingenious, #35 like routine work, and #10 sophisticated in art) showed unique relationships with age; the art-related items in particular showed unique relationships with age. The latent factor does not include the unique content of these items and shows a positive relationship between openness and age. The scale scores include this content and show a weaker relationship. The positive relationship of openness with age for the latent factor is rather surprising as it is not found in nationally representative samples (Schimmack, 2019b). One possible explanation for this relationship is that older individuals who take an online personality test are more open.

Table 3. Correlations of Age with the Big Five Factors and Scales

          N      E      O      A      C     EVB    ACQ
Factor  -0.08  -0.02   0.18   0.12   0.33   0.01  -0.11
BFI     -0.08  -0.01   0.08   0.09   0.26
BFI-S   -0.08  -0.04  -0.02   0.08   0.25

In sum, the most important finding is that the 3-item BFI-S conscientiousness scale shows the same relationship with age as the BFI-scale and the latent factor. Thus, the failure to find aging effects in the longitudinal SOEP data with the BFI-S cannot be attributed to the use of an invalid short measure of conscientiousness. The real scientific question is why the cross-sectional study by Soto et al. (2011) and my analysis of the longitudinal SOEP data show divergent results.

Conclusion

Science has changed now that researchers are able to communicate and discuss research findings on social media. I strongly believe that open science outside of peer-controlled journals is beneficial for the advancement of science. However, the downside of open science on social media is that it becomes more difficult to evaluate the expertise of online commentators. True experts are able to back up their claims with scientific evidence. This is what I did here. I showed that Brenton Wiernik’s comment has as much scientific validity as a Donald Trump tweet. Whatever the reason for the lack of personality change in the SOEP data may be, it is not the use of the BFI-S to measure the Big Five.

Personality Measurement with the Big Five Inventory

In one of the worst psychometric articles ever published (although the authors still have a chance to retract their in-press article before it is actually published), Hussey and Hughes argue that personality psychologists intentionally fail to test the validity of personality measures. They call this practice validity-hacking. They also conduct some psychometric tests of popular personality measures and claim that these measures fail to demonstrate structural validity.

I have demonstrated that this claim is blatantly false and that the authors failed to conduct a proper test of structural validity (Schimmack, 2019a). That is, the authors fitted a model to the data that is known to be false. Not surprisingly, they found that their model didn’t meet standard criteria of model fit. This is exactly what should happen when a false model is subjected to a test of structural validity. Bad models should not fit the data. However, a real test of structural validity requires fitting a plausible model to the data. I already demonstrated with several Big Five measures that these measures have good structural validity and that scale scores can be used as reasonable measures of the latent constructs (Schimmack, 2019b). Here I examine the structural validity of the Big Five Inventory (Oliver John) that was used by Hussey and Hughes.

While I am still waiting to receive the actual data that were used by Hussey and Hughes, I obtained a much larger and better dataset from Sam Gosling that includes data from 1 million visitors to a website that provides personality feedback (https://www.outofservice.com/bigfive/).

For the present analyses I focused on the subgroup of Canadian visitors with complete data (N = 340,000). Subsequent analyses can examine measurement invariance with the US sample and samples from other nations. To examine the structure of the BFI, I fitted a structural equation model. The model has seven factors. Five factors represent the Big Five personality traits. The other two factors represent rating biases. One bias is an evaluative bias and the other bias is acquiescence bias. Initially, loadings on the method factors were fixed. This basic model was then modified in three ways. First, item loadings on the evaluative bias factor were relaxed to allow some items to show more or less evaluative bias. Second, secondary loadings were added to allow some items to be influenced by more than one factor. Finally, residuals of items from the same construct were allowed to covary to account for similar wording or shared meaning (e.g., the three art items of the openness factor were allowed to covary). The final model and the complete results can be found on OSF (https://osf.io/23k8v/).
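To give a concrete, if simplified, picture of what such a model looks like in code, here is a rough sketch in Python using the semopy package with lavaan-style syntax. This is not the model posted on OSF: the item names n1, e1, and so on are placeholders for a subset of the 44 BFI items, acquiescence loadings are fixed to 1, all factors are specified as orthogonal, and I am assuming that semopy accepts lavaan-style fixed parameters (the value* notation).

```python
import pandas as pd
import semopy

# Placeholder item names (three per trait, standing in for the 44 BFI items).
items = {"N": ["n1", "n2", "n3"], "E": ["e1", "e2", "e3"], "O": ["o1", "o2", "o3"],
         "A": ["a1", "a2", "a3"], "C": ["c1", "c2", "c3"]}
all_items = [i for group in items.values() for i in group]

measurement = "\n".join(f"{trait} =~ " + " + ".join(group)
                        for trait, group in items.items())
evaluative   = "EVB =~ " + " + ".join(all_items)                      # evaluative bias loads on every item
acquiescence = "ACQ =~ " + " + ".join(f"1*{i}" for i in all_items)    # acquiescence loadings fixed to 1

factors = list(items) + ["EVB", "ACQ"]
orthogonal = "\n".join(f"{f1} ~~ 0*{f2}"                              # all factors uncorrelated
                       for k, f1 in enumerate(factors) for f2 in factors[k + 1:])

model_desc = "\n".join([measurement, evaluative, acquiescence, orthogonal])

data = pd.read_csv("bfi_canada_items.csv")   # hypothetical file: one column per item, one row per respondent
model = semopy.Model(model_desc)
model.fit(data)
print(semopy.calc_stats(model))              # global fit indices (CFI, RMSEA, ...)
print(model.inspect())                       # loadings and residual variances
```

The final model on OSF additionally frees some evaluative-bias loadings, adds secondary loadings, and allows selected residual covariances, so this sketch should be read as a starting point rather than a reproduction.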

Model fit was acceptable, CFI = .953, RMSEA = .030, SRMR = .032. In contrast, fitting a simple structure without method factors produced unacceptable fit on all three fit indices, CFI = .734, RMSEA = .068, SRMR = .110. This shows that the model specification by Hussey and Hughes accounted for the bad fit. It has been known for over 20 years that a simple structure does not fit Big Five data (McCrae et al., 1996). Thus, Hussey and Hughes’s claim that the BFI lacks validity is based on an outdated and implausible measurement model.

Table 1 shows the factor loading pattern for the 44 BFI items on the Big Five factors and the two method factors. It also shows the contribution of the seven factors to the scale scores that are used to provide visitors with personality feedback and in many research articles that use scale scores as proxies for the latent constructs.

Item#NEOACEVBACQ
Neuroticism
depressed/blue40.33-0.150.20-0.480.06
relaxed9-0.720.230.18
tense140.51-0.250.20
worry190.60-0.080.07-0.210.17
emotionally stable24-0.610.270.18
moody290.43-0.330.18
calm34-0.58-0.04-0.14-0.120.250.20
nervous390.52-0.250.17
SUM0.79-0.08-0.01-0.05-0.020.420.05
Extraversion
talkative10.130.70-0.070.230.18
reserved6-0.580.09-0.210.18
full of energy110.34-0.110.580.20
generate enthusiasm160.070.440.110.500.20
quiet21-0.810.04-0.210.17
assertive26-0.090.400.14-0.240.180.240.19
shy and inhibited310.180.64-0.220.17
outgoing360.720.090.350.18
SUM-0.020.830.04-0.050.000.440.06
Openness 
original50.53-0.110.380.21
curious100.41-0.070.310.24
ingenious 150.570.090.21
active imagination200.130.53-0.170.270.21
inventive25-0.090.54-0.100.340.20
value art300.120.460.090.160.18
like routine work35-0.280.100.13-0.210.17
like reflecting40-0.080.580.270.21
few artistic interests41-0.26-0.090.15
sophisticated in art440.070.44-0.060.100.16
SUM0.04-0.030.76-0.04-0.050.360.19
Agreeableness
find faults w. others20.15-0.42-0.240.19
helpful / unselfish70.440.100.290.23
start quarrels 120.130.20-0.50-0.09-0.240.19
forgiving170.47-0.140.240.19
trusting 220.150.330.260.20
cold and aloof27-0.190.14-0.46-0.350.17
considerate and kind320.040.620.290.23
rude370.090.12-0.63-0.13-0.230.18
like to cooperate420.15-0.100.440.280.22
SUM-0.070.00-0.070.780.030.440.04
Conscientiousness
thorough job30.590.280.22
careless 8-0.17-0.51-0.230.18
reliable worker13-0.090.090.550.300.24
disorganized180.15-0.59-0.200.16
lazy23-0.52-0.450.17
persevere until finished280.560.260.20
efficient33-0.090.560.300.23
follow plans380.10-0.060.460.260.20
easily distracted430.190.09-0.52-0.220.17
SUM-0.050.00-0.050.040.820.420.03

Most of the secondary loadings are very small, although they are statistically highly significant in this large sample. Most items also have their highest loading on the primary factor. Exceptions are the items depressed/blue, full of energy, and generate enthusiasm, which have higher loadings on the evaluative bias factor. Except for two openness items, all items also have loadings greater than .3 on the primary factor. Thus, the loadings are consistent with the intended factor structure.

The most important results are the loadings of the scale scores on the latent factors. As the factors are all independent, squaring these coefficients shows the amount of variance explained by each factor. By far the largest variance component is the intended construct, with correlations ranging from .76 for openness to .83 for extraversion. Thus, the lion’s share of the reliable variance in scale scores reflects the intended construct. The next biggest contributor is evaluative bias, with correlations ranging from .36 for openness to .44 for extraversion. Although this means that only about 13 to 19 percent of the total variance in scale scores reflects evaluative bias, this systematic variance can produce spurious correlations when scale scores are used to predict other self-report measures (e.g., life satisfaction, Schimmack, 2019c).

In sum, a careful psychometric evaluation of the BFI shows that the BFI has good structural validity. The key problem is the presence of evaluative bias in scale scores. Although this requires caution in the interpretation of results obtained with BFI scales, it doesn’t justify the conclusion that the BFI is invalid.

Measurement Invariance

Hussey and Hughes also examined measurement invariance across age-groups and the two largest gender groups. They claimed that the BFI lacks measurement invariance, but this claim was based on a cunning misrepresentation of the results (Schimmack, 2019a). The claim rests on the fact that the simple-structure model does not fit in any group. However, fit did not decrease when measurement invariance was imposed across groups. Thus, all groups showed the same structure, and constraining parameters to be equal across groups did not worsen fit, but this fact was hidden in the supplementary results.

I replicated their analyses with the current dataset. First, I fitted the model developed for the whole sample separately to the male and female samples. Fit for the male sample was acceptable, CFI = .949, RMSEA = .029, SRMR = .033. So was fit for the female sample, CFI = .947, RMSEA = .030, SRMR = .037.

Table 2 shows the results side by side. There are no notable differences between the parameter estimates for males and females (m/f). This finding replicates results with other Big Five measures (Schimmack, 2019a).

Item#NEOACEVBACQ
Neuroticism
depressed/blue4.33/ .30-.18/-.11.19/ .20-.45/-.50.07/.05
relaxed9-.71/-.72.24/ .23.19/.18
tense14.52/ .49-.17/-.14.11/ .13-.27/-.32.20/ .20
worry19.58/ .57-.10/-.08.05/ .07-.22/-.22.17/ .17
emotionally stable24-.58/-.58.10/ .06.25/ .30.19/ .17
moody29.41/ .38-.26/-.25-.30/-.38.18/ .18
calm34-.55/-.59-.02/-.03.14/ .13.12/ .13-.27/-.24.21/ .19
nervous39.51/ .49-.21/.26-.10/-.10.08/ .08-.11/-.11-.27/-.25.18/ .17
SUM.78/ .77-.09/-.08-.01/-.01-.07/-.05-.02/-.02-.42-.46.05/ .04
Extraversion
talkative1.09/ .11.69/ .70-.10/-.08.24/ .24.19/ .18
reserved6-.55/-.60.08/.10.21/ .22.19/ .18
full of energy11.33/ .32-.09/-.04.56/ .59.21/ .20
generate enthusiasm16.04/ .03.44/ .43.12/ .13.48/ .50.20/ .20
quiet21-.79/-.82.03/ .04-.22/-.21.17/ .16
assertive26-.08/-.10.39/ .40.12/ .14-.23/-.25.18/ .17.26/ .24.20/ .18
shy and inhibited31.19/ .15.61/ .66.23/ .22.18/ .17
outgoing36.71/ .71.10/ .07.35/ .38.18/ .18
SUM-.02/-.02.82/ .82.04/ .05-.04-.06.00/ .00.45/ .44.07/ .06
Openness 
original5.50/ .54-.12/-.12.40/ .39.22/ .20
curious10.40/ .42-.05/-.08.32/ .30.25/ .23
ingenious 150.00/0.00.60/ .56.18/ .16.10/ .04.22/ .20
active imagination20.50/ .55-.07/-.06-.17/-.18.29/ .26.23/ .21
inventive25-.07/ -.08.51/ .55-.12/-.10.37/ .34.21/ .19
value art30.10/ .03.43/ .52.08/ .07.17/ .14.18/ .19
like routine work35-.27/-.27.10/ .10.09/ .15-.22/-.21.17/ .16
like reflecting40-.09/-.08.58/ .58.28/ .26.22/ .20
few artistic interests41-.25/-.29-.10/-.09.16/ .15
sophisticated in art44.03/ .00.42/ .49-.08/-.08.09/ .09.16/ .16
SUM.01/ -.01-.01/-.01.74/ .78-.05/-.05-.03/-.06.38/ .34.20/ .19
Agreeableness
find faults w. others2.14/ .17-.42/-.42-.24/-.24.19/ .19
helpful / unselfish7.45/ .43.09/.11.29/ .29.23/ .23
start quarrels 12.12/ .16.23/ .18-.49/-.49-.07/-.08-.24/-.24.19/ .19
forgiving17.49/ .46-.14/-.13.25/ .24.20/ .19
trusting 22-.14/-.16.38/ .32.27/ .25.21/ .19
cold and aloof27-.20/-.18.14/ .12.44/ .46-.34/-.37.18/ .17
considerate and kind32.02/.01.62/.61.28/ .30.22/ .23
rude37.10/ .12.12/ .12-.62/-.62-.13/-.08-.23/-.23.19/ .18
like to cooperate42.18/ .11-.09/-.10.43/ .45.28/ .29.23/ .22
SUM-.07/-.08.00/ .00-.07/-.07.78/ .77.03/ .03.43/ .44.04/ .04
Conscientiousness
thorough job3.58/ .59.29/ .28.23/ .22
careless 8-0.16-.49/-.51.24/ .23.19/ .18
reliable worker13-.10/-.09.09/ .10.55/ .55.30/ .31.24/ .24
disorganized18.13/ .16-.58/-.59-.21/-.20.17/ .15
lazy23-.52/-.51-.45/-.45.18/ .17
persevere until finished28.54/ .58.27/ .25.21/ .19
efficient33-.11/-.07.52/ .58.30/ .29.24/ .23
follow plans38.00/ .00-.06/-.07.45/ .44.27/ .26.21/ .20
easily distracted43.17/ .19.07/ .06-.53/-.53-.22/-.22.18/ .17
SUM-.05/-.05-.01/-.01-.05/-.06.04/ .04.81/ .82.43/ .41.03/ .03

I then fitted a multi-group model with metric invariance. Despite the high similarity between the individual models, model fit decreased, CFI = .925, RMSEA = .033, SRMR = .062. Although RMSEA and SRMR were still good, the decrease in fit might be considered evidence that the invariance assumption is violated. However, Table 2 shows that it is insufficient to examine changes in global fit indices. What matters is whether the decrease in fit has any substantive meaning. Given the results in Table 2, this is not the case.

The next model imposed scalar invariance. Before presenting the results, it is helpful to know what scalar invariance implies. Take extraversion as an example. Assume that there are no notable gender differences in extraversion. However, extraversion has multiple facets that are represented by items in the BFI. One facet is assertiveness, and the BFI includes an assertiveness item. Scalar invariance implies that there cannot be gender differences in assertiveness if there are no gender differences in extraversion. This is an odd assumption because gender differences can occur at any level in the hierarchy of personality traits. Thus, evidence that scalar invariance is violated does not imply that we cannot examine gender differences in personality. Rather, it would require further examination of the pattern of mean differences at the level of the factors and the item residuals.
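In standard measurement-invariance notation (my notation, not taken from Hussey and Hughes), the constraints can be written as follows:

```latex
\begin{align*}
x_{ig} &= \tau_{ig} + \lambda_{ig}\,\xi_{g} + \varepsilon_{ig}
  && \text{(item $i$, group $g$)} \\
\text{metric invariance:}\quad \lambda_{ig} &= \lambda_{i}
  && \text{(equal loadings)} \\
\text{scalar invariance:}\quad \tau_{ig} &= \tau_{i}
  \;\Rightarrow\;
  \operatorname{E}[x_{ig}] = \tau_{i} + \lambda_{i}\,\kappa_{g}
  && \text{(equal intercepts)}
\end{align*}
```

Under scalar invariance, any group difference in an item mean has to be proportional to the group difference in the factor mean, kappa. A gender difference that is specific to the assertiveness facet would therefore surface as a violation even if the items otherwise work identically in both groups.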

However, imposing scalar invariance produced hardly any decrease in fit, CFI = .921, RMSEA = .034, SRMR = .063. Inspection of the modification indices showed the highest modification index for item O6 “valuing art” with an implied mean difference of 0.058. This implies that there are no notable gender differences at the item level. The pattern of mean differences at the factor level is consistent with previous studies, showing higher levels of neuroticism (d = .64) and agreeableness (d = .31) for women, although the difference in agreeableness is relatively small compared to some other studies.

In sum, the results show that the BFI can be used to examine gender differences in personality and that the pattern of gender differences observed with the BFI is not a measurement artifact.

Age Differences

Hussey and Hughes used a median split to examine invariance across age-groups. The problem with a median split is that online samples tend to be very young. Figure 1 shows the age distribution for the Canadian sample. The median age is 22.

To create two age-groups, I split the sample into a group of under 30 and 30+ participants. The unequal sample size is not a problem because both groups are large given the large overall sample size (young N = 221,801, old N = 88,713). A published article examined age differences in the full sample, but the article did not use SEM to test measurement invariance (Soto, John, Gosling, & Potter, 2011). Given the cross-sectional nature of the data, it is not clear whether age differences are cohort differences or aging effects. Longitudinal studies suggest that age differences may reflect generational changes rather than longitudinal changes over time (Schimmack, 2019d). In any case, the main point of the present analyses is to examine measurement invariance across different age groups.

Fit for the model with metric invariance was similar to the fit for the gender model, CFI = .927, RMSEA = .033, SRMR = .062. Fit for the model with scalar invariance was only slightly weaker for CFI and better for RMSEA. More important, inspection of the modification indices showed the largest difference for O10 “sophisticated in art” with a standardized mean difference of .068. Thus, there were no notable differences between the two age groups at the item level.

The results at the factor level reproduced the findings with scale scores by Soto et al. (2011). The older group had a higher level of conscientiousness (d = .61) than the younger group. Differences for the other personality dimensions were small. There were no notable differences in response styles.

In sum, the results show that the BFI shows reasonable measurement invariance across age groups. Contrary to the claims by Hussey and Hughes, this finding is consistent with the results reported in their own supplementary materials. These results suggest that BFI scale scores provide useful information about personality and that published articles that used scale scores produced meaningful results.

Conclusion

Hussey and Hughes accused personality researchers of validity hacking. That is, they claim that researchers do not report the results of psychometric tests because these tests would show that personality measures are invalid. This is a strong claim that requires strong evidence. However, closer inspection of this claim shows that the authors used an outdated measurement model and misrepresented the results of their invariance analyses. Here I showed that the BFI has good structural validity and shows reasonable invariance across gender and age groups. Thus, Hussey and Hughes’s claims are blatantly false.

So far, I have only examined the BFI, but I have little confidence in the authors’ conclusions about other measures like Rosenberg’s self-esteem scale. I am still waiting for the authors to share all of their data so that I can examine all of their claims. At present, there is no evidence of v-hacking. Of course, this does not mean that self-ratings of personality are perfectly valid. As I showed, self-ratings of the Big Five are contaminated with evaluative bias. I presented a measurement model that can test for the presence of these biases and that can be used to control for rating biases. Future validation studies might benefit from using this measurement model as a basis for developing better measures and better measurement models. Substantive articles might also benefit from using a measurement model rather than scale scores, especially when the BFI is used as a predictor of other self-report measures, in order to control for shared rating biases.

Measuring Well-Being in the SOEP

Psychology has a measurement problem. Big claims about personality, self-esteem, or well-being are based on sum-scores of self-ratings, or sometimes on a single rating. This would be a minor problem if thorough validation research had demonstrated that sum-scores of self-ratings are valid measures of the constructs they are intended to represent, but such validation research is often missing. As a result, the validity of widely used measures in psychology, and of claims based on these measures, is unknown.

The well-being literature is an interesting example of the measurement crisis because two opposing views about the validity of well-being measures co-exist. On the one hand, experimental social psychologists argue that life-satisfaction ratings are invalid and useless (Schwarz & Strack, 1999); a view that has been popularized by Nobel Laureate Daniel Kahneman in his book “Thinking, Fast and Slow” (cf. Schimmack, 2018). On the other hand, well-being scientists often assume that life-satisfaction ratings are near-perfect indicators of individuals’ well-being.

An editor of JPSP, which presumably means he or she is an expert, has no problem mentioning both positions in the same paragraph without noting the contradiction:

“There is a huge literature on well-being. Since Schwarz and Strack (1999), to take that arbitrary year as a starting point, there have been more than 11,000 empirical articles with “wellbeing” (or well-being or well being) in the title, according to PsychInfo. The vast majority of them, I submit, take the subjective evaluation of one’s own life as a perfectly valid and perhaps the best way to assess one’s own evaluation of one’s life.”

So, since Schwarz and Strack concluded that life-satisfaction judgments are practically useless, more than 11,000 articles have used life-satisfaction judgments as perfectly valid measures of life-satisfaction, and nobody thinks this is a problem. No wonder natural scientists don’t consider psychology a science.

The Validity of Well-Being Measures

Any attempt at validating well-being measures requires a definition of well-being that leads to testable predictions about correlations of well-being measures with other measures. Testing these predictions is called construct validation (Cronbach & Meehl, 1955; Schimmack, 2019).

The theory underlying the use of life-satisfaction judgments as measures of well-being assumes that well-being is subjective and that (healthy, adult) individuals are able to compare their actual lives to their ideal lives and to report the outcome of these comparison processes (Andrews & Withey, 1973; Diener, Lucas, Schimmack, & Helliwell, 2009).

One prediction that follows from this model is that global life-satisfaction judgments should be correlated with judgments of satisfaction in important life domains, but not in unimportant life domains. The reason is that satisfaction with life as a whole should be related to satisfaction with its (important) parts. It would make little sense for somebody to say that they are extremely satisfied with their life as a whole, but not satisfied with their family life, work, health, or anything else that matters to them. The whole point of asking a global question is the assumption that people will consider all important aspects of their lives and integrate this information into a global judgment (Andrews & Withey, 1973). The main criticism of Schwarz and Strack (1999) was that this assumption does not describe the actual judgment process and that actual life-satisfaction judgments are based on transient and irrelevant information (e.g., current mood, Schwarz & Clore, 1983).

Top-Down vs. Bottom-Up Theories of Global and Domain Satisfaction

To muddy the waters, Diener (1984) proposed on the one hand that life-satisfaction judgments are, at least somewhat, valid indicators of life-satisfaction, while also proposing that correlations between satisfaction with life as a whole and satisfaction with domains might reflect a top-down effect.

A top-down effect implies that global life-satisfaction influences domain satisfaction. That is, health satisfaction is not a cause of life-satisfaction because good health is an important part of a good life. Instead, life-satisfaction is a content-free feeling of satisfaction that creates a halo in evaluations of specific life aspects independent of the specific evaluations of a life domain.

Diener overlooked that top-down processes invalidate life-satisfaction judgments as measures of well-being, because a top-down model implies that global life-satisfaction judgments reflect only a general disposition to be satisfied without information about the actual satisfaction in important life domains. In the context of a measurement model, we can see that the top-down model implies that life-satisfaction judgments only capture the shared variance among specific life-satisfaction judgments, but fail to represent the part of satisfaction that reflects unique variance in satisfaction with specific life domains. In other words, top-down models imply that well-being does not encompass evaluations of the parts that make up an individual’s entire life.
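The contrast becomes clearer when the two pure cases are written out in stylized notation of my own, with G a content-free satisfaction disposition, DS_j the domain satisfaction judgments, and LS the global judgment:

```latex
\begin{align*}
\text{pure top-down:} \quad
  DS_{j} &= \lambda_{j}\,G + u_{j}, \qquad LS = \beta\,G + e \\
\text{pure bottom-up:} \quad
  LS &= \textstyle\sum_{j} w_{j}\,DS_{j} + e
\end{align*}
```

In the top-down case the unique domain components u_j never reach the global judgment, so the global rating carries no information about how people are actually doing in specific important domains. This is exactly why a top-down interpretation undermines the use of life-satisfaction ratings as measures of well-being.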

It does not help that measurement models in psychology often treat unique or residual variances as error variances, which are then omitted from figures. In the figure, the residual variances are shown; they represent variation in life aspects that is not shared across domains.

Some influential articles that examined top-down and bottom-up processes have argued in favor of top-down processes without noticing that this invalidates the use of life-satisfaction judgments as indicators of well-being or at least requires a radically different conception of well-being (well-being is being satisfied independent of how things are actually going in your life) (Heller, Watson, & Ilies, 2004).

An Integrative Top-Down vs. Bottom-Up Model

Brief et al. (1993) proposed an integrative model of top-down and bottom-up processes in life-satisfaction judgments. The main improvement of this model was to distinguish between a global disposition to be more satisfied and a global judgment of important aspects of life. As life-satisfaction judgments are meant to represent the latter, life-satisfaction judgments are the ultimate outcome of interest, not a measure of the global disposition. Brief et al. (1993) used neuroticism as an indicator for the global disposition to be less satisfied, but there are probably other factors that can contribute to a general disposition to be satisfied. The integrative model assumes that any influence of the general disposition is mediated by satisfaction with important life domains (e.g., health).

FIGURE 1. DisSat = Dispositional Satisfaction, DS1 = Domain Satisfaction 1 (e.g., health), DS2 = Domain Satisfaction 2, DS3 = Domain Satisfaction 3, LS = Life-Satisfaction.

It is important to realize that the mediation model separates two variances in domain satisfaction judgments, namely the variance that is explained by dispositional satisfaction and the variance that is not explained by dispositional satisfaction (residual variance). Both variances contribute to life-satisfaction. Thus, objective aspects of health that contribute to health satisfaction can also influence life-satisfaction. This makes the model an integrative model that allows for top-down and bottom-up effects.

One limitation of Brief et al.’s (1993) model was the use of neuroticism as the sole indicator of dispositional satisfaction. While it is plausible that neuroticism is linked to more negative perceptions of all kinds of life aspects, it may not be the only trait that matters.

Another limitation was the use of health satisfaction as the only life domain. If people also care about other life domains, other domain satisfactions should also contribute to life-satisfaction, and they could be additional mediators of the influence of neuroticism on life-satisfaction. For example, neurotic individuals might also worry more about money, and financial satisfaction could influence life-satisfaction, making financial satisfaction another mediator of the influence of neuroticism on life-satisfaction.

One advantage of structural equation modeling is the ability to study constructs that do not have a direct indicator. This makes it possible to examine top-down effects without “direct” indicators of dispositional satisfaction. The reason is that dispositional satisfaction should influence satisfaction with various life domains. Thus, dispositional satisfaction is reflected in the shared variance among different domain satisfaction judgments and domain satisfaction judgments serve as indicators that can be used to measure dispositional satisfaction (see Figure 2).

Domain Satisfactions in the SOEP

It is fortunate that the creators of the Socio-Economic Panel in the 1980s included domain satisfaction measures and that these measures have been included in every wave from 1984 to 2017. This makes it possible to test the integrative top-down bottom-up model with the SOEP data.

The five domains that have been included in all surveys are health, household income, recreation, housing, and job satisfaction. However, job satisfaction is only available for participants who are employed. To maximize the number of domains, I used all five domains and limited the analysis to working participants. The same approach can be used to build a model with four domains for all participants.
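To illustrate the core idea in code, here is a deliberately simplified, single-wave sketch in Python with the semopy package. The variable names are hypothetical SOEP-style labels; the actual analysis was run in MPLUS on three waves, corrected the single-item indicators for unreliability, and also let the domain-specific parts predict life satisfaction, none of which is shown here.

```python
import pandas as pd
import semopy

# Single-wave sketch: dispositional satisfaction is defined by the shared
# variance among the five domain satisfaction ratings; the global judgment
# is then regressed on that factor.
model_desc = """
DisSat   =~ job_sat + health_sat + fin_sat + housing_sat + leisure_sat
life_sat  ~ DisSat
"""

data = pd.read_csv("soep_domain_satisfaction.csv")   # hypothetical extract with the six ratings
model = semopy.Model(model_desc)
model.fit(data)
print(model.inspect())   # loadings show how strongly each domain reflects the disposition
```

The full integrative model additionally routes the residual (domain-specific) variance of each satisfaction rating to life satisfaction, which is what separates the disposition estimates in Table 2 from the unique domain contributions in Table 3 below.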

One limitation of the SOEP is the use of single-item indicators. This makes sense for expensive panel studies, but creates some psychometric problems. Fortunately, it is possible to estimate the reliability of single-item indicators in panel data by using Heise’s (1969) model which estimates reliability based on the pattern of retest correlations for three waves of data.

REL = r12 * r23 / r13
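A small sketch of this estimator in plain Python (the three retest correlations below are made-up inputs, not SOEP values) also shows how the disattenuated wave-1 to wave-3 stability is recovered once reliability is taken into account; the formula assumes that reliability is the same at every wave.

```python
def heise(r12, r23, r13):
    """Heise (1969): split three-wave retest correlations into reliability and
    stability, assuming equal reliability at each wave."""
    reliability = r12 * r23 / r13
    stability_13 = r13 / reliability   # disattenuated wave-1 to wave-3 stability
    return reliability, stability_13

# Made-up example for a single-item measure observed at three waves:
rel, stab = heise(r12=0.55, r23=0.55, r13=0.45)
print(round(rel, 2), round(stab, 2))   # 0.67 0.67
```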

More data would be better and are available, but the goal was to combine the well-being model with a model of personality ratings that are available for only three waves (2005, 2009, & 2013). Thus, the same three waves were used to create an integrative top-down bottom-up model that also examines how domain satisfaction is related to global life-satisfaction across time.

The data set consisted of 3 repeated measurements of 5 domain satisfaction judgments and a single life-satisfaction judgment, for a total of 18 variables. The data were analyzed with MPLUS (see OSF for syntax and detailed results: https://osf.io/vpcfd/).

Results

Overall model fit was acceptable, CFI = .988, RMSEA = .023, SRMR = .029.

The first results are the reliability and stability estimates of the five domain satisfactions and global life satisfaction (Table 1). For comparison purposes, the last column shows the estimates based on a panel analysis with annual retests (Schimmack, Krause, Wagner, & Schupp, 2010). The results show fairly consistent stability across domains, with the exception of job satisfaction, which is less stable than the other domains. The four-year stability is high, but not as high as for personality traits (Schimmack, 2019). A comparison with the panel data shows higher stability, which indicates that some of the error variance in 4-year retest studies is reliable variance that fluctuates over the four-year retest period. However, the key finding is that there is high stability in domain satisfaction judgments and life-satisfaction judgments, which makes it theoretically interesting to examine the relationship between the stable variances in domain satisfaction and life-satisfaction.

Table 1. Reliability and Stability Estimates

                          Reliability   Stability   1Y-Stability   Panel
Job Satisfaction             0.62          0.62         0.89
Health Satisfaction          0.67          0.79         0.94        0.93
Financial Satisfaction       0.74          0.81         0.95        0.91
Housing Satisfaction         0.66          0.81         0.95        0.89
Leisure Satisfaction         0.67          0.80         0.95        0.92
Life Satisfaction            0.66          0.78         0.94        0.89

Table 2 examines the influence of top-down processes on domain satisfaction. It shows the factor loadings of the domain satisfactions on a common factor that reflects dispositional satisfaction; that is, a general disposition to report higher levels of satisfaction. The results show that somewhere between 30% and 50% of the reliable variance in domain satisfaction judgments is explained by a general disposition factor. While this leaves ample room for domain-specific factors to influence domain satisfaction judgments, the results show a strong top-down influence.

Table 2. Loadings of the Domain Satisfactions on the Dispositional Satisfaction Factor

                          T1      T2      T3
Job Satisfaction         0.69    0.68    0.68
Health Satisfaction      0.68    0.66    0.65
Financial Satisfaction   0.60    0.61    0.63
Housing Satisfaction     0.72    0.74    0.76
Leisure Satisfaction     0.61    0.61    0.61

Table 3 shows the unique contribution of the disposition and the five domains to life-satisfaction concurrently and longitudinally.

Table 3. Unique Contributions of the Disposition and the Five Domains to Life-Satisfaction

             DS1-LS1   DS1-LS2   DS1-LS3   DS2-LS2   DS2-LS3   DS3-LS3
Disposition    0.56      0.59      0.57      0.61      0.59      0.60
Job            0.14      0.10      0.05      0.17      0.08      0.12
Health         0.23      0.22      0.21      0.28      0.27      0.33
Finances       0.34      0.20      0.14      0.24      0.18      0.22
Housing        0.04      0.03      0.03      0.04      0.04      0.06
Leisure        0.06      0.10      0.06      0.13      0.07      0.09

The first notable finding is that the disposition factor accounts for the lion’s share of the explained variance in life-satisfaction judgments. The second important finding is that this relationship is very stable over time. The disposition measured at time 1 is an equally good predictor of life-satisfaction at time 1 (r = .56), time 2 (r = .59), and time 3 (r = .57). This suggests that about one-third of the reliable variance in life-satisfaction judgments reflects a stable disposition to report higher or lower levels of satisfaction.

Regarding domain satisfaction, health is the strongest predictor, with correlations between .21 and .33. Finances is the second strongest predictor, with correlations between .14 and .34. For health satisfaction there is high stability over time: time 1 health satisfaction predicts time 3 life-satisfaction (r = .21) nearly as well as time 1 life-satisfaction (r = .23). In contrast, financial satisfaction shows a bit more change over time, with a concurrent correlation at time 1 of r = .34 that drops to r = .14 for life-satisfaction at time 3. This suggests that changes in financial satisfaction produce changes in life-satisfaction.

Job satisfaction has a weak influence on life-satisfaction with correlations ranging from r = .14 to .05. Like financial satisfaction, there is some evidence that changes in job satisfaction predict changes in life-satisfaction.

Housing and leisure have hardly any influence on life-satisfaction judgments, with most relationships being less than .10. There is also no evidence that changes in these domains produce changes in life-satisfaction judgments.

These results show that most of the reliable variance in global life-satisfaction judgments remains unexplained and that a stable disposition accounts for most of the explained variance in life-satisfaction judgments.

Implications for the Validity of Life-Satisfaction Judgments

There are two ways to interpret the results. One interpretation, which is common in the well-being literature and in hundreds of studies with the SOEP data, is that life-satisfaction judgments are valid measures of well-being. Accordingly, well-being in Germany is determined mostly by a stable disposition to be satisfied, and changing actual life circumstances will have negligible effects on well-being. For example, Nakazato et al. (2011) used the SOEP data to examine the influence of moving on well-being. They found that decreasing housing satisfaction triggered a decision to move and that moving produced lasting increases in housing satisfaction. However, moving had no effect on life-satisfaction. This is not surprising given the present result that housing satisfaction has a negligible influence on life-satisfaction judgments. Thus, if we assume that life-satisfaction judgments are a perfectly valid measure of well-being, we would have to conclude that people are irrational when they invest money in a better house.

The alternative interpretation is that life-satisfaction judgments are not as good as well-being researchers think they are. Rather than reflecting a weighted summary of all important aspects of life, they are based on accessible information that does not include all relevant information. The difference from Schwarz and Strack’s (1999) criticism is that the bias is not due to temporarily accessible information (e.g., mood) that makes life-satisfaction judgments unreliable. As demonstrated here and elsewhere, a large portion of the variance in life-satisfaction judgments is stable. The problem is that the stable factors may be biases in life-satisfaction ratings rather than real determinants of well-being.

It is unfortunate that psychologists and other social scientists have neglected proper validation research for a measure that has been used to make major empirical claims about the determinants of well-being, and that this research has been used to make policy recommendations (Diener, Lucas, Schimmack, & Helliwell, 2009). The present results suggest that any policy recommendations based on life-satisfaction ratings alone are premature. It is time to take measurement more seriously and to improve the validity of well-being measures.

Measuring Personality in the SOEP

The German Socio-Economic-Panel (SOEP) is a longitudinal study of German households. The core questions address economic issues, work, health, and well-being. However, additional questions are sometimes added. In 2005, the SOEP included a 15-item measure of the Big Five; the so-called BFI-S (Lang et al., 2011). As each personality dimension is measured with only three items, scale scores are rather unreliable measures of the Big Five. A superior way to examine personality in the SOEP is to build a measurement model that relates observed item scores to latent factors that represent the Big Five.

Anusic et al. (2009) proposed a latent variable model for an English version of the BFI-S.

The most important feature of this model is the modeling of method factors in personality ratings. An acquiescence factor accounts for general response tendencies independent of item content. In addition, a halo factor accounts for evaluative bias that inflates correlations between two desirable or two undesirable items and attenuates correlations between a desirable and an undesirable item. The figure shows that the halo factor reflects bias because it correlates highly with evaluative bias in ratings of intelligence and attractiveness.

The model also includes a higher-order factor that accounts for a correlation between extraversion and openness.

Since the article was published, I have modified the model in two ways. First, the Big Five are conceptualized as fully independent, which is in accordance with the original theory. Rather than allowing for correlations among the Big Five factors, secondary loadings are used to allow for relationships between extraversion and openness items. Second, halo bias is modeled as a characteristic of individual items rather than of the Big Five factors. This approach is preferable because some items have low loadings on the halo factor.

Figure 2 shows the new model.

I fitted this model to the 2005 data using MPLUS (syntax and output: https://osf.io/vpcfd/ ). The model had acceptable fit to the data, CFI = .962, RMSEA = .035, SRMR = .029.

Table 1 shows the factor loadings. It also shows the correlation of the sum scores with the latent factors.

Item#NEOACEVBACQ
Neuroticism
worried50.49-0.020.19
nervous100.64-0.310.18
relaxed15-0.550.350.21
SUM0.750.000.000.000.00-0.300.09
Extraversion
talkative20.600.130.400.23
sociable80.640.370.22
reserved12-0.520.20-0.110.19
SUM0.000.750.00-0.100.050.360.09
Openess
original40.260.41-0.330.380.22
artistic90.150.360.290.17
imaginative140.300.550.220.21
SUM0.000.300.57-0.130.000.390.26
Agreeableness
rude30.12-0.51-0.320.19
forgiving60.230.320.24
considerate130.490.480.29
SUM0.00-0.070.000.580.000.500.11
Conscientiousness
thorough10.710.350.30
lazy7-0.16-0.41-0.350.20
efficient110.390.480.28
SUM0.000.000.000.090.640.510.11

The results show that all items load on their primary factor although some loadings are very small (e.g., forgiving). Secondary loadings tend to be small (< .2), although they are highly significant in the large sample. All items load on the evaluative bias factor, with some fairly large loadings for considerate, efficient, and talkative. Reserved is the most evaluatively neutral item. Acquiescence bias is rather weak.

The scale scores are most strongly related to the intended latent factor. The relationship is fairly strong for neuroticism and extraversion, suggesting that about 50% of the variance in scale scores reflects the intended construct. However, for the other three dimensions, correlations suggest that less than 50% of the variance reflects the intended construct. Moreover, the remaining variance is not just random measurement error. Evaluative bias contributes from 10% up to 25% of additional variance. Acquiescence bias plays a minor role because most scales have a reverse scored item. Openness is an exception and acquiescence bias contributes 10% of the variance in scores on the Openness scale.

Given the good fit of this model, I recommend it for studies that want to examine correlates of the Big Five or that want to compare groups. Using this model will produce better estimates of effect sizes and control for spurious relationships due to method factors.