The first five parts of this series built a model that related personality traits to well-being. Part 6 added sex (male/female) to the model. It may not come as a surprise that Part 7 adds age, because sex and age are two of the most commonly measured demographic variables.
Age and Well-Being
Diener et al.’s (1999) review article pointed out that early views of old age as a period of poor health and misery were not supported by empirical studies. Since then, some studies with nationally representative samples have found a U-shaped relationship between age and well-being: well-being decreases from young adulthood to middle age, increases again into old age, and then declines at the end of life. Thus, there is some evidence for a midlife crisis (Blanchflower, 2021).
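The U-shaped pattern is easy to illustrate with a toy quadratic model. The sketch below uses made-up coefficients (not estimates from any study) just to show how a midlife low falls out of the algebra.

```python
# Toy quadratic model of the age/well-being U-shape:
#   wellbeing(age) = a + b*age + c*age**2
# With b < 0 and c > 0, well-being first falls and then rises,
# reaching its minimum (the "midlife low") at age = -b / (2*c).

a, b, c = 8.0, -0.20, 0.002  # hypothetical illustrative coefficients

def wellbeing(age):
    return a + b * age + c * age ** 2

nadir = -b / (2 * c)  # age at the bottom of the U
print(round(nadir))   # midlife low around age 50 for these coefficients
```

The end-of-life decline mentioned above is not captured by a pure quadratic; it would require an additional term or a piecewise model.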
The present dataset cannot examine the full U-shaped pattern because the data come from students and their parents, but the U-shaped pattern predicts that students should have higher well-being than their middle-aged parents.
McAdams, Lucas, and Donnellan (2012) found that the relationship between age and life-satisfaction was explained by effects of age on life-domains. According to their findings in a British sample, health satisfaction decreased with age, but housing satisfaction increased with age. The average trend across domains mirrored the pattern for life-satisfaction judgments.
Based on these findings, I expected age to be a negative predictor of life-satisfaction and this negative relationship to be mediated by domain satisfaction. To test this prediction, I added age as a predictor variable. Like sex, age is an exogenous variable: age can influence personality and well-being, but personality cannot influence (biological) age. Although age was added as a predictor for all factors in the model, overall model fit decreased, chi2(1478) = 2198, CFI = .973, RMSEA = .019. This can happen when a new variable is also related to the unique variances of indicators. Inspection of the modification indices showed additional relationships with self-ratings, suggesting that older respondents have a positive bias in their self-ratings. To allow for this possibility, I allowed all self-ratings to be influenced by age. This modification substantially increased model fit, chi2(1462) = 1970, CFI = .981, RMSEA = .016. I will further examine this positivity bias in the next model. Here I focus on the findings for age and well-being.
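The improvement in fit can be checked with a chi-square difference test on the two nested models reported above (Δχ² = 2198 − 1970 = 228 on 1478 − 1462 = 16 degrees of freedom). A minimal stdlib sketch, using the closed-form chi-square survival function that exists for even degrees of freedom:

```python
import math

# Chi-square difference (likelihood-ratio) test for two nested models.
# For an even number of degrees of freedom df = 2m, the chi-square
# survival function has a closed form:
#   P(X > x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!

def chi2_sf_even_df(x, df):
    assert df % 2 == 0
    half = x / 2.0
    term, total = 1.0, 0.0
    for k in range(df // 2):
        if k > 0:
            term *= half / k  # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

delta_chi2 = 2198 - 1970   # 228
delta_df = 1478 - 1462     # 16
p = chi2_sf_even_df(delta_chi2, delta_df)
print(p < .001)  # the fit improvement is highly significant
```

This assumes the usual conditions for chi-square difference tests (nested models, maximum-likelihood estimation); with robust estimators a scaled difference test would be needed instead.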
As expected, age was a negative predictor of life-satisfaction, b = -.21, se = .04, Z = 5.5. This effect was fully mediated: the direct effect of age on life-satisfaction was close to zero and not significant, b = -.01, se = .04, Z = 0.34. Age also had no direct effect on positive affect (happy), b = .00, se = .00, Z = 0.44, and only a small effect on negative affect (sadness), b = -.03, se = .01, Z = 2.5. Moreover, the sign of this relationship shows lower levels of sadness in middle age, which cannot explain the lower level of life-satisfaction. In contrast, age was a negative predictor of average domain satisfaction (DSX), with an effect size close to that for life-satisfaction, b = -.20, se = .05, Z = 4.1. This result replicates McAdams et al.’s (2012) finding that domain satisfaction mediates the effect of age on life-satisfaction.
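The full-mediation claim follows simple path-tracing arithmetic: the total effect equals the direct effect plus the sum of the indirect effects. A minimal check with the coefficients reported above, treating the domain-satisfaction pathway as the single indirect route:

```python
# Path-tracing check: total effect = direct effect + sum of indirect effects.
# Coefficients are the standardized estimates reported above; average
# domain satisfaction (DSX) carries essentially the entire indirect effect.
direct_effect = -0.01         # age -> life-satisfaction, controlling for mediators
indirect_via_domains = -0.20  # age -> DSX -> life-satisfaction

total_effect = direct_effect + indirect_via_domains
print(round(total_effect, 2))  # -0.21, the reported total effect of age
```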
However, the monster model shows that domain satisfaction is influenced by personality traits. Thus, it is possible that some of the age effects on domain satisfaction reflect not only objective aspects of the domains but also top-down effects of personality traits. To examine this, I traced the indirect effects of age on average domain satisfaction.
Age was a notable negative predictor of cheerfulness, b = -.29, se = .04, Z = 7.5. This effect was partially mediated by extraversion, b = -.07, se = .02, Z = 3.5, and agreeableness, b = -.08, se = .02, Z = 4.5, while some of the effect was direct, b = -.14, se = .03, Z = 4.4. There was no statistically significant effect of age on depressiveness, b = .07, se = .04, Z = 1.9.
Age also had direct relationships with some life domains. Age was a positive predictor of romantic satisfaction, b = .36, se = .04, Z = 8.2. Another strong relationship emerged for health satisfaction, b = -.36, se = .04, Z = 8.4. A further negative relationship was observed for work, b = -.26, se = .04, Z = 6.4, reflecting the difference between studying and working. Age was also a negative predictor of housing satisfaction, b = -.10, se = .04, Z = 2.8, recreation satisfaction, b = -.15, se = .05, Z = 3.4, financial satisfaction, b = -.10, se = .05, Z = 2.1, and friendship satisfaction, b = -.09, se = .04, Z = 2.1. In short, age was a negative predictor of satisfaction with all life domains except romantic relationships, even after controlling for the effects of age on cheerfulness.
Aside from romantic satisfaction, the only positive effect of age was an increase in conscientiousness, b = .15, se = .04, Z = 3.7, which is consistent with the personality literature (Roberts, Walton, & Viechtbauer, 2006). However, the resulting indirect positive effect on life-satisfaction is small, b = .04.
In conclusion, the present results replicate the finding that well-being decreases from young adulthood to middle age. The effect is mainly explained by a decrease in cheerfulness and by decreasing satisfaction with a broad range of life domains. The only exception was a positive effect on romantic satisfaction. These results have to be interpreted in the context of the specific sample. Younger participants were students, and it is possible that young adults who have already joined the workforce have lower well-being than students. The higher romantic satisfaction of parents may also be due to the recruitment of parents who remained married with children; single and divorced middle-aged individuals show lower life-satisfaction. The fact that the age effects were fully mediated shows that studies of age and well-being can benefit from the inclusion of personality measures and measures of domain satisfaction (McAdams et al., 2012).
The first five parts of this series built a model that related the Big Five personality traits as well as the depressiveness facet of neuroticism and the cheerfulness facet of extraversion to well-being. In this model, well-being is conceptualized as a weighted average of satisfaction with life domains and experiences of happiness and sadness (Part 5).
Part 6 adds sex/gender to the model. Although gender is a complex construct, most individuals identify as either male or female. As sex is frequently assessed as a demographic characteristic, the simple correlations of sex with personality and well-being are fairly well known and were reviewed by Diener et al. (1999).
A somewhat surprising finding is that life-satisfaction judgments show hardly any sex differences. Diener et al. (1999) point out that this finding seems to be inconsistent with findings that women report higher levels of neuroticism (neuroticism is a technical term for a disposition to experience more negative affects and does not imply a mental illness), negative affect, and depression. Accordingly, gender could have a negative effect on well-being that is mediated by neuroticism and depressiveness. To explain the lack of a sex difference in well-being, Diener et al. proposed that women also experience more positive emotions. Another possible mediator is agreeableness. Women consistently score higher in agreeableness and agreeableness is a positive predictor of well-being. Part 5 showed that most of the positive effect of agreeableness was mediated by cheerfulness. Thus, agreeableness may partially explain higher levels of cheerfulness for women. To my knowledge, these mediation hypotheses have never been formally tested in a causal model.
Adding sex to the monster model is relatively straightforward because sex is an exogenous variable. That is, causal paths can originate from sex, but no causal path can point at sex; after all, sex is determined by the genetic lottery at the moment of conception. It is therefore possible to add sex as a cause of all factors in the model. Despite adding all of these causal pathways, model fit decreased a bit, chi2(1432) = 2068, CFI = .976, RMSEA = .018. The main reason for the reduced fit is likely that sex predicts some of the unique variances of individual indicators. Inspection of modification indices showed that sex was related to higher student self-ratings of neuroticism and to lower ratings of neuroticism by mothers as informants. While freeing either parameter improved model fit, the effects on the sex difference in neuroticism were opposite. Assuming (!) that mothers underestimate neuroticism increased the sex difference in neuroticism from d = .69, se = .07 to d = .81, se = .07. Assuming that students overestimate neuroticism resulted in a smaller sex difference of d = .54, se = .08. Thus, the results suggest that the sex difference in neuroticism is moderate to large (d = .5 to .8), but there is uncertainty due to some biases in ratings of neuroticism. A model that allowed for both biases had even better fit and produced the compromise effect size estimate of d = .67, se = .08. Overall fit was now only slightly lower than for the model without sex, chi2(1430) = 2024, CFI = .978, RMSEA = .017. Figure 2 shows the theoretically significant direct effects of sex with effect sizes in units of standard deviations (Cohen’s d).
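For readers unfamiliar with the metric, Cohen's d is the mean difference expressed in pooled standard-deviation units. A minimal sketch (the score lists below are made-up illustrative values, not study data):

```python
import statistics

# Cohen's d: standardized mean difference in pooled-SD units.
# The two score lists are made-up illustrative data, not study values.

def cohens_d(group1, group2):
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)  # sample variances
    n1, n2 = len(group1), len(group2)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

women = [3.2, 3.8, 4.1, 3.5, 3.9]  # hypothetical neuroticism scores
men = [2.9, 3.1, 3.6, 2.8, 3.4]
print(round(cohens_d(women, men), 2))
```

In the SEM context above, d values are computed on latent factors rather than raw scale scores, which is what allows the rating-bias corrections described in the text.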
The model not only replicated sex differences in neuroticism. It also replicated sex differences in agreeableness, although the effect size was small, d = .29, se = .08, Z = 3.7. Not expected was the finding that women also scored higher in extraversion, d = .38, se = .07, Z = 5.6, and conscientiousness, d = .36, se = .07, Z = 5.0. The only life domain with a notable sex difference was romantic relationships, d = -.41, se = .08, Z = 5.4. The only other statistically significant difference was found for recreation, d = -.19, se = .08, Z = 2.4. Thus, life domains do not contribute substantially to sex differences in well-being. Even the sex difference for romantic satisfaction is not consistently found in studies of marital satisfaction.
The model indirect results replicated the finding that there are no notable sex differences in life-satisfaction, total effect d = -.07, se = .06, Z = 1.1. Thus, tracing the paths from sex to life-satisfaction provides valuable insights into the paradox that women tend to have higher levels of neuroticism, but not lower life-satisfaction.
Consistent with prior studies, women had higher levels of depressiveness, and the effect size was small, d = .24, se = .08, Z = 3.0. The direct effect was not significant, d = .06, se = .08, Z = 0.8. The only positive indirect effect was mediated by neuroticism, d = .42, se = .06, Z = 7.4. The other indirect effects reduced the effect of sex on depressiveness: women’s higher conscientiousness (in this sample) reduced depressiveness, d = -.14, as did women’s higher agreeableness, d = -.06, se = .02, Z = 2.7, and women’s higher extraversion, d = -.04, se = .02, Z = 2.4. These results show the problem of focusing on neuroticism as a predictor of well-being. While neuroticism shows a moderate to strong sex difference, it is not a strong predictor of well-being. In contrast, depressiveness is a stronger predictor of well-being but has a relatively small sex difference. This small sex difference partially explains why women can have higher levels of neuroticism without lower levels of well-being. Men and women are nearly equally disposed to suffer from depression. Consistent with this finding, men are actually more likely to commit suicide than women.
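As a consistency check, the reported direct and indirect effects should sum to the reported total effect of d = .24:

```python
# The reported indirect effects of sex on depressiveness, plus the
# direct effect, should reproduce the reported total effect of d = .24.
indirect = {
    "neuroticism": 0.42,
    "conscientiousness": -0.14,
    "agreeableness": -0.06,
    "extraversion": -0.04,
}
direct = 0.06  # not significant, but still part of the decomposition

total = direct + sum(indirect.values())
print(round(total, 2))  # 0.24, matching the reported total effect
```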
Consistent with Diener et al.’s (1999) hypothesis, cheerfulness also showed a positive relationship with sex. The total effect size was larger than for depressiveness, d = .50, se = .07, Z = 7.2. The total effect was partially explained by a direct effect of sex on cheerfulness, d = .20, se = .06, Z = 3.6. Indirect effects were mediated by extraversion, d = .27, se = .05, Z = 5.8, agreeableness d = .11, se = .03, Z = 3.6, and conscientiousness, d = .05, se = .02, Z = 3.2. However, neuroticism reduced the effect size by d = -.12, se = .03, Z = 4.4.
The effects of gender on depressiveness and cheerfulness produced corresponding differences in experiences of NA (sadness) and PA (happiness), without additional direct effects of gender on the sadness or happiness factors. The effect on happiness was a bit stronger, d = .35, se = .08, Z = 4.6, than the effect on sadness, d = .28, se = .07, Z = 4.1.
In conclusion, the results provide empirical support for Diener et al.’s hypothesis that sex differences in well-being are small because women have higher levels of positive affect and negative affect. The relatively large difference in neuroticism is also deceptive because neuroticism is not a direct predictor of well-being and gender differences in depressiveness are weaker than gender differences in neuroticism or anxiety. In the present sample, women also benefited from higher levels of agreeableness and conscientiousness that are linked to higher cheerfulness and lower depressiveness.
The present study also addresses concerns that self-report biases may distort gender differences in measures of affect and well-being (Diener et al., 1999). In the present study, well-being of mothers and fathers was not just measured by their self-reports, but also by students’ reports of their parents’ well-being. I have also asked students in my well-being course whether their mother or father has higher life-satisfaction. The answers show pretty much a 50:50 split. Thus, at least subjective well-being does not appear to differ substantially between men and women. This blog post showed a theoretical model that explains why men and women have similar levels of well-being.
This is Part 5 of the blog series on the monster model of well-being. The first parts developed a model of well-being that related life-satisfaction judgments to affect and domain satisfaction. I then added the Big Five personality traits to the model (Part 4). The model replicated the key finding that neuroticism has the strongest relationship with life-satisfaction, b ~ .3. It also showed notable relationships with extraversion, agreeableness, and conscientiousness. The relationship with openness was practically zero. The key novel contribution of the monster model is to trace the effects of the Big Five personality traits on well-being. The results showed that neuroticism, extraversion, and agreeableness had broad effects on various life domains (top-down effects) that mediated their effects on global life-satisfaction (bottom-up effects). In contrast, conscientiousness was instrumental for only a few life domains.
The main goal of Part 5 is to examine the influence of personality traits at the level of personality facets. Various models of personality assume a hierarchy of traits. While there is considerable disagreement about the number of levels and the number of traits at each level, most models share a basic level of traits that correspond to trait terms in everyday language (talkative, helpful, reliable, creative) and a higher-order level that represents covariation among basic traits. In the Five-Factor Model, the Big Five are five independent higher-order traits. Costa and McCrae’s influential model recognizes six basic-level traits, called facets, for each of the Big Five. Relatively few studies have conducted a comprehensive examination of personality and well-being at the facet level (Schimmack, Oishi, Furr, & Funder, 2004). A key finding was that the depressiveness facet of neuroticism was the only neuroticism facet with unique variance in the prediction of life-satisfaction. Similarly, the cheerfulness facet of extraversion was the only extraversion facet that predicted unique variance in life-satisfaction. For this reason, the Mississauga family study included measures of these two facets in addition to the Big Five items.
In Part 5, I add these two facets to the monster model of well-being. Consistent with Big Five theory, I allowed for causal effects of Extraversion on Cheerfulness and of Neuroticism on Depressiveness. Strict hierarchical models assume that each facet is related to only one broad factor. In reality, however, basic-level traits can be related to multiple higher-order factors, but little attention has been paid to secondary loadings of the depressiveness and cheerfulness facets on the other Big Five factors. In one study that controlled for evaluative bias, I found that depressiveness had a negative loading on conscientiousness (Schimmack, 2019). This relationship was confirmed in this dataset. However, additional relations improved model fit: cheerfulness was related to lower neuroticism and higher agreeableness, and depressiveness was related to lower extraversion and agreeableness. Some of these relations were weak and might be spurious due to the use of short three-item scales to measure the Big Five.
The monster model combines two previous mediation models that link the Big Five personality traits to well-being. Schimmack, Diener, and Oishi (2002) proposed that affective experiences mediate the effects of extraversion and neuroticism. Schimmack, Oishi, Furr, and Funder (2004) suggested that the Depressiveness and Cheerfulness facets mediate the effects of Extraversion and Neuroticism. The monster model proposes that extraversion’s effect is mediated by trait cheerfulness which influences positive experiences, whereas neuroticism’s effect is mediated by trait depressiveness which in turn influences experiences of sadness.
When this model was fitted to the data, depressiveness and cheerfulness fully mediated the effects of extraversion and neuroticism. However, extraversion became a negative predictor of well-being. While it is possible that the unique aspects of extraversion that are not shared with cheerfulness have a negative effect on well-being, there is little evidence for such a negative relationship in the literature. Another possible explanation for this finding is that cheerfulness and positive affect (happy) share some method variance that inflates the correlation between these two factors. As a result, the indirect effect of extraversion is overestimated. When this shared method variance is fixed to zero and extraversion is allowed to have a direct effect, SEM uses the free parameter to compensate for the overestimation of the indirect path. The ability to model shared method variance is one of the advantages of SEM over mediation tests that rely on manifest variables and assume perfect measurement of constructs. Figure 1 shows the correlation between measures of trait PA (cheerfulness) and experienced PA (happy) as a curved arrow. A similar shared method effect was allowed for depressiveness and experienced sadness (sad), although it turned out not to be significant.
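For readers who prefer to see the structure rather than prose, the core of this facet model can be sketched in lavaan-style model syntax (here just a Python string; the variable names are my shorthand, not the study's actual labels, and the sketch omits the measurement model and the domain satisfactions):

```python
# Sketch of the structural core of the facet model in lavaan-style
# syntax. "~" denotes a regression path and "~~" a residual covariance,
# used here for the shared method variance between trait cheerfulness
# and experienced happiness (the curved arrow in Figure 1).
# All variable names are illustrative shorthand.

model_desc = """
  cheerfulness ~ extraversion + neuroticism + agreeableness
  depressiveness ~ neuroticism + conscientiousness + extraversion + agreeableness
  happy ~ cheerfulness
  sad ~ depressiveness
  cheerfulness ~~ happy
"""

# An SEM package (e.g., lavaan in R, or semopy in Python) could in
# principle fit such a description to data; fitting is not shown here.
print("~~" in model_desc)
```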
Exploratory analysis showed that cheerfulness and depressiveness did not fully mediate all effects on well-being. Extraversion, agreeableness, and conscientiousness had additional direct relationships with some life domains that contribute to well-being. The final model retained good overall fit, and modification indices did not show notable additional relationships for the added constructs, chi2(1387) = 1914, CFI = .980, RMSEA = .017.
The standardized model indirect effects were used to quantify the effect of the facets on well-being and to quantify indirect and direct effects of the Big Five on well-being. The total effect of Depressiveness was b = -.47, Z = 8.8. About one-third of this effect was directly mediated by sadness, b = -.19. Follow-up research needs to examine how much of this relationship might be explained by risk factors for mood disorders as compared to normal levels of depressive moods. Valuable new insights can emerge from integrating the extensive literature on depression and life-satisfaction. The remaining effects were mediated by top-down effects of depressiveness on domain satisfactions (Payne & Schimmack, 2020). The present results show that it is important to control for these top-down effects in studies that examine the bottom-up effects of life domains on life-satisfaction.
The total effect of cheerfulness was as large as the effect of depressiveness, b = .44, Z = 6.6. In contrast to depressiveness, the indirect effect through happiness was weak, b = .02, Z = 0.6, because happy did not make a significant unique contribution to life-satisfaction. Thus, virtually all of the effect was mediated by domain satisfaction.
In sum, the results for depressiveness and cheerfulness are consistent with integrated bottom-up-top-down models that postulate top-down effects of affective dispositions on domain satisfaction and bottom-up effects from domain satisfaction to life-satisfaction. The results are only partially consistent with models that assume affective experiences mediate the effect (Schimmack, Diener, & Oishi, 2002).
The effect of neuroticism on well-being, b = -.36, Z = 10.7, was fully mediated by depressiveness, b = -.28, and cheerfulness, b = -.08. Causality is implied by the assumption, made in hierarchical models of personality traits, that neuroticism is a common cause of specific dispositions for anger, anxiety, depressiveness, and other negative affects. If this assumption were false, neuroticism would only be a correlate of well-being, and it would be even more critical to focus on depressiveness as the more important personality trait related to well-being. Thus, future research on personality and well-being needs to pay more attention to the depressiveness facet of neuroticism. Too many short neuroticism measures focus exclusively or predominantly on anxiety.
Following Costa and McCrae (1980), extraversion has often been considered a second important personality trait that influences well-being. However, quantitatively the effect of extraversion on well-being is relatively small, especially in studies that control for shared method variance. The effect size in this sample was b = .12, a small effect, and much smaller than the effect of its cheerfulness facet. The weak effect was the combination of a moderate positive effect mediated by cheerfulness, b = .32, and a negative effect mediated by direct effects of extraversion on domain satisfactions, b = -.23. These results show how important it is to examine the relationship between extraversion and well-being at the facet level. Whereas cheerfulness explains why extraversion has positive effects on well-being, the relationships of other facets with well-being require further investigation. The present results make clear that a simple reason for positive relationships between extraversion and well-being is the cheerfulness facet. The finding that individuals with a cheerful disposition evaluate their lives more positively may not be surprising, or may even appear trivial, but it would be a mistake to omit cheerfulness from a causal theory of well-being. Future research needs to uncover the determinants of individual differences in cheerfulness.
Agreeableness had a moderate effect on well-being, b = .21, Z = 5.8. Importantly, the positive effect of agreeableness was fully mediated by cheerfulness, b = .17, and depressiveness, b = .09, with a small negative direct effect on domain satisfactions, b = -.05, which was due to lower work satisfaction for individuals high in agreeableness. These results replicate Schimmack et al.’s (2004) finding that agreeableness was not a unique predictor of life-satisfaction when cheerfulness and depressiveness were added to the model. This finding has important implications for theories of well-being that posit a relationship between morality, empathy, prosociality, and well-being. The present results do not support this interpretation of the relationship between agreeableness and well-being. The results also show the importance of taking second-order relationships more seriously. Hierarchical models consider agreeableness to be unrelated to cheerfulness and depressiveness, but simple hierarchical models do not fit actual data. Finally, it is important to examine the causal relationship between agreeableness and the affective facets. It is possible that cheerfulness influences agreeableness rather than agreeableness influencing cheerfulness. In this case, agreeableness would be a predictor but not a cause of higher well-being. However, it is also possible that an agreeable disposition contributes to a cheerful disposition because agreeable people may be more easily satisfied with reality. In any case, future studies of agreeableness, related traits, and well-being need to take potential relationships with cheerfulness and depressiveness into account.
Conscientiousness also has a moderate effect on well-being, b = .19, Z = 5.9. A large portion of this effect is mediated by the Depressiveness facet of Neuroticism, b = .15. Although a potential link between Conscientiousness and Depressiveness is often omitted from hierarchical models of personality, neuropsychological research is consistent with the idea that conscientiousness may help to regulate negative affective experiences. Thus, this relationship deserves more attention in future research. If causality were reversed, conscientiousness would have only a trivial causal effect on well-being.
In short, adding the cheerfulness and depressiveness facets to the model provided several new insights. First, the results replicated prior findings that these two facets are strong predictors of well-being. Second, the results showed that the Big Five are only weak unique predictors of well-being when their relationships with cheerfulness and depressiveness are taken into account. Omitting these important predictors from theories of well-being is a major problem of studies that focus on personality traits at the Big Five level. It also makes theoretical sense that cheerfulness and depressiveness are related to well-being: these traits influence the emotional evaluation of people’s lives. Thus, even when objective life circumstances are the same, a cheerful individual is likely to look on the bright side and see their life through rose-colored glasses, whereas depressiveness is likely to color life evaluations negatively. Longitudinal studies confirm that depressive symptoms, positive affect, and negative affect are influenced by stable traits (Anusic & Schimmack, 2016; Desai et al., 2012). Furthermore, twin studies show that shared genes contribute to the correlation between life-satisfaction judgments and depressive symptoms (Nes et al., 2013). Future research needs to examine the biopsychosocial factors that cause stable variation in dispositional cheerfulness and depressiveness and thereby contribute to individual differences in well-being.
This is Part 4 in a mini-series of blogs that illustrate the usefulness of structural equation modeling for testing causal models of well-being. The first causal model of well-being was introduced by Costa and McCrae in 1980. Although hundreds of studies have examined correlates of well-being since then, hardly any progress has been made in theory development. Diener (1984) distinguished between top-down and bottom-up theories of well-being, but empirical tests of the different models have not settled this issue. The monster model is a first attempt to develop a causal model of well-being that corrects for measurement error and fits empirical data.
The first part (Part 1) introduced the measurement of well-being and the relationship between affect and well-being. The second part added measures of satisfaction with life domains (Part 2). Part 2 ended with the finding that most of the variance in global life-satisfaction judgments is based on evaluations of important life domains. Satisfaction in important life domains also influences the amount of happiness and sadness individuals experience. Positive affect had no direct effect on life-evaluations, whereas sadness had a unique negative effect on life-evaluations that was not mediated by life domains.
Part 3 added extraversion to the model. This was a first step towards a test of Costa and McCrae’s assumption that extraversion has a direct effect on positive affect (happiness) and no effect on negative affect (sadness). Without life domains in the model, the results replicated Costa and McCrae’s (1980) results. Yes, personality psychology has replicable findings. However, when domain satisfactions were added to the model, the story changed. Costa and McCrae (1980) assumed that extraversion increases well-being because it has a direct effect on cheerfulness (positive affect) that adds to well-being. However, in the new model, the effect of extraversion on life-satisfaction was mediated by life domains rather than positive affect. The strongest mediation was found for romantic satisfaction. Extraverts tended to have higher romantic satisfaction and romantic satisfaction contributed significantly to overall life-satisfaction. Other domains like recreation and work are also possible mediators, but the sample size was too small to produce more conclusive evidence.
Part 4 is a simple extension of the model in Part 3, adding the other personality dimensions to the model. I start with neuroticism because it is by far the most consistent and strongest predictor of well-being. Costa and McCrae (1980) assumed that neuroticism is a general disposition to experience more negative affect, without any relation to positive affect. However, most studies show that neuroticism also has a negative relationship with positive affect, although it is not as strong as the relationship with negative affect. Moreover, neuroticism is also related to lower satisfaction in many life domains. Thus, the model simply allowed neuroticism to be a predictor of both affects and all domain satisfactions. The only assumption made by this model is that the negative effect of neuroticism on life-satisfaction is fully mediated by domain satisfaction and affect.
Figure 1 shows the model and the path coefficients for neuroticism. The first important finding is that neuroticism has a strong direct effect on sadness that is independent of satisfaction with various life domains. This finding suggests that neuroticism may have a direct effect on individuals’ mood rather than interacting with situational factors that are unique to individual life domains. The second finding is that neuroticism has sizeable effects on all life domains, ranging from b = -.19 for satisfaction with housing to b = -.31 for satisfaction with friendships.
Following the various paths from neuroticism to life-satisfaction produces a total effect of b = -.38, which confirms the strong negative effect of neuroticism on well-being. About a quarter of this effect is directly mediated by negative affect (sadness), b = -.09. The rest is mediated by the top-down effect of neuroticism on satisfaction with life domains and the bottom-up effect of life domains on global life-evaluations.
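The "about a quarter" figure follows directly from the reported coefficients:

```python
# Proportion of the total neuroticism effect on life-satisfaction that
# is mediated directly by negative affect (sadness), using the
# standardized coefficients reported above.
total_effect = -0.38
via_sadness = -0.09

proportion = via_sadness / total_effect
print(round(proportion, 2))  # ~0.24, i.e. about a quarter
```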
McCrae and Costa (1991) expanded their model to include the other Big Five factors. They proposed that agreeableness has a positive influence on well-being that is mediated by romantic satisfaction (adding Liebe) and that conscientiousness has a positive influence on well-being that is mediated by work satisfaction (adding Arbeit). Although this proposal was made three decades ago, it has never been seriously tested because few studies measure domain satisfaction (but see Heller et al., 2004).
To test these hypotheses, I added conscientiousness and agreeableness to the model. Adding both together was necessary because agreeableness and conscientiousness were correlated, as reflected in a large modification index when the two factors were assumed to be independent. This does not necessarily mean that agreeableness and conscientiousness are correlated factors, an issue that is debated among personality psychologists (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). One problem is that secondary loadings can produce spurious correlations among the scale scores that were used for this model. This could be examined with a more complex item-level model in the future. For now, agreeableness and conscientiousness were allowed to correlate. The results showed no direct effects of conscientiousness on PA, NA, or LS. In contrast, agreeableness was a positive predictor of PA and a negative predictor of NA. Most important are the relationships with domain satisfactions.
Confirming McCrae and Costa’s (1991) prediction, work satisfaction was predicted by conscientiousness, b = .21, z = 3.4. Also confirming McCrae and Costa, romantic satisfaction was predicted by agreeableness, although the effect size was small, b = .13, z = 2.9. Moreover, conscientiousness was an even stronger predictor of romantic satisfaction, b = .28, z = 6.0, confirming the old saying that “marriage is work.” A finding not predicted by McCrae and Costa was that conscientiousness is related to higher housing satisfaction, b = .20, z = 3.7, presumably because conscientious individuals take better care of their homes. The other domains were not significantly related to conscientiousness, |b| < .1.
Also not predicted by McCrae and Costa are additional relationships of agreeableness with other domains such as health, b = .18, z = 3.7, housing, b = .17, z = 2.9, recreation, b = .25, z = 4.0, and friendships, b = .35, z = 5.9. The only domains that were not significantly predicted by agreeableness were financial satisfaction, b = .05, z = 0.8, and work satisfaction, b = .07, z = 1.3. Some of these relationships could reflect benefits for social relationships other than romantic relationships. Thus, the results are broadly consistent with McCrae and Costa’s assumption that agreeableness is beneficial for well-being.
The total effect of agreeableness in this dataset was b = .21, z = 4.34. All of this effect was mediated by indirect paths, but only the path through romantic satisfaction reached statistical significance, b = .03, z = 2.6, presumably because power to detect the other small indirect effects was insufficient.
The total effect of conscientiousness was b = .18, z = 4.14. Three indirect paths were significant, namely work satisfaction, b = .06, z = 3.3, romantic satisfaction, b = .06, z = 4.2, and housing satisfaction, b = .04, z = 2.51.
Overall, these results confirm previous findings that agreeableness and conscientiousness are also positive predictors of well-being and provide initial evidence about potential mediators of these relationships. These results need to be replicated in datasets from other populations.
When openness was added to the model, a modification index suggested a correlation between extraversion and openness, which has been found in several multi-method studies (Anusic et al., 2009; DeYoung, 2006). Thus, the two factors were allowed to correlate. Openness had no direct effects on positive affect, negative affect, or life-satisfaction. Moreover, there were only two weak, just-significant relationships with domain satisfaction, for work, b = .12, z = 2.0, and health, b = .12, z = 2.2. Consistent with meta-analyses, the total effect is negligible, b = .06, z = 1.3. In short, the results are consistent with previous studies and show that openness is not a predictor of higher or lower well-being. To keep the model simple, it is therefore possible to omit openness from the monster model.
At this point, we have built a complex but plausible model that links personality traits to subjective well-being by means of domain satisfaction and affect. However, just because this model is plausible and fits the data does not ensure that it is the right model. An important step in causal modeling is to consider alternative models and to carry out model comparisons. Overall fit is less important than relatively better fit among alternative models.
The previous model assumed that domain satisfaction causes higher levels of PA and lower levels of NA. Accordingly, affect is a summary of the affect generated in different life domains. This assumption is consistent with bottom-up models of well-being. However, a plausible alternative model assumes that affect is largely influenced by internal dispositions, which in turn color our experiences of different life domains. Accordingly, neuroticism may simply be a disposition to be more often in a negative mood, and this negative mood colors perceptions of marital satisfaction, job satisfaction, and so on. Costa and McCrae (1980) proposed that neuroticism and extraversion are global affective dispositions. So, it makes sense to postulate that their influence on domain satisfaction and life satisfaction is mediated by affect. McCrae and Costa (1991) postulated that agreeableness and conscientiousness are not affective dispositions, but rather only instrumental for higher satisfaction in some life domains. Thus, their effects should not be mediated by affect. Consistent with this assumption, conscientiousness showed significant relationships with only some domains, including work satisfaction. However, agreeableness was a positive predictor of all life domains, suggesting that it is also a broad affective disposition. I thus modeled agreeableness as a third global affective disposition (see Figure 2).
The effect sizes for affect on domain satisfaction are shown in Table 1.
A comparison of the fit indices for the top-down and bottom-up models shows that both models meet standard criteria for global model fit (CFI > .95; RMSEA < .06). In addition, the results show no clear superiority of one model over the other. CFI and RMSEA show slightly better fit for the bottom-up model, but the Bayesian Information Criterion favors the more parsimonious top-down model. Thus, the data are unable to distinguish between the two models.
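The logic of the BIC comparison can be illustrated with a common SEM approximation (due to Raftery): BIC = chi2 - df * ln(N), where lower values are better. The chi2, df, and N values below are hypothetical, since they are not reported for these two models here; the sketch only shows how a more parsimonious model (more df) can win on BIC despite a larger chi2.

```python
import math

# Raftery's SEM approximation: BIC = chi2 - df * ln(N); lower is better.
# All numbers below are hypothetical, for illustration only.
def sem_bic(chi2, df, n):
    return chi2 - df * math.log(n)

n = 500
bottom_up = sem_bic(chi2=1970.0, df=1462, n=n)   # better absolute fit
top_down  = sem_bic(chi2=2005.0, df=1475, n=n)   # fewer free parameters

# the parsimony credit (df * ln(N)) can outweigh the worse chi2
print(top_down < bottom_up)
```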
Both models assume that conscientiousness is instrumental for higher well-being in only some domains. The key difference between the models is the assumption of the top-down model that changes in domain satisfaction have no influence on affective experiences. That is, an increase in relationship satisfaction does not produce higher levels of PA, nor does a decrease in job satisfaction produce a change in NA. These competing predictions can be tested in longitudinal studies.
To conclude part 4 of the monster model series: as surprising as it may sound, the present results provide one of the first tests of McCrae and Costa’s causal theory of well-being (Costa & McCrae, 1980; McCrae & Costa, 1991). Although the present results are consistent with their proposal that agreeableness and conscientiousness are instrumental for higher well-being because they foster higher romantic and job satisfaction, respectively, the present results also show that this model is too simplistic. For example, conscientiousness may also increase well-being because it contributes to higher romantic satisfaction (marriage is work).
One limitation of the present model is the focus on the Big Five as a measure of personality traits. The Big Five are higher-order personality traits of more specific personality traits that are often called facets. Facet-level traits may predict additional variance in well-being that is not captured by the Big Five (Schimmack, Oishi, Furr, & Funder, 2004). Part 5 will add the strongest facet predictors to the model, namely the Depressiveness facet of Neuroticism and the Cheerfulness facet of Extraversion (see also Payne & Schimmack, 2020).
Background: A previous blog post shared a conversation between Bill von Hippel and Ulrich Schimmack about Bill’s Replicability Index (Part 1). To recapitulate, I had posted statistical replicability estimates for several hundred social psychologists (Personalized P-Values). Bill’s scores suggested that many of his results with p-values just below .05 might not be replicable. Bill was dismayed by his low R-Index but thought that some of his papers with very low values might be more replicable than the R-Index would indicate. He suggested that we put the R-Index results to an empirical test. He chose his paper with the weakest statistical evidence (interaction p = .07) for a replication study. We jointly agreed on the replication design and sample size. In just three weeks the study was approved, conducted, and the results were analyzed. Here we discuss the results.
… Three Weeks Later
Bill: Thanks to rapid turnaround at our university IRB, the convenience of modern data collection, and the programming skills of Sam Pearson, we have now completed our replication study on Prolific. We posted the study for 2,000 participants, and 2,031 people signed up. For readers who are interested in a deeper dive, the data file is available at https://osf.io/cu68f/ and the pre-registration at https://osf.io/7ejts.
To cut to the chase, this one is a clear win for Uli’s R-Index. We successfully replicated the standard effect documented in the prior literature (see Figure A), but there was not even a hint of our predicted moderation of that effect, which was the key goal of this replication exercise (see Figure B: Interaction F(1,1167)=.97, p=.325, and the nonsignificant mean differences don’t match predictions). Although I would have obviously preferred to replicate our prior work, given that we failed to do so, I’m pleased that there’s no hint of the effect so I don’t continue to think that maybe it’s hiding in there somewhere. For readers who have an interest in the problem itself, let me devote a few paragraphs to what we did and what we found. For those who are not interested in Darwinian Grandparenting, please skip ahead to Uli’s response.
Previous work has established that people tend to feel closest to their mother’s mother, then their mother’s father, then their father’s mother, and last their father’s father. We replicated this finding in our prior paper and replicated it again here as well. The evolutionary idea underlying the effect is that our mother’s mother knows with certainty that she’s related to us, so she puts greater effort into our care than other grandparents (who do not share her certainty), and hence we feel closest to her. Our mother’s father and father’s mother both have one uncertain link (due to the possibility of cuckoldry), and hence put less effort into our care than our mother’s mother, so we feel a little less close to them. Last on the list is our father’s father, who has two uncertain links to us, and hence we feel least close to him.
The puzzle that motivated our previous work lies in the difference between our mother’s father and father’s mother; although both have one uncertain link, most studies show that people feel closer to their mother’s father than their father’s mother. The explanation we had offered for this effect was based on the idea that our father’s mother often has daughters who often have children, providing her with a more certain outlet for her efforts and affections. According to this possibility, we should only feel closer to our mother’s father than our father’s mother when the latter has grandchildren via daughters, and that is what our prior paper had documented (in the form of a marginally significant interaction and predicted simple effects).
Our clear failure to replicate that finding suggests an alternative explanation for the data in Figure A:
People are closer to their maternal grandparents than their paternal grandparents (possibly for the reasons of genetic certainty outlined above).
People are closer to their grandmothers than their grandfathers (possibly because women tend to be more nurturant than men and more involved in childcare).
As a result of these two main effects, people tend to be closer to their mother’s father than their father’s mother, and this particular difference emerges regardless of the presence or absence of other more certain kin.
Does our failure to replicate mean that the presence or absence of more certain kin has no impact on grandparenting? Clearly not in the manner I expected, but that doesn’t mean it has no effect. Consider the following (purely exploratory, non-preregistered) analyses of these same data: After failing to find the predicted interaction above, I ran a series of regression analyses, in which closeness to maternal and paternal grandparents were the dependent variables and number of cousins via fathers’ and mothers’ brothers and sisters were the predictor variables. The results are the same whether we’re looking at grandmothers or grandfathers, so for the sake of simplicity, I’ve collapsed the data into closeness to paternal grandparents and closeness to maternal grandparents. Here are the regression tables:
We see three very small but significant findings here (all of which require replication before we have any confidence in them). First, people feel closer to their paternal grandparents to the degree that those grandparents are not also maternal grandparents to someone else (i.e., more cousins through fathers’ sisters are associated with less closeness to paternal grandparents). Second, people feel closer to their paternal grandparents to the degree that their maternal grandparents have more grandchildren through daughters other than their mother (i.e., more cousins through mothers’ sisters are associated with more closeness to paternal grandparents). Third, people feel closer to their maternal grandparents to the degree that those grandparents are not also maternal grandparents to someone else (i.e., more cousins through mothers’ sisters are associated with less closeness to maternal grandparents). Note that none of these effects emerged via cousins through father’s or mother’s brothers. These findings strike me as worthy of follow-up, as they suggest that the presence or absence of equally or more certain kin does indeed have a (very small) influence on grandparents in a manner that evolutionary theory would predict (even if I didn’t predict it myself).
Uli: Wow, I am impressed how quickly research with large samples can be done these days. That is good news for the future of social psychology, at least the studies that are relatively easy to do.
Bill: Agreed! But benefits rarely come without cost, and studies on the web are no exception. In this case, the ease of working on the web also distorts our field by pushing us to do the kind of work that is ‘web-able’ (e.g., self-report) or by getting us to wangle the methods to make them work on the web. Be that as it may, this study was a no-brainer, as it was my lowest R-Index and pure self-report. Unfortunately, my other papers with really low R-Indices aren’t as easy to go back and retest (although I’m now highly motivated to try).
Uli: Of course, I am happy that R-Index made the correct prediction, but N = 1 is not that informative.
Bill: Consider this N+1, as it adds to your prior record.
Uli: Maybe you set yourself up for failure by picking a marginally significant result.
Bill: That was exactly my goal. I still believed in the finding, so it was a great chance to pit your method against my priors. Not much point in starting with one of my results that we both agree is likely to replicate.
Uli: The R-Index analysis implied that we should only trust your results with p < .001.
Bill: That seems overly conservative to me, but of course I’m a biased judge of my own work. Out of curiosity, is that p value better when you analyze all my critical stats rather than just one per experiment? This strikes me as potentially important, because almost none of my papers would have been accepted based on just a single statistic; rather, they typically depend on a pattern of findings (an issue I mentioned briefly in our blog).
Uli: The rankings are based on automatic extraction of test statistics. Selecting focal tests would only lead to an even more conservative alpha criterion. To evaluate the alpha = .001 criterion, it is not fair to use a single p = .07 result. Looking at the original article about grandparent relationships, I see p < .001 for mother’s mother vs. mother’s father relationships. The other contrasts are just significant and do not look credible according to R-Index (predicting failure for same N). However, they are clearly significant in the replication study. So, R-Index made two correct predictions (one failure and one success), and two wrong predictions. Let’s call it a tie. 🙂
Bill: Kind of you, but still a big win for the R-Index. It’s important to keep in mind that many prior papers had found the other contrasts, whereas we were the first to propose and find the specific moderation highlighted in our paper. So a reasonable prior would set the probability much higher to replicate the other effects, even if we accept that many prior findings were produced in an era of looser research standards. And that, in turn, raises the question of whether it’s possible to integrate your R-Index with some sort of Bayesian prior to see if it improves predictive ability.
Your prediction markets v. R-Index blog makes the very good point that simple is better and the R-Index works awfully well without the work involved in human predictions. But when I reflect on how I make such predictions (I happened to be a participant in one of the early prediction market studies and did very well), I’m essentially asking whether the result in question is a major departure from prior findings or an incremental advance that follows from theory. When the former, I say it won’t replicate without very strong statistical evidence. When the latter, I say it will replicate. Would it be possible to capture that sort of Bayesian processing via machine learning and then use it to supplement the R-Index?
Uli: There is an article that tried to do this. Performance was similar to prediction markets. However, I think it is more interesting to examine the actual predictors that may contribute to the prediction of replication outcomes. For example, we know cognitive psychology and within-subject designs are more replicable than social psychology and between-subject designs. I don’t think, however, that we will get very far based on single questionable studies. Bias-corrected meta-analysis may be the only way to salvage robust findings from the era of p-hacking.
To broaden the perspective from this single article to your other articles, one problem with the personalized p-values is that they are aggregated across time. This may lead to overly conservative alpha levels (p < .001) for new research that was conducted in accordance with new rules about transparency, while the rules may be too liberal for older studies that were conducted in a time when awareness about the problems of selection for significance was lacking (say before 2013). Inspired by the “loss of confidence project” (Rohrer et al., 2021), I want to give authors the opportunity to exclude articles from their R-Index analysis that they no longer consider credible themselves. To keep track of these loss-of-confidence declarations, I am proposing to use PubPeer (https://pubpeer.com/). Once an author posts a note on PubPeer that declares loss of confidence in the empirical results of an article, the article will be excluded from the R-Index analysis. Thus, authors can improve their standing in the rankings and, more importantly, change the alpha level to a more liberal level (e.g., from .005 to .01) by (a) publicly declaring loss of confidence in a finding and (b) publishing new research with studies that have more power and honestly report non-significant results.
I hope that the incentive to move up in the rankings will increase the low rate of loss-of-confidence declarations and help us to clean up the published record faster. Declarations could also be partial. For example, for the 2005 article, you could post a note on PubPeer stating that the ordering of the grandparent relationships was successfully replicated, whereas the results for cousins were not, with a link to the data and, hopefully, eventually a publication. I would then remove this article from the R-Index analysis. What do you think about this idea?
Bill: I think this is a very promising initiative! The problem, as I see it, is that authors are typically the last ones to lose confidence in their own work. When I read through the recent ‘loss of confidence’ reports, I was pretty underwhelmed by the collection. Not that there was anything wrong with the papers in there, but rather that only a few of them surprised me.
Take my own case as an example. I obviously knew it was possible my result wouldn’t replicate, but I was very willing to believe what turned out to be a chance fluctuation in the data because it was consistent with my hypothesis. Because I found that hypothesis-consistent chance fluctuation on my first try, I would never have stated I have low confidence in it if you hadn’t highlighted it as highly improbable. In other words, there’s no chance I’d have put that paper on a ‘loss of confidence’ list without your R-Index telling me it was crap and even then it took a failure to replicate for me to realize you were right.
Thus, I would guess that uptake into the ‘loss of confidence’ list would be low if it emphasizes work that people feel was sloppy in the first place, not because people are liars, but because people are motivated reasoners.
With that said, if the collection also emphasizes work that people have subsequently failed to replicate, and hence have lost confidence in, I think it would be used much more frequently and could become a really valuable corrective. When I look at the Darwinian Grandparenting paper, I see that it’s been cited over 150 times on Google Scholar. I don’t know how many of those papers are citing it for the key moderation effect that we now know doesn’t replicate, but I hope that no one else will cite it for that reason after we publish this blog. No one wants other investigators to waste time following up their work once they realize the results aren’t reliable.
Uli: (feeling a bit blue today). I am not very optimistic that authors will take note of replication failures. Most studies are not conducted after a careful review of the existing literature or a meta-analysis that takes publication bias into account. As a result, citations in articles are often picked because they help to support a finding in an article. While p-hacking of data may have decreased over the past decade in some areas, cherry-picking of references is still common and widespread. I am not really sure how we can speed up the self-correction of science. My main hope is that meta-analyses are going to improve and take publication bias more seriously. Fortunately, new methods show promising results in debiasing effect size estimates (Bartoš, Maier, Wagenmakers, Doucouliagos, & Stanley, 2021). Z-curve is also being used by meta-analysts, and we are hopeful that z-curve 2.0 will soon be accepted for publication in Meta-Psychology (Bartos & Schimmack, 2021). Unfortunately, it will take another decade for these methods to become mainstream, and meanwhile many resources will be wasted on half-baked ideas that are grounded in a p-hacked literature. I am not optimistic that psychology will become a rigorous science during my lifetime. So, I am trying to make the best of it. Fortunately, I can just do something else when things are too depressing, like sitting in my backyard and watching Germany win at the Euro cup. Life is good, psychological science not so much.
Bill: I don’t blame you for your pessimism, but I completely disagree. You see a science that remains flawed when we ought to know better, but I see a science that has improved dramatically in the 35 years since I began working in this field. Humans are wildly imperfect actors who did not evolve to be dispassionate interpreters of data. We hope that training people to become scientists will debias them – although the data suggest that it doesn’t – and then we double down by incentivizing scientists to publish results that are as exciting as possible as rapidly as possible.
Thankfully, bias is both the problem and the solution, as other scientists are biased in favor of their theories rather than ours, and out of this messy process the truth eventually emerges. The social sciences are a dicier proposition in this regard, as our ideologies intersect with our findings in ways that are less common in the physical and life sciences. But so long as at least some social scientists feel free to go wherever the data lead them, I think our science will continue to self-correct, even if the process often seems painfully slow.
Uli: Your response to my post is a sign that progress is possible, but 1 out of 400 may just be the exception to the rule never to question your own results. Even researchers who know better become promoters of their own theories, especially when they become popular. I think the only way to curb false enthusiasm is to leave the evaluation of theories (review articles, meta-analyses) to independent scientists. The idea that one scientist can develop and evaluate a theory objectively is simply naive. Leaders of a paradigm are like strikers in soccer. They need to have blinders on to risk failure. We need meta-psychologists to distinguish real contributions from false ones. In this way meta-psychologists are like referees. Referees are not glorious heroes, but they are needed for a good soccer game, and they have the power to call off a goal because a player was offside or used their hands. The problem for science is the illusion that scientists can control themselves.
The past decade has revealed many flaws in the way psychologists conduct empirical tests of theories. The key problem is that psychologists lacked an accepted strategy to conclude that a prediction was not supported. This fundamental flaw can be traced back to Fisher’s introduction of significance testing. In Fisher’s framework, the null-hypothesis is typically specified as the absence of an effect in either direction; that is, the effect size is exactly zero. Significance testing examines how much empirical results deviate from this prediction. If the probability of the result, or of even more extreme deviations, is less than 5%, the null-hypothesis is rejected. However, if the p-value is greater than .05, no inferences can be drawn from the finding because there are two explanations for it: either the null-hypothesis is true, or it is false and the result is a false negative. The probability of such false negative results is unspecified in Fisher’s framework. This asymmetrical approach to significance testing continues to dominate psychological science.
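This asymmetry is easy to demonstrate with a small simulation: when power is low, a real effect routinely produces p > .05, and Fisher's framework offers no way to interpret that outcome. The numbers below are hypothetical, not data from any study discussed here; the test uses a normal approximation for simplicity.

```python
import math
import random
import statistics

# Simulate an underpowered one-sample test of a TRUE effect (d = 0.3, n = 20).
# A large share of runs yields p > .05: false negatives that Fisher's
# framework cannot distinguish from a true null.
random.seed(1)

def one_sample_p(n, true_mean):
    xs = [random.gauss(true_mean, 1.0) for _ in range(n)]
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / math.sqrt(n)
    # two-sided p via the normal approximation
    return math.erfc(abs(m / se) / math.sqrt(2))

misses = sum(one_sample_p(n=20, true_mean=0.3) > .05 for _ in range(2000))
print(misses / 2000)  # roughly the Type II (false negative) rate
```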
Criticism of this one-sided approach to significance testing is nearly as old as nil-hypothesis significance testing itself (Greenwald, 1975; Sterling, 1959). Greenwald’s (1975) article is notable because it provided a careful analysis of the problem and it pointed towards a solution to this problem that is rooted in Neyman-Pearson’s alternative to Fisher’s significance testing. Greenwald (1975) showed how it is possible to “Accept the Null-Hypothesis Gracefully” (p. 16).
“Use a range, rather than a point, null hypothesis. The procedural recommendations to follow are much easier to apply if the researcher has decided, in advance of data collection, just what magnitude of effect on a dependent measure or measure of association is large enough not to be considered trivial. This decision may have to be made somewhat arbitrarily but seems better to be made somewhat arbitrarily before data collection than to be made after examination of the data.” (p. 16).
The reason is simply that it is impossible to provide evidence for the nil-hypothesis that an effect size is exactly zero, just like it is impossible to show that an effect size equals any other precise value (e.g., r = .1). Although Greenwald made this sensible suggestion over 40 years ago, it is nearly impossible to find articles that specify a range of effect sizes a priori (e.g., we expected the effect size to be in the range between r = .3 and r = .5, or we expected the correlation to be larger than r = .1).
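Greenwald’s range-null idea is straightforward to operationalize: compute a confidence interval for the correlation and compare it to the pre-specified minimum effect size. A minimal sketch using the Fisher r-to-z approximation, with a hypothetical observed r and N:

```python
import math

# Range-null test: is the whole CI for r above a pre-specified minimum?
# The observed r and N below are hypothetical illustrations.
def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))

def r_ci(r, n, z_crit=1.96):
    se = 1 / math.sqrt(n - 3)          # SE of Fisher's z
    lo = fisher_z(r) - z_crit * se
    hi = fisher_z(r) + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to r

r_min = 0.10
lo, hi = r_ci(r=0.25, n=800)           # hypothetical study result
print(round(lo, 3), round(hi, 3))
print(lo > r_min)   # non-trivial effect supported only if CI excludes r_min
# conversely, hi < r_min would gracefully support the range null
```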
Bad training continues to be a main reason for the lack of progress in psychological science. However, other factors also play a role. First, specifying effect sizes a priori has implications for the specification of sample sizes. A researcher who declares that effect sizes as small as r = .1 are meaningful and expected needs large samples to obtain precise effect size estimates. For example, assuming the population correlation is r = .2 and a researcher wants to show that it is at least r = .1, a one-sided test with alpha = .05 and 95% power (i.e., the probability of a successful outcome) requires N = 1,035. As most sample sizes in psychology are below N = 200, most studies simply lack the precision to test hypotheses that predict small effects. A solution to this might be to focus on hypotheses that predict large effect sizes. However, to show that a population correlation of r = .4 is greater than r = .3 still requires N = 833 participants. In fact, most studies in psychology barely have enough power to demonstrate that moderate correlations, r = .3, are greater than zero, N = 138. In short, most studies are too small to provide evidence for the null-hypothesis that effect sizes are smaller than a minimum effect size. Not surprisingly, psychological theories are rarely abandoned because empirical results seemed to support the null-hypothesis.
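These sample-size figures can be reproduced with the standard Fisher r-to-z power approximation, n = ((z_alpha + z_beta) / (z(r1) - z(r0)))^2 + 3. The one-off discrepancies from the figures in the text (834 vs. 833, 139 vs. 138) presumably reflect rounding conventions in the power software used.

```python
import math

# Required N to distinguish a true correlation r_true from a boundary r_null,
# using the Fisher r-to-z approximation.
def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))

def n_for_correlation_test(r_true, r_null, z_alpha, z_beta=1.645):
    dz = fisher_z(r_true) - fisher_z(r_null)
    return ((z_alpha + z_beta) / dz) ** 2 + 3

# one-sided alpha = .05, power = .95: r = .2 vs minimum r = .1
print(round(n_for_correlation_test(0.2, 0.1, z_alpha=1.645)))  # 1035
# same test for r = .4 vs r = .3
print(round(n_for_correlation_test(0.4, 0.3, z_alpha=1.645)))  # 834
# two-sided alpha = .05, power = .95: r = .3 vs zero
print(round(n_for_correlation_test(0.3, 0.0, z_alpha=1.960)))  # 139
```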
However, occasionally studies do have large samples, and it would be possible to follow Greenwald’s (1975) recommendation to specify a minimum effect size a priori. For example, Greenwald and colleagues conducted a study with N = 1,411 participants who reported their intentions to vote for Obama or McCain in the 2008 US elections. The main hypothesis was that implicit measures of racial attitudes like the race IAT would add to the prediction because some White Democrats might not vote for a Black Democratic candidate. It would have been possible to specify a minimum effect size based on a meta-analysis that was published in the same year. This meta-analysis of smaller studies suggested that the average race IAT – criterion correlation was r = .236. The explicit – criterion correlation was r = .186, and the explicit-implicit correlation was only r = .117. Given the lower estimates for the explicit measures and the low explicit-implicit correlation, a regression analysis would only slightly reduce the effect size for the incremental predictive validity of the race IAT, b = .225. Thus, it would have been possible to test the hypothesis that the effect size is at least b = .1, which would imply that adding the race IAT as a predictor explains at least 1% additional variance in voting behaviors.
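For two standardized predictors, the incremental regression weight follows directly from the three correlations via the standard partial-regression formula. The sketch below is my own back-of-the-envelope check using the meta-analytic values quoted above, not the original authors' computation; it lands close to the b = .225 cited.

```python
# Two-predictor standardized regression: beta for predictor 1 from the
# three correlations (criterion with each predictor, and between predictors).
r_iat_y = 0.236     # race IAT - criterion (meta-analytic estimate)
r_exp_y = 0.186     # explicit measure - criterion
r_iat_exp = 0.117   # implicit - explicit

beta_iat = (r_iat_y - r_exp_y * r_iat_exp) / (1 - r_iat_exp ** 2)
print(round(beta_iat, 3))  # ~.217, close to the b = .225 cited in the text
```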
In reality, the statistical analyses were conducted with prejudice against the null-hypothesis. First, Greenwald et al. (2009) noted that “conservatism and symbolic racism were the two strongest predictors of voting intention (see Table 1)” (p. 247).
A straightforward way to test the hypothesis that the race IAT contributes to the prediction of voting would simply add the standardized race IAT as an additional predictor and use the regression coefficient to test the prediction that implicit bias as measured with the race IAT contributes to voting against Obama. A more stringent test of incremental predictive validity would also include the other explicit prejudice measures because measurement error alone can produce incremental predictive validity for measures of the same construct. However, this is not what the authors did. Instead, they examined whether the four racial attitude measures jointly predicted variance in addition to political orientation. This was the case, with 2% additional explained variance (p < .001). However, this result does not tell us anything about the unique contribution of the race IAT. The unique contributions of the four measures were not reported. Instead, another regression model tested whether the race IAT and a second implicit measure (the Affective Misattribution Procedure) explained incremental variance in addition to political orientation. In this model “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (p. 247). This model also does not tell us anything about the importance of the race IAT because it was not reported how much of the joint contribution was explained by the race IAT alone. The inclusion of the AMP also makes it impossible to test the statistical significance for the race IAT because most of the prediction may come from the shared variance between the two implicit measures, r = .218. Most important, the model does not test whether the race IAT predicts voting above and beyond explicit measures, including symbolic racism.
Another multiple regression analysis entered symbolic racism and the two implicit measures. In this analysis, the two implicit measures combined explained an additional 0.7% of the variance, but this was not statistically significant, p = .07.
They then fitted the model with all predictor variables. In this model, the four attitude measures explained an additional 1.3% of the variance, p = .01, but no information is provided about the unique contribution of the race IAT or the joint contribution of the two implicit measures. The authors merely comment that “among the four race attitude measures, the thermometer difference measure was the strongest incremental predictor and was also the only one of the four that was individually statistically significant in their simultaneous entry after both symbolic racism and conservatism” (p. 247).
To put it mildly, the presented results carefully avoid reporting the crucial result about the incremental predictive validity of the race IAT after explicit measures of prejudice are entered into the equation. Adding the AMP only creates confusion because the empirical question is how much the race IAT adds to the prediction of voting behavior. Whether this variance is shared with another implicit measure or not is not relevant.
Table 1 can be used to obtain the results that were not reported in the article. A regression analysis shows a standardized effect size estimate of 0.000 with a 95%CI that ranges from -.047 to .046. The upper limit of this confidence interval is below the minimum effect size of .1 that was used to specify a reasonable null-hypothesis. Thus, the only study that had sufficient precision to test the incremental predictive validity of the race IAT shows that the IAT does not make a meaningful, notable, practically significant contribution to the prediction of racial bias in voting. In contrast, several self-report measures did show that racial bias influenced voting behavior above and beyond the influence of political orientation.
Greenwald et al.’s (2009) article illustrates Greenwald’s (1975) prejudice against the null-hypothesis. Rather than reporting a straightforward result, they present several analyses that disguise the fact that the race IAT did not predict voting behavior. Based on these questionable analyses, the authors misrepresent the findings. For example, they claim that “both the implicit and explicit (i.e., self-report) race attitude measures successfully predicted voting.” They omit that this statement is only correct when political orientation and symbolic racism are not used as predictors.
They then argue that their results “supplement the substantial existing evidence that race attitude IAT measures predict individual behavior (reviewed by Greenwald et al., 2009)” (p. 248). This statement is false. The meta-analysis suggested that incremental predictive validity of the race IAT is r ~ .2, whereas this study shows an effect size of r ~ 0 when political orientation is taken into account.
The abstract, often the only information that is available or read, further misleads readers. “The implicit race attitude measures (Implicit Association Test and Affect Misattribution Procedure) predicted vote choice independently of the self-report race attitude measures, and also independently of political conservatism and symbolic racism. These findings support construct validity of the implicit measures” (p. 242). Careful reading of the results section shows that the statement refers to separate analyses in which implicit measures are tested controlling for explicit attitude ratings OR political orientation OR symbolic racism. The new results presented here show that the race IAT does not predict voting controlling for explicit attitudes AND political orientation AND symbolic racism.
The deceptive analysis of these data has led to many citations claiming that the race IAT is an important predictor of actual behavior. For example, in their popular book “Blindspot,” Banaji and Greenwald list this study as an example that “the Race IAT predicted racially discriminatory behavior. A continuing stream of additional studies that have been completed since publication of the meta-analysis likewise supports that conclusion. Here are a few examples of race-relevant behaviors that were predicted by automatic White preference in these more recent studies: voting for John McCain rather than Barack Obama in the 2008 U.S. presidential election” (p. 49).
Kurdi and Banaji (2017) use the study to claim that “investigators have used implicit race attitudes to predict widely divergent outcome measures” (p. 282), without noting that even the reported results showed less than 1% incremental predictive validity. A review of prejudice measures features this study as an example of predictive validity (Fiske & North, 2014).
Of course, a single study with a single criterion is insufficient to accept the null-hypothesis that the race IAT lacks incremental predictive validity. A new meta-analysis by Kurdi with Greenwald as co-author provides new evidence about the typical amount of incremental predictive validity of the race IAT. The only problem is that this information is not provided. I therefore analyzed the open data to get this information. The meta-analytic results suggest an implicit-criterion correlation of r = .100, se = .01, an explicit-criterion correlation of r = .127, se = .02, and an implicit-explicit correlation of r = .139, se = .022. A regression analysis yields an estimate of the incremental predictive validity for the race IAT of .084, 95%CI = .040 to .121. While this effect size is statistically significant in a test against the nil-hypothesis, it is also statistically different from Greenwald et al.’s (2009) estimate of b = .225. Moreover, the point estimate is below .1, which could be used to affirm the null-hypothesis, but the confidence interval includes a value of .1. Thus, there is a 20% chance (an 80%CI would not include .1) that the effect size is greater than .1, but it is unlikely (p < .05) that it is greater than .12.
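Given the three meta-analytic correlations just reported, the regression estimate can be reproduced with the textbook formula for standardized coefficients with two predictors. A minimal sketch in Python (the variable names are mine; the input values are the estimates quoted above):

```python
# Meta-analytic correlations reported above
r_ic = .100  # implicit measure (race IAT) with criterion
r_ec = .127  # explicit measure with criterion
r_ie = .139  # implicit-explicit correlation

# Standardized regression coefficients for two predictors:
# beta_1 = (r_y1 - r_y2 * r_12) / (1 - r_12^2)
beta_implicit = (r_ic - r_ec * r_ie) / (1 - r_ie ** 2)
beta_explicit = (r_ec - r_ic * r_ie) / (1 - r_ie ** 2)

print(round(beta_implicit, 3))  # 0.084, matching the estimate in the text
print(round(beta_explicit, 3))
```

The implicit coefficient of .084 confirms that controlling for a single explicit measure is enough to pull the IAT's unique contribution below the .1 threshold.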
Greenwald and Lai (2020) wrote an Annual Review article about implicit measures. It mentions that estimates of the predictive validity of IATs have decreased from r = .274 (Greenwald et al., 2009) to r = .097 (Kurdi et al., 2019). No mention is made of a range of effect sizes that would support the null-hypothesis that implicit measures do not add to the prediction of prejudice because they do not measure an implicit cause of behavior that is distinct from causes of prejudice that are reflected in self-report measures. Thus, Greenwald fails to follow the advice of his younger self to provide a strong test of a theory by specifying effect sizes that would provide support for the null-hypothesis and against his theory of implicit cognitions.
It is not only ironic to illustrate the prejudice against falsification with Greenwald’s own research. It also shows that the one-sided testing of theories that avoids failures does not merely reflect a lack of proper training in statistics or philosophy of science. After all, Greenwald demonstrated that he is well aware of the problems with nil-hypothesis testing. Thus, only motivated biases can explain the one-sided examination of the evidence. Once researchers have made a name for themselves, they are no longer neutral observers like judges or juries. They are more like prosecutors who will try as hard as possible to get a conviction and ignore evidence that may support a not-guilty verdict. To make matters worse, science does not really have an adversarial system where a defense lawyer stands up for the defendant (i.e., the null-hypothesis) and no evidence can be presented to support the defendant.
Once we realize the power of motivated reasoning, it is clear that we need to separate the work of theory development and theory evaluation. We cannot let researchers who developed a theory conduct meta-analyses and write review articles, just like we cannot ask film directors to write their own movie reviews. We should leave meta-analyses and reviews to a group of theoretical psychologists who do not conduct original research. As grant money for original research is extremely limited and a lot of time and energy is wasted on grant proposals, there is ample capacity for psychologists to become meta-psychologists. Their work also needs to be evaluated differently. The aim of meta-psychology is not to make novel discoveries, but to confirm that claims by original researchers about their discoveries are actually robust, replicable, and credible. Given the well-documented bias in the published literature, a lot of work remains to be done.
After I posted this post, I learned about a published meta-analysis and new studies of incidental anchoring by David Shanks and colleagues that came to the same conclusion (Shanks et al., 2020).
“The most expensive car in the world costs $5 million. How much does a new BMW 530i cost?”
According to anchoring theory, information about the most expensive car can lead to higher estimates for the cost of a BMW. Anchoring effects have been demonstrated in many credible studies since the 1970s (Kahneman & Tversky, 1973).
A more controversial claim is that anchoring effects occur even when the numbers are unrelated to the question and presented incidentally (Critcher & Gilovich, 2008). In one study, participants saw a picture of a football player and were asked to guess how likely it is that the player would sack the quarterback in the next game. The player’s jersey number was manipulated to be 54 or 94. The study produced a statistically significant result suggesting that a higher number makes people give higher likelihood judgments. This study started a small literature on incidental anchoring effects. A variation on this theme are studies that presented numbers so briefly on a computer screen that most participants did not actually see them. This is called subliminal priming. Allegedly, subliminal priming also produced anchoring effects (Mussweiler & Englich, 2005).
Since 2011, many psychologists have been skeptical about whether statistically significant results in published articles can be trusted. The reason is that researchers published only results that supported their theoretical claims, even when the claims were outlandish. For example, significant results suggested that extraverts can foresee where pornographic images will be displayed on a computer screen even before the computer randomly selects the location (Bem, 2011). No psychologist, except Bem, believes these findings. More problematic is that many other findings are equally incredible. A replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). So, the question is whether incidental and subliminal anchoring are more like classic anchoring or more like extrasensory perception.
There are two ways to assess the credibility of published results when publication bias is present. One approach is to conduct credible replication studies that are published independently of the outcome of a study. The other approach is to conduct a meta-analysis of the published literature that corrects for publication bias. A recent article used both methods to examine whether incidental anchoring is a credible effect (Kvarven et al., 2020). In this article, the two approaches produced inconsistent results. The replication study produced a non-significant result with a tiny effect size, d = .04 (Klein et al., 2014). However, even with bias-correction, the meta-analysis suggested a significant, small to moderate effect size, d = .40.
The data for the meta-analysis were obtained from an unpublished thesis (Henriksson, 2015). I suspected that the meta-analysis might have coded some studies incorrectly. Therefore, I conducted a new meta-analysis, using the same studies and one new study. The main difference between the two meta-analyses is that I coded studies based on the focal hypothesis test that was used to claim evidence for incidental anchoring. The p-values were then transformed into fisher-z transformed correlations and sampling errors, 1/sqrt(N – 3), based on the sample sizes of the studies.
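The coding step can be sketched as follows. This is a plausible reconstruction of the conversion, not the exact script used here: the two-sided p-value is converted into a test statistic (using a normal approximation), which is then scaled by the Fisher-z sampling error, 1/sqrt(N - 3). The function name and example values are mine.

```python
from math import sqrt
from statistics import NormalDist

def p_to_fisher_z(p, n):
    """Convert a two-sided p-value and sample size into a
    fisher-z transformed correlation and its sampling error.
    Assumes the reported effect is in the predicted direction."""
    z_stat = NormalDist().inv_cdf(1 - p / 2)  # p-value -> z-statistic
    se = 1 / sqrt(n - 3)                      # sampling error of fisher-z
    fz = z_stat * se                          # effect size implied by z and N
    return fz, se

# A just-significant result in a small sample implies a sizable effect
fz, se = p_to_fisher_z(0.05, 40)
print(round(fz, 3), round(se, 3))
```

The same p = .05 in a sample of N = 400 would imply an effect size roughly a third as large, which is why selection for significance inflates effect sizes most strongly in small studies.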
Whereas the old meta-analysis suggested that there is no publication bias, the new meta-analysis showed a clear relationship between sampling error and effect sizes, b = 1.68, se = .56, z = 2.99, p = .003. Correcting for publication bias produced a non-significant intercept, b = .039, se = .058, z = 0.672, p = .502, suggesting that the real effect size is close to zero.
Figure 1 shows the regression line for this model in blue and the results from the replication study in green. We see that the blue and green lines intersect when sampling error is close to zero. As sampling error increases because sample sizes are smaller, the blue and green lines diverge more and more. This shows that effect sizes in small samples are inflated by selection for significance.
However, there is some statistically significant variability in the effect sizes, I2 = 36.60%, p = .035. To further examine this heterogeneity, I conducted a z-curve analysis (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). A z-curve analysis converts p-values into z-statistics. The histogram of these z-statistics shows publication bias, when z-statistics cluster just above the significance criterion, z = 1.96.
Figure 2 shows a big pile of just significant results. As a result, the z-curve model predicts a large number of non-significant results that are absent. While the published articles have a 73% success rate (the observed discovery rate), the model estimates that the expected discovery rate is only 6%. That is, for every 100 tests of incidental anchoring, only 6 studies are expected to produce a significant result. To put this estimate in context, with alpha = .05, 5 studies are expected to be significant based on chance alone. The 95% confidence interval around this estimate includes 5% and reaches only 26% at the upper end. Thus, researchers who reported significant results did so based on studies with very low power, and they needed luck or questionable research practices to get significant results.
A low discovery rate implies a high false positive risk. With an expected discovery rate of 6%, the false discovery risk is 76%. This is unacceptable. To reduce the false discovery risk, it is possible to lower the alpha criterion for significance. In this case, lowering alpha to .005 produces a false discovery risk of 5%. This leaves 5 studies that are significant.
One notable study with strong evidence, z = 3.70, examined anchoring effects for actual car sales. The data came from an actual auction of classic cars. The incidental anchors were the prices of the previous bid for a different vintage car. Based on sales data of 1,477 cars, the authors found a significant effect, b = .15, se = .04 that translates into a standardized effect size of d = .2 (fz = .087). Thus, while this study provides some evidence for incidental anchoring effects in one context, the effect size estimate is also consistent with the broader meta-analysis that effect sizes of incidental anchors are fairly small. Moreover, the incidental anchor in this study is still in the focus of attention and in some way related to the actual bid. Thus, weaker effects can be expected for anchors that are not related to the question at all (a player’s number) or anchors presented outside of awareness.
In sum, the published evidence for incidental anchoring cannot be taken at face value. Consistent with research practices in general, studies on incidental and subliminal anchoring suffer from publication bias that undermines the credibility of the published results. Unbiased replication studies and bias-corrected meta-analyses suggest that incidental anchoring effects are either very small or zero. Thus, there exists currently no empirical support for the notion that irrelevant numeric information can bias numeric judgments. More research on anchoring effects that corrects for publication bias is needed.
Social psychology suffers from a replication crisis because publication bias undermines the evidential value of published significant results. Meta-analyses that do not correct for publication bias are biased and cannot be used to estimate effect sizes. Here I show that a meta-analysis of the ease-of-retrieval effect (Weingarten & Hutchinson, 2018) did not fully correct for publication bias and that 200 significant results for the ease-of-retrieval effect can be fully explained by publication bias. This conclusion is consistent with the results of the only registered replication study of ease of retrieval (Groncki et al., 2021). As a result, there is no empirical support for the ease-of-retrieval effect. Implications for the credibility of social psychology are discussed.
Until 2011, social psychology appeared to have made tremendous progress. Daniel Kahneman (2011) reviewed many of the astonishing findings in his book “Thinking, Fast and Slow.” His book used Schwarz et al.’s (1991) ease-of-retrieval research as an example of rigorous research on social judgments.
The ease-of-retrieval paradigm is simple. Participants are assigned to two groups. In one group, they are asked to recall a small number of examples from memory. The number is chosen to make it easy to do this. In the other condition, participants are asked to recall a larger number of examples. The number is chosen so that it is hard to come up with the requested number of examples. This task is used to elicit a feeling of ease or difficulty. Hundreds of studies have used this paradigm to study the influence of ease-of-retrieval on a variety of judgments.
In the classic studies that introduced the paradigm, participants were asked to retrieve a few or many examples of assertiveness behaviors before answering a question about their assertiveness. Three studies suggested that participants based their personality judgments on the ease of retrieval.
However, this straightforward finding is not always found. Kahneman points out that participants sometimes do not rely on the ease of retrieval. Paradoxically, they sometimes rely on the number of examples they retrieved even though the number was given by the experimenter. What made ease-of-retrieval a strong theory was that ease of retrieval researchers seemed to be able to predict the conditions that made people use ease as information and the conditions when they would use other information. “The proof that you truly understand a pattern of behavior is that you know how to reverse it” (Kahneman, 2011).
This success story had one problem. It was not true. In 2011, it became apparent that social psychologists used questionable research practices to produce significant results. Thus, rather than making amazing predictions about the outcome of studies, they searched for statistical significance and then claimed that they predicted these effects (John, Loewenstein, & Prelec, 2012; Kerr, 1998). Since 2011, it has become clear that only a small percentage of results in social psychology can be replicated without questionable practices (Open Science Collaboration, 2015).
I had my doubts about the ease-of-retrieval literature because I had heard rumors that researchers were unable to replicate these effects, but it was common not to publish these replication failures. My suspicions appeared to be confirmed when John Krosnick gave a talk about a project that replicated 12 experiments in a large nationally representative sample. All but one experiment were successfully replicated. The exception was the ease-of-retrieval study, a direct replication of Schwarz et al.’s (1991) assertiveness studies. These results were published several years later (Yeager et al., 2019).
I was surprised when Weingarten and Hutchinson (2018) published a detailed and comprehensive meta-analysis of published and unpublished ease-of-retrieval studies and found evidence for a moderate effect size (d ~ .4) even after correcting for publication bias. This conclusion, based on many small studies, seemed inconsistent with the replication failure in the large national representative sample (Yeager et al., 2019). Moreover, the first pre-registered direct replication of Schwarz et al. (1991) also produced a replication failure (Groncki et al., 2021). One possible explanation for the discrepancy between the meta-analytic results and the replication results could be that the meta-analysis did not fully correct for publication bias. To test this hypothesis, I used the openly shared data to examine the robustness of the effect size estimate. I also conducted a new meta-analysis that included studies published after 2014, using a different coding of studies that codes only one focal hypothesis test per study. The results showed that the effect size estimate in Weingarten and Hutchinson’s (2018) meta-analysis is not robust and depends heavily on outliers. I also find that the coding scheme attenuates the detection of bias, which leads to inflated effect size estimates. The new meta-analysis shows an effect size estimate close to zero. It also shows that heterogeneity is fully explained by publication bias.
Reproducing the Original Meta-Analysis
All effect sizes are Fisher-z transformed correlation coefficients. The predictor is the standard error; 1/sqrt(N – 3). Figure 1 reproduces the funnel plot in Weingarten and Hutchinson (2018), with the exception that sampling error is plotted on the x-axis and effect sizes are plotted on the y-axis.
Figure 1 also includes the predictions (regression lines) for three models. The first model is an unweighted average. This model assumes that there is no publication bias. The straight orange line shows that this model assumes an average effect size of z = .23 for all sample sizes. The second model assumes that there is publication bias and that bias increases in a linear fashion with sampling error. The slope of the blue regression line is significant and suggests that publication bias is present. The intercept of this model can be interpreted as the unbiased effect size estimate (Stanley, 2017). The intercept is z = .115 with a 95% confidence interval that ranges from .036 to .193. These results reproduce the results in Weingarten and Hutchinson (2018) closely, but not exactly, r = .104, 95%CI = .034 to .172. Simulation studies suggest that this effect size estimate underestimates the true effect size when the intercept is significantly different from zero (Stanley, 2017). In this case, it is recommended to use the variance (sampling error squared) as a model of publication bias. The red curve shows the predictions of this model. Most important, the intercept is now nearly at the same level as the model without publication bias, z = .221, 95%CI = .174 to .267. Once more, these results closely reproduce the published results, r = .193, 95%CI = .153 to .232.
The problem with unweighted models is that data points from small studies are given the same weight as data points from large studies. In this particular case, small studies are given even more weight than larger studies because studies with extremely small sample sizes (N < 20) are outliers, and outliers are weighted more heavily in regression analysis. Inspection of the scatter plot shows that 7 studies with sample sizes less than 10 (5 per condition) have a strong influence on the regression line. As a result, all three regression lines in Figure 1 overestimate effect sizes for studies with more than 100 participants. Thus, the intercept overestimates the effect sizes for large studies, including Yeager et al.’s (2019) study with N = 1,323 participants. In short, the effect size estimate in the meta-analysis is strongly influenced by 7 data points that represent fewer than 100 participants.
A simple solution to this problem is to weight observations by sample size so that larger samples are given more weight. This is actually the default option for many meta-analysis programs like the metafor package in R (Viechtbauer, 2010). Thus, I reran the same analyses with weighting of observations by sample size. Figure 2 shows the results. In Figure 2 the size of observations reflects weights. The most important difference in the results is that the intercept for the model with a linear effect of sampling error is practically zero and not statistically significant, z = .006, 95%CI = -.040 to .052. The confidence interval is small enough to infer that the typical effect size is close enough to zero to accept the null-hypothesis.
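The logic of these regression models can be illustrated with a small simulation. The data below are hypothetical, constructed so that observed effect sizes are inflated in proportion to sampling error while the true effect is zero; both an unweighted regression of effect sizes on sampling error and its sample-size-weighted version should then recover an intercept near zero:

```python
from math import sqrt

# Hypothetical studies: a true effect of zero (Fisher-z) plus
# inflation proportional to sampling error (simulated publication bias)
true_fz = 0.0
bias_slope = 1.5
sizes = [8, 12, 20, 30, 50, 80, 150, 300, 600, 1200]
wiggle = [0.02, -0.02] * 5  # small deterministic "noise"

se = [1 / sqrt(n - 3) for n in sizes]
fz = [true_fz + bias_slope * s + w for s, w in zip(se, wiggle)]

def regress(x, y, w):
    """Weighted least squares for y = a + b*x (closed form)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y)) \
        / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    a = ybar - b * xbar
    return a, b

# Unweighted: every study counts equally, so tiny studies dominate
a_u, b_u = regress(se, fz, [1] * len(sizes))
# Weighted by sample size: large studies count more
a_w, b_w = regress(se, fz, sizes)

print(round(a_u, 3), round(b_u, 2))  # intercept near 0, slope near 1.5
print(round(a_w, 3), round(b_w, 2))
```

In this clean simulation both intercepts land near the true effect of zero; the substantive point in the text is that with real, messy data the unweighted intercept is pulled around by a handful of extreme small-sample observations, whereas weighting by sample size keeps it anchored to the informative studies.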
Proponents of ease-of-retrieval will, however, not be satisfied with this answer. First, inspection of Figure 2 shows that the intercept is now strongly influenced by a few large samples. Moreover, the model does show heterogeneity in effect sizes, I2 = 33.38%, suggesting that at least some of the significant results were based on real effects.
Coding of Studies
Effect size meta-analysis evolved without serious consideration of publication bias. Although publication bias has been known to be present since meta-analysis was invented (Sterling, 1959), it was often an afterthought rather than part of the meta-analytic model (Rosenthal, 1979). Without having to think about publication bias, it became a common practice to code individual studies without a focus on the critical test that was used to publish a study. This practice obscures the influence of publication bias and may lead to an overestimation of the average effect size. To illustrate this, I am going to focus on the 7 data points in Figure 1 that were coded with sample sizes less than 10.
Six of the observations stem from an unpublished dissertation by Bares (2007) that was supervised by Norbert Schwarz. The dissertation was a study with children. The design had the main manipulation of ease of retrieval (few vs. many) as a between-subjects factor. Additional factors were gender, age (kindergartners vs. second graders), and 5 content domains (books, shy, friendly, nice, mean). The key dependent variables were frequency estimates. The total sample size was 198, with 98 participants in the easy condition and 100 in the difficult condition. The hypothesis was that ease-of-retrieval would influence judgments independent of gender or content. However, rather than testing the overall main effect across all participants, the dissertation presents analyses separately for different ages and contents. This led to the coding of this study with a reasonable sample size of N = 198 as 20 effects with sample sizes of N = 7 to 9. Only six of these effects were included in the meta-analysis. Thus, the meta-analysis added 6 studies with non-significant results, when there was only one study with non-significant results that was put in the file-drawer. As a result, the meta-analysis no longer represents the amount of publication bias in the ease-of-retrieval literature. Adding these six effects to the meta-analysis makes the data look less biased and attenuates the regression of effect sizes on sampling error, which in turn leads to a higher intercept. Thus, traditional coding of effect sizes in meta-analyses can lead to inflated effect size estimates even in models that aim to correct for publication bias.
An Updated Meta-Analysis of Ease-of-Retrieval
Building on Weingarten and Hutchinson’s (2018) meta-analysis, I conducted a new meta-analysis that relied on test statistics that were reported to test ease-of-retrieval effects. I only used published articles because the only reason to search for unpublished studies is to correct for publication bias. However, Weingarten and Hutchinson’s meta-analysis showed that publication bias is still present even with a diligent attempt to obtain all data. I extended the time frame of the meta-analysis by searching for new publications since the last year that was included in Weingarten and Hutchinson’s meta-analysis (i.e., 2014). For each study, I looked for the focal hypothesis test of the ease-of-retrieval effect. In some studies, this was a main effect. In other studies, it was a post-hoc test following an interaction effect. The exact p-values were converted into t-values and t-values were converted into fisher-z scores as effect sizes. Sampling error was based on the sample size of the study or the subgroup in which the ease of retrieval effect was predicted. For the sake of comparability, I again show unweighted and weighted results.
The effect size estimate for the random effects model that ignores publication bias is z = .340, 95%CI = .317 to .363. This would be a moderate effect size (d ~ .6). The model also shows a moderate amount of heterogeneity, I2 = 33.48%. Adding sampling error as a predictor dramatically changes the results. The effect size estimate is now practically zero, z = .020, and the 95%CI is small enough to conclude that any effect would be small, 95%CI = -.048 to .088. Moreover, publication bias fully explains heterogeneity, I2 = 0.00%. Based on this finding, it is not recommended to use the variance as a predictor (Stanley, 2017). However, for the sake of comparison, Figure 3 also shows the results for this model. The red curve shows that the model makes similar predictions in the middle, but overestimates effect sizes for large samples and for small samples. Thus, the intercept is not a reasonable estimate of the average effect size, z = .183, 95%CI = .144 to .222. In conclusion, the new coding shows clearer evidence of publication bias, and even the unweighted analysis shows no evidence that the average effect size differs from zero.
Figure 4 shows that the weighted models produce very similar results to the unweighted results.
The key finding is that the intercept is not significantly different from zero, z = -.016, 95%CI = -.053 to .022. The upper bound of the 95%CI corresponds to an effect size of r = .022 or d = .04. Thus, the typical ease of retrieval effect is practically zero and there is no evidence of heterogeneity.
Meta-analysis treats individual studies as interchangeable tests of a single hypothesis. This makes sense when all studies are more or less direct replications of the same experiment. However, meta-analyses in psychology often combine studies that vary in important details such as the population (adults vs. children) and the dependent variables (frequency judgments vs. attitudes). Even if a meta-analysis showed a significant average effect size, it would remain unclear which particular conditions show the effect and which ones do not. This is typically examined in moderator analyses, but when publication bias is strong and effect sizes are dramatically inflated, moderator analyses have low power to detect signals in the noise.
In Figure 4, real moderators would produce systematic deviations from the blue regression line. As these residuals are small and strongly influenced by sampling error, finding a moderator is like looking for a needle in a haystack. To do so, it is useful to look for individual studies that produced more credible results than the average study. A new tool that can be used for this purpose is z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020).
Z-curve does not decompose p-values into separate effect size and sampling error components. Rather it converts p-values into z-scores and models the distribution of z-scores with a finite mixture model. The results provide complementary information about publication bias that does not rely on variation in sample sizes. As correlations between sampling error and effect sizes can be produced by other factors, z-curve provides a more direct test of publication bias.
Z-curve also provides information about the false positive risk in individual studies. If a literature has a low discovery rate (many studies produce non-significant results), the false discovery risk is high (Soric, 1989). Z-curve estimates the size of the file drawer and provides a corrected estimate of the expected discovery rate. To illustrate z-curve, I fitted z-curve to the ease-of-retrieval studies in the new meta-analysis (Figure 5).
Visual inspection shows that most z-statistics are just above the criterion for statistical significance, z = 1.96. This corresponds to the finding that most effect sizes are about twice the magnitude of sampling error, which produces a just-significant result. The distribution also shows a rapid decline in the frequency of z-statistics as z-values increase. The z-curve model uses the shape of this distribution to estimate the expected discovery rate; that is, the proportion of significant results that would be observed if all tests that were conducted were available. The estimate of 8% implies that most ease-of-retrieval tests are extremely underpowered and can produce significant results only with the help of sampling error. Thus, most of the observed effect size estimates in Figure 4 reflect sampling error rather than population effect sizes.
The expected discovery rate can be compared with the observed discovery rate to assess the amount of publication bias. The observed discovery rate, the percentage of significant results among the 128 studies, is 83% and would be even higher if marginally significant results, p < .10 (z > 1.65), were counted as significant. Thus, the observed discovery rate is 10 times higher than the expected discovery rate. This shows massive publication bias.
The difference between the expected and observed discovery rates is also important for the assessment of the false positive risk. As Soric (1989) showed, the risk of false positives increases as the discovery rate decreases. The observed discovery rate of 83% implies that the false positive risk is very small (1%). Thus, readers of journals are given the illusion that ease-of-retrieval effects are robust and that researchers have a very good understanding of the conditions that produce the effect. Hence Kahneman's praise of researchers' ability to show the effect and to reverse it seemingly at will. The z-curve results show that this is an illusion because researchers only publish results when a study was successful. With an expected discovery rate of 8%, the false discovery risk is 61%. Thus, there is a high chance that studies with large samples will produce effect size estimates close to zero, which is consistent with the near-zero effect size estimates in Figure 4.
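Soric's (1989) upper bound on the false discovery risk can be written as a one-line function; plugging in the observed and expected discovery rates reproduces the two estimates above (the function name is my own):

```python
def soric_fdr(discovery_rate: float, alpha: float = 0.05) -> float:
    """Soric's (1989) upper bound on the false discovery risk,
    given the rate of significant results among all tests."""
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# The observed discovery rate of 83% implies a negligible risk ...
print(round(soric_fdr(0.83), 2))  # 0.01
# ... but the bias-corrected expected discovery rate of 8% does not.
print(round(soric_fdr(0.08), 2))  # 0.61
```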
One solution to reduce the false-positive risk is to lower the significance criterion (Benjamin et al., 2017). Z-curve can be fitted with different alpha-levels to examine the influence on the false positive risk. By setting alpha to .005, the false positive risk is below 5% (Figure 6).
This leaves 36 studies that may have produced a real effect. However, a true positive result does not mean that a direct replication study will produce a significant result. To estimate replicability, we can select only the studies with p < .005 (z > 2.8) and fit z-curve to them using the standard significance criterion of .05. The false discovery risk inches up a bit but may be considered acceptable at 8%. However, the expected replication rate with the same sample sizes is only 47%. Thus, replication studies need larger samples to avoid false negative results.
Five of the studies with strong evidence are by Sanna and colleagues. This is noteworthy because Sanna retracted 8 articles, including an article with ease-of-retrieval effects under suspicion of fraud (Yong, 2012). It is therefore unlikely that these studies provide credible evidence for ease-of-retrieval effects.
An article with three studies reported consistently strong evidence (Ofir et al., 2008). All studies manipulated the ease of recall of products and found that recalling a few low priced items made participants rate a store as less expensive than recalling many low priced items. It seems simple enough to replicate this study to test the hypothesis that ease of retrieval effects influence judgments of stores. Ease of retrieval may have a stronger influence for these judgments because participants may have less chronically accessible and stable information to make these judgments. In contrast, assertiveness judgments may be harder to move because people have highly stable self-concepts that show very little situational variance (Eid & Diener, 2004; Anusic & Schimmack, 2016).
Another article that provided three studies examined willingness to pay for trips to England (Sinha & Naykankuppam, 2013). A major difference from other studies was that participants were supplied with information about tourist destinations in England and, after a delay, recall of this information was used to manipulate ease of retrieval. Once more, ease of retrieval may have had an effect in these studies because participants had little chronically accessible information with which to make willingness-to-pay judgments.
A third three-study article with strong evidence found that participants rated the quality of their memory for specific events (e.g., New Year's Eve) as worse when they were asked to recall many (vs. few) facts about the event (Echterhoff & Hirst, 2006). These results suggest that ease of retrieval informs judgments about memory but may not influence other judgments.
The focus on individual studies shows why moderator analyses in effect-size meta-analyses often produce non-significant results. Most of the moderators that can be coded across studies are not relevant, whereas the relevant moderators may be limited to a single article and are therefore not coded.
The Original Paradigm
It is not clear why Schwarz et al. (1991) decided to manipulate personality ratings of assertiveness. A look into the personality literature suggests that these judgments are made quickly and with high temporal stability. Thus, they seemed a challenging target for demonstrating situational influences.
It was also risky to conduct these studies with small sample sizes that require large effect sizes to produce significant results. Nevertheless, the first study with 36 participants produced an encouraging, marginally significant result, p = .07. Study 2 followed up on this result with a larger sample to boost power and did produce a significant result, F(1, 142) = 6.35, p = .01. However, observed power (70%) was still below the recommended level of 80%. Thus, the logical next step would have been to test the effect again with an even larger sample. Instead, the authors tested a moderator hypothesis in a smaller sample, which surprisingly produced a significant three-way interaction, F(1, 70) = 9.75, p < .01. Despite this strong interaction, the predicted ease-of-retrieval effects were not statistically significant because sample sizes were very small, assertive: t(18) = 1.55, p = .14; unassertive: t(18) = 1.91, p = .07.
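The 70% observed power figure can be approximated from the reported test statistic by treating the square root of F (with one numerator degree of freedom) as a z-score. A minimal sketch, assuming a normal approximation and ignoring the negligible probability of significance in the wrong direction:

```python
from math import sqrt
from statistics import NormalDist

def observed_power(f_stat: float, alpha: float = 0.05) -> float:
    """Approximate post-hoc power for an F-test with 1 numerator df,
    treating sqrt(F) as the observed z-score (normal approximation)."""
    z_obs = sqrt(f_stat)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(z_obs - z_crit)

# Study 2's F(1, 142) = 6.35 implies roughly 70% observed power.
print(round(observed_power(6.35), 2))  # 0.71
```

Note that a just-significant result (F = 1.96^2) implies only 50% observed power, which is why piles of just-significant results are a red flag.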
It is unlikely that three underpowered studies would all produce supportive evidence (Schimmack, 2012), suggesting that the reported results were selected from a larger set of tests. This hypothesis can be tested with the Test of Insufficient Variance (TIVA), a bias test for small sets of studies (Renkewitz & Keiner, 2019; Schimmack, 2015). TIVA shows that the variation in p-values is less than expected, although the evidence is not conclusive. Nevertheless, even if the authors were just lucky, future studies would be expected to produce non-significant results unless sample sizes are increased considerably. However, most direct replication studies of the original design used equally small sample sizes, yet reported successful outcomes.
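The logic of TIVA can be illustrated with a short script: convert the p-values to z-scores, compute their variance, and compare it against the variance of 1 expected without selection, using a left-tailed chi-square test. This is a minimal sketch of the idea, not the published implementation; the incomplete-gamma series stands in for a chi-square CDF, which Python's standard library lacks.

```python
from math import exp, lgamma, log
from statistics import NormalDist, variance

def chi2_cdf(x: float, df: float) -> float:
    """Lower chi-square CDF via the regularized incomplete gamma series."""
    s, half_x = df / 2, x / 2
    term = total = 1 / s
    n = 0
    while term > 1e-12 * total:
        term *= half_x / (s + n + 1)
        total += term
        n += 1
    return total * exp(-half_x + s * log(half_x) - lgamma(s))

def tiva(p_values: list) -> float:
    """Test of Insufficient Variance: probability of a variance of
    z-scores this small if all conducted studies were reported."""
    z = [NormalDist().inv_cdf(1 - p / 2) for p in p_values]
    k = len(z)
    return chi2_cdf((k - 1) * variance(z), k - 1)

# Hypothetical example: three just-significant p-values cluster
# far more tightly than honest sampling error would allow.
print(tiva([0.047, 0.041, 0.036]) < 0.05)  # True
```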
Yahalom and Schul (2016) reported a successful replication in another small sample (N = 20), with an inflated effect size estimate, t(18) = 2.58, p < .05, d = 1.15. Rather than showing the robustness of the effect, this strengthens the evidence that bias is present, TIVA p = .05. Another study in the same article found evidence for the effect again, but only when participants were instructed to hear some background noise and not when they were instructed to listen to background noise, t(25) = 2.99, p = .006. The bias test remained significant, TIVA p = .05. Kuehnen did not find the effect, but claimed an interaction with the order of questions about ease of retrieval and assertiveness. When ease-of-retrieval questions were asked first, only a non-significant trend emerged, which was not reported, t(34) = 1.10, p = .28. The bias test remained significant, TIVA p = .08. More evidence from small samples comes from Caruso (2008). In Study 1a, 30 participants showed an ease-of-retrieval effect, F(1, 56) = 6.91, p = .011. The bias test remained significant, TIVA p = .06. In Study 1b, with more participants (N = 55), the effect was not significant, F(1, 110) = 1.05, p = .310. The bias test remained significant despite the non-significant result, TIVA p = .08. Tormala et al. (2007) added another just-significant result with 79 participants, t(72) = 1.97, p = .05, which only strengthens the evidence of bias, TIVA p = .05. Yahalom and Schul (2013) also found a just-significant effect with 130 students, t(124) = 2.54, only to strengthen the evidence of bias further, TIVA p = .04. Study 2 reduced the number of participants to 40, yet reported a significant result, F(1, 76) = 8.26, p = .005. Although this p-value nearly reached the .005 level, there is no theoretical explanation why this particular direct replication of the original finding should have produced a real effect. Evidence for bias remained significant, TIVA p = .05.
Study 3 reverted to a marginally significant result that only strengthened the evidence of bias, t(114) = 1.92, p = .06, TIVA p = .02. Greifeneder and Bless (2007) manipulated cognitive load and found the predicted trend only in the low-load condition, t(76) = 1.36, p = .18. Evidence for bias remained unchanged, TIVA p = .02.
In conclusion, from 1991 to 2016 published studies appeared to replicate the original findings, but this evidence is not credible because there is evidence of publication bias. Not a single one of these studies produced a p-value below .005, which has been suggested as a significance level that keeps the type-I error rate at an acceptable level (Benjamin et al., 2017).
Even meta-analyses of these small studies that correct for bias are inconclusive because sampling error is large and effect size estimates are imprecise. The only way to provide strong and credible evidence is to conduct a transparent and ideally pre-registered replication study with a large sample. One study like this was published by Yeager et al. (2019). With N = 1,325 participants the study failed to show a significant effect, F(1, 1323) = 1.31, p = .25. Groncki et al. (2021) conducted the first pre-registered replication study with N = 659 participants. They also ended up with a non-significant result, F(1, 657) = 1.34, p = .25.
These replication failures are only surprising if the inflated observed discovery rate is used to predict the outcome of future studies. On that basis, we would expect at least an 80% probability of a significant result, and an even higher probability given the larger sample sizes. However, when we take publication bias into account, the expected discovery rate is only 8%, and even large samples will not produce significant results if the true effect size is close to zero.
In conclusion, the clear evidence of bias and the replication failures in two large replication studies suggest that the original findings were only obtained with luck or with questionable research practices. However, naive interpretation of these results created a literature with over 200 published studies without a real effect. In this regard, ease of retrieval is akin to the ego-depletion literature that is now widely considered invalid (Inzlicht, Werner, Briskin, & Roberts, 2021).
The year 2011 was a watershed moment in the history of social psychology. It split the field into two camps. One camp denies that questionable research practices undermine the validity of published results and continues to rely on published studies as credible empirical evidence (Schwarz & Strack, 2016). The other camp assumes that most published results are false positives and trusts only new studies that follow open science practices, with badges for sharing of materials and data and, ideally, pre-registration.
Meta-analysis can help to find a middle ground by examining carefully whether published results can be trusted even when some publication bias is present. To do so, meta-analyses have to take publication bias seriously. Given the widespread use of questionable research practices in social psychology, we have to assume that bias is present (Schimmack, 2020). Published meta-analyses that did not properly correct for publication bias can at best provide an upper limit for effect sizes; they cannot establish that an effect exists or that its size has practical significance.
Weingarten and Hutchinson (2018) tried to correct for publication bias by using the PET-PEESE approach (Stanley, 2017). This is currently the best bias-correction method, but it is by no means perfect (Hong & Reed, 2021; Stanley, 2017). Here I demonstrated one pitfall in the use of PET-PEESE. Coding of studies that does not match the bias in the original articles can obscure the amount of bias and lead to inflated effect size estimates, especially if the PET model is incorrectly rejected and the PEESE results are accepted at face value. As a result, the published effect size of r = .2 (d = .4) was dramatically inflated and new results suggest that the effect size is close to zero.
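The mechanics of PET-PEESE can be sketched as two weighted regressions: PET regresses effect sizes on their standard errors, PEESE on squared standard errors, and the PEESE intercept is used only if the PET intercept differs significantly from zero. The toy data below are made up for illustration; the observed effects track sampling error perfectly, so the PET intercept correctly recovers a true effect of zero.

```python
def wls_intercept(x, y, weights):
    """Weighted least-squares intercept for y = b0 + b1 * x."""
    sw = sum(weights)
    swx = sum(w * xi for w, xi in zip(weights, x))
    swy = sum(w * yi for w, yi in zip(weights, y))
    swxx = sum(w * xi * xi for w, xi in zip(weights, x))
    swxy = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    b1 = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    return (swy - b1 * swx) / sw

# Hypothetical studies: observed d is pure small-study bias (1.5 * SE),
# the pattern selection for significance produces when the true effect is zero.
se = [0.10, 0.20, 0.30, 0.40]
d = [1.5 * s for s in se]
w = [1 / s**2 for s in se]          # inverse-variance weights

pet = wls_intercept(se, d, w)                    # PET: regress d on SE
peese = wls_intercept([s**2 for s in se], d, w)  # PEESE: regress d on SE^2
print(abs(pet) < 1e-9)  # True: PET recovers the true effect of zero
```

The pitfall described above arises at the decision step: if the PET intercept is wrongly judged significant, the less conservative PEESE intercept is reported, and any miscoding that weakens the effect-size/SE correlation pushes the analysis in that direction.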
I also showed in a z-curve analysis that the false positive risk for published ease-of-retrieval studies is high because the expected discovery rate is low and the file drawer of unpublished studies is large. To reduce the false positive risk, I recommend adjusting the significance level to alpha = .005, which is consistent with other calls for more stringent criteria to claim discoveries (Benjamin et al., 2017). Based on this criterion, neither the original studies nor any direct replications of the original studies were significant. A few variations of the paradigm may have produced real effects, but pre-registered replication studies are needed to examine this question. For now, ease of retrieval is a theory without credible evidence.
For many social psychologists, these results are shocking and hard to believe. However, the results are by no means unique to the ease-of-retrieval literature. It has been estimated that only 25% to 50% of published results in social psychology can be replicated (Schimmack, 2020). Other large literatures such as implicit priming, ego-depletion, and facial feedback have also been questioned by rigorous meta-analyses and large replication studies.
For methodologists, the replication crisis in psychology is not a surprise. They warned for decades that selection for significance renders significant results insignificant (Sterling, 1959) and that sample sizes are too low (Cohen, 1962). To avoid similar mistakes in the future, researchers should conduct continuous power analyses and bias tests. As demonstrated here for the assertiveness paradigm, bias tests ring the alarm bells from the start and continue to show bias. In the future, we do not need to wait 40 years to realize that researchers are chasing an elusive phenomenon. Sample sizes need to be increased or the research needs to stop. Amassing a literature of 200 studies with a median sample size of N = 53 and 8% power is a mistake that should not be repeated.
Social psychologists should be the least surprised that they fooled themselves into believing their results. After all, they have demonstrated with credible studies that confirmation bias has a strong influence on human information processing. They should therefore embrace open science, bias checks, and replication studies as the safeguards needed to minimize confirmation bias and make scientific progress.