All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish between studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Subjective Wellbeing – Chapter 08

Life-Events, Adaptation, and SWB

Summary

Chapter 8 examines whether major life events produce lasting changes in subjective well-being. It begins with adaptation theory, especially the “hedonic treadmill” idea, which claims that people quickly return to their baseline level of happiness after good or bad events. The chapter argues that this view is too pessimistic. People do adapt to some changes, but not all. Life circumstances can have lasting effects, especially when they affect important goals, daily experiences, income, status, relationships, or health.

The chapter distinguishes two mechanisms that can make gains fade over time. First, aspirations can rise. As people get better housing, higher income, or newer products, their standards also increase, so satisfaction may not rise much. Second, emotional reactions are often strongest when circumstances change. A new house or improved condition may feel exciting at first, but the emotional boost fades as the new situation becomes normal. These mechanisms differ across life domains. They may be strong for income or housing, but weaker for close relationships, where ongoing engagement continues to matter.

The chapter then reviews evidence on unemployment. Unemployment is one of the clearest examples of a life event with a strong and persistent negative effect on well-being. It reduces income, status, structure, purpose, and social contact. Panel studies show that people do not simply adapt to long-term unemployment. Their well-being remains lower while they are unemployed and improves when they find new work. Much of the effect appears to operate through income and financial satisfaction, but unemployment also affects status and purpose.

Housing shows a different pattern. Moving to a better home increases housing satisfaction, and this improvement can last. However, global life satisfaction often changes little. This does not mean housing is unimportant. Rather, housing may fade into the background of daily life and may be underweighted when people make global life evaluations. Domain-specific measures show that housing conditions matter, especially when they affect daily life through noise, crowding, poor physical conditions, safety, or comfort. The chapter uses housing to show why domain satisfaction is essential for understanding well-being.

Disability provides a more complex case. Early claims that people adapt almost completely to disability were based on weak evidence. Better panel studies show that acquired disability often produces lasting declines in life satisfaction, especially when it involves broader health deterioration. However, people born with disabilities often report higher well-being than those who acquire disabilities later. This supports the ideal-based framework: people born with a disability form their goals and identity around that condition, whereas people who acquire a disability must revise previously formed ideals. Adaptation depends less on time alone than on whether people can build new goals compatible with their changed circumstances.

The chapter gives special attention to relationships. Cross-sectional studies show that partnered people are generally happier than singles, but earlier research underestimated the effect because it focused on marriage rather than partnership. Weddings may produce only temporary increases in well-being, but having a stable partner appears to have a lasting positive effect for most people. Cohabitation and committed partnership matter more than legal marital status. Most people want a partner, and those without one tend to report lower well-being. Happy lifelong singles exist, but they appear to be the exception rather than the rule.

Partnership improves well-being partly through material advantages, because couples often share income and expenses. However, income explains only a small part of the partnership effect. Family satisfaction and relationship quality explain more. Partnership provides emotional support, shared life management, intimacy, and companionship. Sexual satisfaction contributes somewhat, but relationship satisfaction is much more important. Thus, the benefits of partnership are not reducible to money or sex.

The chapter also discusses spousal similarity in well-being. Spouses are more similar in well-being than would be expected from genetics alone, and their well-being tends to change in the same direction over time. This suggests that shared environments, such as household income, housing, relationship quality, and common life events, influence both partners. Some similarity may reflect assortative mating or stable shared conditions, but the evidence points strongly to environmental influences within couples.

The conclusion is that adaptation is real but not automatic. Some changes, such as improvements in housing, may produce lasting domain-specific satisfaction without strongly affecting global life satisfaction. Other events, such as unemployment, divorce, and disability, can reduce well-being until circumstances or goals change. Pursuing happiness through life changes is not futile, but people need to consider how changes will affect everyday life, goal progress, and long-term priorities. Novelty can be exciting, but lasting well-being depends more on stable fit between actual life, personal ideals, and daily experience.

Subjective Wellbeing – Chapter 05

Subjective Wellbeing Around the World

Summary

Chapter 5 examines subjective well-being around the world. It argues that the most informative comparisons are not small differences among the happiest countries, such as Finland, Denmark, or other Scandinavian nations, but the large differences between countries near the top and bottom of the global distribution. These cross-national differences allow researchers to test whether subjective well-being is shaped by material living conditions, social institutions, culture, and historical change.

The chapter begins with the history of cross-national comparisons. Cantril’s ladder was first used in the 1960s to compare life evaluations across nations, and the Gallup World Poll has used the same basic measure since 2008 in more than 140 countries. Comparing countries measured in both periods suggests that average life evaluations have increased over time. This challenges strong claims that happiness is purely relative or that modern life has made people less happy than in the past. At the same time, changes vary across countries, showing that national well-being is not fixed and can shift with social, economic, and political conditions.

World maps of subjective well-being show clear geographic patterns. Scandinavia, Western Europe, Australia, and other wealthy countries tend to score high, whereas many African countries score low. These patterns make some simple explanations unlikely. Climate cannot explain high Scandinavian well-being because other high-ranking countries have very different climates. Romantic ideas that Eastern societies are generally happier than Western societies are also not supported by the data.

The strongest predictor of national differences in subjective well-being is purchasing power. Median income adjusted for purchasing power predicts average life evaluations very strongly, especially when income is analyzed on a logarithmic scale. The relationship is strongest at low income levels, where money helps meet basic needs such as food, shelter, health, and safety. However, the relationship does not disappear in affluent countries. Additional income still predicts higher life evaluation, although with diminishing returns. This directly challenges simple claims that money does not buy happiness.

At the same time, income does not explain everything. Some regions are happier or less happy than their income levels predict. South America and Scandinavia score higher than expected, whereas Arab countries, East Asia, and Eastern Europe score lower. These deviations suggest that culture, institutions, social relationships, response styles, and political conditions may also matter, although their effects are harder to isolate than the effect of income.

The chapter discusses East Asia as one example. East Asian countries often report lower life satisfaction than expected from their purchasing power. Some of this may reflect response styles, because East Asian respondents are more likely to choose moderate response options and less likely to use extreme ratings. Cultural norms about modesty, realism, and self-enhancement may also influence self-reported well-being. However, it remains unclear whether these patterns reflect reporting differences, real differences in experienced well-being, or both.

Latin America shows the opposite pattern: subjective well-being is often higher than income would predict. Some of this may also reflect response style, especially the tendency to use the top category on life-satisfaction scales. But measurement artifacts do not fully explain the pattern. Social support appears to be the most plausible substantive explanation. Latin American cultures may place especially strong emphasis on close relationships, family support, and social integration. Unpaid family work and remittances may also make material living conditions better than GDP alone suggests.

Scandinavian countries consistently rank near the top, but the difference between Scandinavia and affluent Anglo countries is small. The chapter argues against overinterpreting a Scandinavian “secret.” Much of the small advantage appears related to higher financial satisfaction, possibly because of lower inequality, stronger welfare systems, and lower material aspirations. The key point is that Scandinavia scores high largely because it combines high purchasing power with strong social and institutional supports.

Arab countries report lower subjective well-being than expected from income. Financial dissatisfaction explains part of the gap, and lower perceived freedom explains a smaller part, but a substantial difference remains. Religion does not explain the lower scores; if anything, religiosity has a small positive association with well-being. The chapter also notes that life circumstances may have different implications in different cultures. For example, marriage appears more strongly related to well-being in Anglo countries than in Arab countries.

The chapter then turns to migration as stronger evidence for the importance of living conditions. Immigrants’ well-being tends to move closer to the average well-being of the country they move to than to that of their country of origin. Immigrants from poorer countries often show large gains after moving to countries such as Canada. This supports the conclusion that national differences in well-being are not just cultural or personality differences; living conditions matter. At the same time, some cultural patterns remain, because immigrants from Latin America and East Asia show some of the same relative patterns observed in their regions of origin.

Migration studies also show that integration matters. Immigrants who identify with Canada, either while maintaining their original identity or through assimilation, report higher well-being than those who remain separated from Canadian identity or feel marginalized. This suggests that migration improves well-being most when people gain access to better living conditions and also develop a sense of belonging in the new society.

The final sections argue that subjective well-being is not the only criterion for evaluating societies. Life expectancy also matters. A country where people are moderately happy for many decades may be preferable to one where people are very happy for a short life. The concept of happy life-years combines average well-being with life expectancy. Wealthier nations often do better on both dimensions because economic resources support health care, safety, and longer lives.

The chapter ends with sustainability. Modern high well-being often depends on resource-intensive lifestyles that may harm future generations. Subjective well-being research cannot solve this moral and political problem, but it can identify societies that achieve high well-being, long lives, and more sustainable living. Scandinavian countries currently do well on these dimensions, and Costa Rica offers a warmer example of relatively high well-being with lower resource use. The broader conclusion is that money matters greatly for well-being, especially through basic needs, but the best societies must also consider longevity, social conditions, and sustainability.

Publication Bias: The Caliper Test

Replicability Index Encyclopedia: Caliper Test

Caliper Test of Publication Bias

The caliper test is a statistical method for detecting publication bias introduced by Gerber and Malhotra (2008a, 2008b). It tests whether the distribution of test statistics is continuous and approximately locally symmetric around a significance threshold, typically z = 1.96, corresponding to p = .05. The key assumption is that, in the absence of publication bias or p-hacking, the expected density of z-scores in a narrow band just above the threshold should be approximately equal to the expected density just below it. A significant excess of results just above the threshold suggests that researchers or publication processes have shifted results across the boundary, either through selective reporting or analytical flexibility.

Procedure

Published two-sided p-values are converted to absolute z-scores (z = Φ⁻¹(1 − p/2)). A caliper of width w is placed symmetrically around the threshold, creating two bins: one from 1.96 − w to 1.96 (just nonsignificant) and one from 1.96 to 1.96 + w (just significant). Under the null hypothesis of no bias, the expected counts in the two bins are equal. The test is conducted as a one-sided binomial test of whether the proportion of results in the upper bin exceeds 0.50. Gerber and Malhotra (2008a) recommended bandwidths of 5%, 10%, 15%, and 20% of the threshold value. A 10% caliper around z = 1.96, for example, has w = 0.196 and compares counts in the intervals [1.764, 1.96) and [1.96, 2.156].
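The procedure can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used by Gerber and Malhotra; the function names `p_to_z` and `caliper_test` and the example z-scores are hypothetical.

```python
from math import comb
from statistics import NormalDist


def p_to_z(p):
    """Convert a two-sided p-value to an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)


def caliper_test(z_scores, threshold=1.96, pct=0.10):
    """One-sided binomial caliper test.

    pct is the caliper width as a fraction of the threshold
    (0.10 gives the bins [1.764, 1.96) and [1.96, 2.156]).
    Returns (count below, count above, one-sided p-value).
    """
    w = pct * threshold
    below = sum(threshold - w <= z < threshold for z in z_scores)
    above = sum(threshold <= z <= threshold + w for z in z_scores)
    n = below + above
    if n == 0:
        return below, above, 1.0
    # P(X >= above) for X ~ Binomial(n, 0.5)
    p_value = sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n
    return below, above, p_value


# Made-up z-scores for illustration only.
z = [1.80, 1.90, 1.97, 2.00, 2.05, 2.10, 2.15, 2.50, 1.50]
below, above, p_value = caliper_test(z)
print(below, above, round(p_value, 3))  # 2 below, 5 above, p = 29/128
```

Values outside the caliper (here 1.50 and 2.50) are ignored, so the test uses only the results closest to the threshold.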

Applications

Gerber and Malhotra applied the caliper test to leading political science journals (APSR, AJPS) and sociology journals (ASR, AJS) and found strong evidence of publication bias (Gerber & Malhotra, 2008a; Gerber & Malhotra, 2008b). The test was subsequently adopted in economics, most notably by Brodeur, Lé, Sangnier, and Zylberberg (2016) and Brodeur, Cook, and Heyes (2020), who documented significant bunching of test statistics just above conventional thresholds across top economics journals. Berning and Weiß (2016) applied the caliper test to German social science journals, again finding evidence of bias. The test has become a standard tool in the meta-science toolkit for discipline-wide assessments of publication practices.

Strengths

The caliper test has several practical advantages. The logic is intuitive and easy to communicate. It requires only test statistics or p-values, not standardized effect sizes, making it applicable to heterogeneous literatures where effect-size metrics vary across studies and designs. For discipline-wide analyses where studies address different research questions with different effects, the caliper test avoids the strong assumptions about comparability or homogeneity required by many other methods.

Limitations

The caliper test’s local-symmetry assumption is exact for normally distributed z-values only when the noncentrality parameter equals the critical value. For the conventional threshold z = 1.96, this corresponds to a study with approximately 50% power. If power is lower, the expected distribution slopes downward across the threshold, producing more just-nonsignificant than just-significant results. If power is higher, the distribution slopes upward across the threshold, producing more just-significant than just-nonsignificant results even in the absence of publication bias. Thus, deviations from caliper symmetry can reflect the power distribution of studies rather than selective publication or p-hacking.
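The dependence on power can be made concrete by computing the expected share of caliper observations that fall above the threshold when there is no bias at all. The sketch below assumes a single study type with z ~ N(ncp, 1) and ignores the opposite tail; the function names are hypothetical.

```python
from statistics import NormalDist


def power(ncp, threshold=1.96):
    """Probability that z exceeds the threshold (opposite tail ignored)."""
    return 1 - NormalDist(mu=ncp, sigma=1.0).cdf(threshold)


def expected_share_above(ncp, threshold=1.96, pct=0.10):
    """Expected share of caliper observations in the upper bin, no bias."""
    w = pct * threshold
    dist = NormalDist(mu=ncp, sigma=1.0)
    below = dist.cdf(threshold) - dist.cdf(threshold - w)
    above = dist.cdf(threshold + w) - dist.cdf(threshold)
    return above / (below + above)


# Shares below 0.5 mean more just-nonsignificant results,
# shares above 0.5 mean more just-significant results.
for ncp in (0.0, 1.0, 1.96, 2.8, 3.5):
    shares = [expected_share_above(ncp, pct=pct) for pct in (0.05, 0.10, 0.20)]
    print(f"ncp={ncp:4.2f} power={power(ncp):.2f} shares={[round(s, 3) for s in shares]}")
```

The share equals 0.5 only when the noncentrality parameter equals 1.96. It falls below 0.5 for smaller values and rises above 0.5 for larger ones, and the deviation grows with the caliper width.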

This vulnerability becomes more influential with wider caliper intervals. With negative slopes near the threshold, as in low-powered settings, the assumption of local flatness reduces the power of the caliper test to detect publication bias. With positive slopes near the threshold, as in high-powered settings, there are more observations in the interval above the criterion value than below it even without bias. Thus, the caliper test can falsely identify publication bias when the literature has high power or when the mixture distribution slopes upward around the significance threshold. It is therefore unclear whether positive caliper-test results in some applications reflect bias or the expected shape of the z-value distribution.

Schneck (2017) conducted a Monte Carlo simulation comparing the caliper test to Egger’s test, p-uniform, and the test for excess significance (TES). He found that the 5% caliper maintained acceptable false-positive rates but had low power with fewer than 1,000 studies. The 10% and 15% calipers showed inflated false-positive rates at large K, because wider calipers span a larger portion of the density curve where the local-uniformity assumption can break down. Schneck recommended the 5% caliper for discipline-wide analyses with large K. However, a narrow caliper does not solve the problem of truly asymmetric distributions. With large K, even small departures from local symmetry can be estimated precisely, and the caliper test can become significant even if there is no publication bias.

Simulation studies using z-curve’s heterogeneous effect-size framework reveal the problem more starkly. In a simulation with high average power, fewer than 200 studies, and no bias, the caliper test detected bias 100% of the time. Thus, the test should not be interpreted as evidence of publication bias without inspecting the expected or observed shape of the z-value distribution.
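The false-positive problem can be illustrated with a small Monte Carlo sketch. It is not a reproduction of the z-curve simulations described above: it assumes homogeneous power (every study has the same noncentrality parameter), publishes every result, and uses hypothetical settings (k = 2000 studies, a 10% caliper, 200 replications).

```python
import random
from math import comb


def caliper_p_value(z_scores, threshold=1.96, pct=0.10):
    """One-sided binomial p-value for an excess of just-significant results."""
    w = pct * threshold
    below = sum(threshold - w <= z < threshold for z in z_scores)
    above = sum(threshold <= z <= threshold + w for z in z_scores)
    n = below + above
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n


def false_positive_rate(ncp, k=2000, n_sims=200, alpha=0.05, seed=1):
    """Share of bias-free simulated literatures flagged by the caliper test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Every study is "published": there is no selection and no p-hacking.
        z = [rng.gauss(ncp, 1.0) for _ in range(k)]
        if caliper_p_value(z) < alpha:
            hits += 1
    return hits / n_sims


rate_high_power = false_positive_rate(ncp=3.5)   # power of roughly 94%
rate_half_power = false_positive_rate(ncp=1.96)  # power of 50%
print(f"high power: {rate_high_power:.2f}, 50% power: {rate_half_power:.2f}")
```

Under these assumptions the rejection rate stays near or below the nominal 5% when power is 50%, but rises well above it when power is high, even though no result was suppressed or altered.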

This is not merely a calibration problem that can be fixed by adjusting the significance level or caliper width. Narrower calipers can reduce curvature-induced artifacts, but they cannot remove the conceptual mismatch between what the test assumes, local symmetry, and the actual distribution of z-values when the density slopes across the threshold.

This limitation is not shared by all bias-detection methods. Methods that model the full distribution of z-scores, such as z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022), can estimate the expected shape of the z-value distribution under heterogeneous power and selection. The advantage of the caliper test is that it can have high power to detect threshold-related discontinuities in some conditions. Its disadvantage is that it can also provide false evidence of bias when the expected distribution is asymmetric. Therefore, the caliper test should be used together with a plot of the z-value distribution. A positive slope for significant values is a red flag because it violates the local-symmetry assumption of the caliper test.

Summary

The caliper test is a simple, widely used tool for detecting threshold-related publication bias in large literatures. It is most reliable when the expected distribution of test statistics is approximately locally symmetric around the significance threshold in the absence of bias. In literatures where the z-value distribution slopes across the threshold, whether because of high power, low power, or heterogeneous true effects, the test can mistake the expected shape of the distribution for evidence of selective publication or p-hacking. This problem is especially relevant in discipline-wide analyses in the social sciences, where studies often address different hypotheses, use different designs, and have heterogeneous statistical power. Researchers using the caliper test in such settings should interpret positive results with caution and consider model-based alternatives that account for the expected shape of the z-score distribution.

References

Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology, 6, MP.2020.2720.

Berning, C. C., & Weiß, B. (2016). Publication bias in the German social sciences: An application of the caliper test to three top-tier German social science journals. Quality & Quantity, 50, 901–917.

Brodeur, A., Cook, N., & Heyes, A. (2020). Methods matter: p-hacking and publication bias in causal analysis in economics. American Economic Review, 110(11), 3634–3660.

Brodeur, A., Lé, M., Sangnier, M., & Zylberberg, Y. (2016). Star Wars: The empirics strike back. American Economic Journal: Applied Economics, 8(1), 1–32.

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874.

Gerber, A. S., & Malhotra, N. (2008a). Do statistical reporting standards affect what is published? Publication bias in two leading political science journals. Quarterly Journal of Political Science, 3(3), 313–326.

Gerber, A. S., & Malhotra, N. (2008b). Publication bias in empirical sociological research: Do arbitrary significance levels distort published results? Sociological Methods & Research, 37(1), 3–30.

Gerber, A. S., Malhotra, N., Dowling, C. M., & Doherty, D. (2010). Publication bias in two political behavior literatures. American Politics Research, 38(4), 591–613.

Schneck, A. (2017). Examining publication bias — a simulation-based evaluation of statistical tests on publication bias. PeerJ, 5, e4115.