The latest World Happiness Report gives Jonathan Haidt a megaphone to continue his narrative that declining wellbeing among young people can be blamed almost entirely on social media use (Chapter 3). Chapter 4 shows how assessments of the evidence are biased; the (US) American Psychological Association (APA) is the most biased, but the (US) Surgeon General's report is not much better. Policy is being made on biased readings of the evidence (fortunately, 16-year-olds will find ways around it, just as they found ways to watch R-rated movies in the old days).
Chapter 3
Chapter 3 is written by Haidt, and the website gives a helpful warning that it is a 61-minute read. That is like asking somebody to listen to 24 hours of Fox News to find out how they misrepresent everything to support a criminal president of the United States, a country where young people are getting less happy. I do not have time for that. Instead, I asked Claude (not war-supporting ChatGPT) to summarize and evaluate the chapter. Importantly, this is not generic Claude. This is a Claude project that knows everything about SWB that I have written in my textbook on this topic. Yes, unlike Haidt, I have studied SWB for 30 years. Is it unbiased? No. But it is an antidote to Haidt’s noise machine.
My favorite quote from Claude’s review: “The chapter is extraordinarily effective as persuasion and deeply problematic as science.” I don’t have the scientific answer, but I know that social media are at best a small piece of the puzzle of why young people in Canada and the USA are suffering. Distracting from these problems by blaming social media is harmful, and we need to find real answers to create real solutions.
The Structure
The chapter is organized as a legal prosecution — they literally say “we are making the case for the prosecution” and ask readers to use “preponderance of the evidence” as the standard. They present seven “evidence lines” organized as 15 “exhibits”:
Lines 1–3 (Testimony): Surveys of young people showing regret, surveys of parents/teachers/clinicians expressing concern, and leaked internal corporate documents showing companies knew about harms.
Lines 4–7 (Academic evidence): Cross-sectional studies showing that heavy users have elevated depression risk, longitudinal studies showing that social media use at Time 1 predicts depression at Time 2, RCTs showing that reducing social media use improves mental health, and natural experiments showing that broadband rollout worsened mental health.
They then extrapolate to population-level harm estimates (millions of US teens affected) and argue this is sufficient to answer the “historical trends question” — that social media caused the youth mental health decline.
The Rhetorical Strategy
The chapter is extraordinarily effective as persuasion and deeply problematic as science. The courtroom framing is the tell. In a courtroom, the prosecution presents one side of the case. The defense presents the other. Haidt has cast himself as prosecutor and explicitly asks readers to lower their evidentiary standard. He then preemptively addresses critics not by engaging with their arguments substantively, but by saying their findings support his case once “unblended.”
The “unblending” argument is his central methodological move: whenever a study finds null or small effects, Haidt argues this is because the researchers combined too many outcomes, populations, or technologies. When you restrict to girls, to internalising symptoms, and to social media specifically, the effects get larger. This is a legitimate analytical point — but it’s also a form of specification searching. You can always find larger effects by narrowing the sample and outcome to where the signal is strongest. The question is whether those specifications were preregistered or selected post hoc.
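To see why this matters, here is a minimal simulation, with toy numbers of my own and no data from any study Haidt cites, of what scanning two dozen post-hoc subgroup-by-outcome specifications does to pure noise:

```python
import numpy as np

# Toy illustration of specification searching: no real data, no true
# effects anywhere. We scan 24 subgroup-by-outcome combinations and
# keep the largest correlation we stumble on.
rng = np.random.default_rng(0)
n = 2000

x = rng.normal(size=n)  # "social media use" (pure noise)

best_r = 0.0
for spec in range(24):
    subgroup = rng.random(n) < 0.25   # e.g., "girls aged 11-13"
    y = rng.normal(size=n)            # an outcome unrelated to x
    r = np.corrcoef(x[subgroup], y[subgroup])[0, 1]
    best_r = max(best_r, abs(r))

print(f"largest |r| across 24 specifications: {best_r:.2f}")  # ~ .10
```

With no true effect anywhere, the best of 24 post-hoc specifications lands around |r| ≈ .10, which is the magnitude of the headline social media correlations. Preregistration is what separates a real subgroup signal from this kind of artifact.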
Critical Problems from Your SWB Framework
1. No personality controls anywhere. Not a single study Haidt cites controls for Neuroticism or Depressiveness. Your Chapter 7 work shows these facets explain ~50% of SWB variance. A high-Neuroticism adolescent girl is simultaneously more likely to use social media heavily (rumination, reassurance-seeking), report depression, report body image problems, and perceive social media as harmful. Without personality controls, every “line of evidence” is confounded by the same omitted variable. (A small simulation of exactly this confound follows the list below.)
2. The testimony evidence is circular. Lines 1–3 amount to: people believe social media is harmful. But people’s causal attributions about their own mental health are unreliable — that’s one of the core lessons of your measurement chapter. If you asked depressed people in the 1990s what caused their depression, many would have blamed television, or music, or whatever was culturally salient. The fact that Meta’s own employees believed their products were harmful is concerning, but it’s still testimony about beliefs, not evidence of causation. Haidt himself acknowledges victims “could be mistaken” but then proceeds to treat the testimony as though it establishes the case.
3. The RCT evidence is weaker than presented. The headline finding is g = 0.19 for depression from social media reduction experiments. But these are short-term studies (often 1–3 weeks) with self-selected participants who know they’re in a social media reduction study. Demand effects are enormous. If you tell someone “we’re testing whether social media is bad for you, please reduce your usage,” the treatment group has a strong expectation of feeling better. There’s no placebo control for that expectation. Compare this to the evidence standards your textbook applies: lottery studies for money effects, migration quasi-experiments for cross-national differences. The RCT evidence here doesn’t come close.
4. The population extrapolations are absurd. Haidt takes an effect size from a 2-week reduction experiment on college students and extrapolates to “a decline in the prevalence of clinical depression by roughly one-third.” This assumes the experimental effect scales linearly to the population, that it would persist beyond two weeks, that college students are representative of adolescents, and that the experimental context (individual reduction while peers stay online) maps onto population-wide changes. None of these assumptions are justified.
5. The “historical trends question” remains unanswered by his own admission — then he answers it anyway. He repeatedly says the product safety question is easier than the historical trends question, and that his evidence is mainly about the former. But then he pivots and says the scale of individual harm “makes it more plausible” that social media caused the population trends. This is a non sequitur. Alcohol causes individual harm at massive scale too, but alcohol consumption hasn’t changed enough to explain any recent population trend in youth mental health.
6. He never mentions housing, employment, economic prospects, or any structural factor. In a 61-minute chapter organized as an exhaustive prosecution of social media, there is zero engagement with the alternative suspects you identified. No mention of housing affordability. No mention of the gig economy. No mention of declining real wages for young people. No mention of student debt. The word “economic” appears only in passing references to other studies’ control variables. This is exactly the bad criminal investigation you described — single suspect, no alibi checking.
7. The asymmetric framing of evidence standards is self-serving. Haidt argues that the cost of wrongly concluding social media is unsafe is low (kids just wait until 16), while the cost of wrongly concluding it’s safe is “catastrophically high.” This is a policy argument dressed up as an epistemological one. By the same logic, the cost of wrongly attributing the youth mental health crisis to social media — while ignoring structural economic causes — is also catastrophically high, because it diverts political attention from housing, labor market, and inequality reforms that could actually help.
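To make point 1 concrete, here is a minimal sketch of the omitted-variable problem, with hypothetical path coefficients chosen only for illustration: a latent Neuroticism factor drives both social media use and depression, and there is no causal path between use and depression at all.

```python
import numpy as np

# Minimal sketch of the omitted-variable problem in point 1.
# Path coefficients (.4, .6) are hypothetical, for illustration only.
rng = np.random.default_rng(1)
n = 100_000

neuro = rng.normal(size=n)                               # latent Neuroticism
use   = 0.4 * neuro + np.sqrt(1 - 0.4**2) * rng.normal(size=n)
dep   = 0.6 * neuro + np.sqrt(1 - 0.6**2) * rng.normal(size=n)
# Note: depression has NO causal path from social media use here.

r_raw = np.corrcoef(use, dep)[0, 1]

# Partial correlation controlling for Neuroticism (residualize both).
res_use = use - neuro * (use @ neuro) / (neuro @ neuro)
res_dep = dep - neuro * (dep @ neuro) / (neuro @ neuro)
r_partial = np.corrcoef(res_use, res_dep)[0, 1]

print(f"raw r(use, depression)         = {r_raw:.2f}")     # ~ .24
print(f"partial r, Neuroticism removed = {r_partial:.2f}")  # ~ .00
```

A raw correlation of about .24 emerges from a design with zero causal effect of social media, and it vanishes once Neuroticism is partialled out. Every study without personality controls is vulnerable to exactly this.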
Bottom Line
The chapter is a 61-minute advocacy document by someone who wrote a bestselling book with the same thesis and has enormous reputational stakes in being right. The evidence he presents is real but selectively curated, never contextualized against alternative explanations, and extrapolated well beyond what the data support. From the standards your textbook establishes — multi-method designs, halo bias correction, causal evidence hierarchies, personality controls — this chapter wouldn’t pass peer review in the form it’s presented.
Chapter 4
Alright, let me read the WHR26 Chapter 4 and evaluate it against the standards we’ve been applying to the other chapters. Night and day compared to the Haidt chapter.
This is a proper piece of meta-science rather than advocacy. The authors (this reads like Przybylski’s group) do exactly what a rigorous analysis should do: they ask why three organizations looking at the same literature reached different conclusions, and then systematically show the answer isn’t what evidence was selected but how it was synthesized and communicated.
The key findings are damning for the APA and Surgeon General but in a measured, empirically grounded way. Only 4 out of 617 unique papers appeared in all three reports — yet the methodological and thematic profiles of the cited evidence were statistically indistinguishable. So the divergence is entirely in the synthesis layer: citation accuracy, engagement with disconfirmatory evidence, acknowledgment of limitations, and calibration of certainty to what the evidence actually supports. NASEM comes out well; the APA comes out badly (20% self-citation by advisory panel members, citation clumping, outright mismatches between cited studies and the claims they’re supposed to support); the Surgeon General falls in between but leans toward rhetorical urgency over epistemic humility.
From your book’s perspective, several things stand out.
The chapter’s core argument — that the same evidence base can yield opposite policy conclusions depending on synthesis practices — is essentially a measurement-and-method-variance story applied to the policy level. It’s the same logic as your Chapter 3 telescope metaphor: the instrument (here, the synthesis process) determines what you see, and if the instrument is distorted, the conclusions are distorted. The chapter makes this point more carefully and with better evidence than anything in the Haidt chapter, which was itself an example of the very problem this chapter diagnoses.
The six quality indicators they identify (citation accuracy, contextualisation, limitation acknowledgment, engagement with disconfirmatory evidence, calibrated language, process transparency) are essentially a checklist for distinguishing science from advocacy — which maps directly onto your book’s concern about distinguishing empirical claims from conceptual positions. The Weiss “evidence as ammunition” framing is particularly apt.
The 17% causal-inference figure is striking and worth remembering. Only 59 of 355 on-topic studies used designs that could plausibly support causal claims. That’s the evidence base on which population-level policy is being demanded. Compare that with the causal evidence hierarchy in your Chapter 5 (migration quasi-experiments) and Chapter 6 (lottery studies) — those chapters insist on strong designs before drawing strong conclusions. The social media field is drawing strong conclusions from weak designs, and this chapter documents that quantitatively.
The self-critical limitations section is genuine rather than perfunctory — they flag that their single-coder qualitative analysis is a limitation, that they can’t access internal deliberative processes, and that their classification system produced some inconclusive codes. This is what intellectual honesty looks like in contrast to the other chapters.
Two observations that are more critical. First, the chapter is diplomatically careful to a fault. The APA findings — 20% self-citation, citation mismatches where the cited study literally doesn’t address the claim being made — are described as “concerning” and “worrying, even if rare.” In your blog voice, you’d call that what it is: shoddy scholarship that wouldn’t pass peer review*. The diplomacy is understandable given the WHR context, but it does somewhat blunt the force of findings that are genuinely alarming.
Second, the chapter doesn’t connect its findings back to the WHR’s own social media chapters. It evaluates external organizations’ synthesis practices but doesn’t turn the lens on whether the WHR itself — including the Haidt chapter we just read — meets these same six standards. That’s a missed opportunity, or perhaps a diplomatic impossibility given they’re in the same volume.
But overall: yes, much more balanced. This is what evidence-based analysis looks like when it’s done by people who take methodology seriously.
* Correction: This is not correct. I would not assume that peer review is quality control. Biased work constantly passes pre-publication peer review. This is the reason why we need open post-publication reviews like this one, just as we have thousands of book reviews with ratings ranging from 1 to 5.
Chapter 5
This is Twenge’s chapter, and it’s a mixed bag — competent descriptive empirics wrapped in a rhetorical frame that does some of the very things Chapter 4 just criticized.
The strength is the data. PISA gives you nationally representative samples of 15–16-year-olds across 47 countries with the same measures, which is a genuine advantage over the US/Canada/UK-dominated literature the chapter itself flags. The regional breakdowns are useful, and the finding that the social media–life satisfaction association is essentially null for boys outside of English-speaking countries and Western Europe is important — it’s the kind of finding that complicates the “social media is harming youth” narrative rather than confirming it.
The curvilinearity point is well taken and the observation about greater variance among non-users and heavy users is genuinely interesting. Both non-users and heavy users show elevated rates of both very low and very high life satisfaction, which suggests these are heterogeneous groups — some non-users are thriving, some are isolated; some heavy users are socially engaged, some are compulsively scrolling. That’s a finding that resists simple policy prescriptions, and the chapter deserves credit for reporting it.
Now the problems.
The relative risk versus linear r argument is the chapter’s rhetorical centerpiece, and it’s doing a lot of work that isn’t fully warranted. Yes, it’s true that linear r is poorly suited for curvilinear associations, and the polio/aspirin/seatbelt analogies are vivid. But those analogies are misleading in a fundamental way: polio vaccination has a known causal mechanism, aspirin has RCT evidence, and seatbelts have physics. Social media use and life satisfaction have a cross-sectional correlation from a single time point. Relative risk sounds more impressive than r = .10, but repackaging a cross-sectional association as a relative risk doesn’t make it causal. A 50% increase in “risk” of low life satisfaction among heavy users is still a 50% increase in a cross-sectional association that cannot distinguish cause from selection. The chapter acknowledges this in one sentence near the end (“this research is correlational and, thus, cannot rule out reverse causation or third variables”) but spends several paragraphs building the rhetorical frame that makes the effects sound large and practically important before that caveat appears.
This is exactly the “calibrating certainty to conclusion strength” problem that Chapter 4 just documented. The chapter front-loads the impressive-sounding relative risk statistics and buries the causal limitations.
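The repackaging move is easy to demonstrate. Here is a toy simulation, with hypothetical numbers calibrated only to mimic an r = .10 cross-sectional association, that cuts both variables at their top deciles and re-expresses the same association as a relative risk:

```python
import numpy as np

# Toy illustration of repackaging r = .10 as a relative risk.
# All numbers are hypothetical; nothing here is the chapter's data.
rng = np.random.default_rng(2)
n = 1_000_000
r = 0.10

use = rng.normal(size=n)
# "dissatisfaction" correlated r = .10 with use, nothing more
dissat = r * use + np.sqrt(1 - r**2) * rng.normal(size=n)

heavy  = use    > np.quantile(use,    0.90)  # "heavy users": top decile
low_ls = dissat > np.quantile(dissat, 0.90)  # "low life satisfaction"

risk_heavy = low_ls[heavy].mean()
risk_rest  = low_ls[~heavy].mean()
print(f"relative risk = {risk_heavy / risk_rest:.2f}")  # ~ 1.4
```

The same r = .10 that sounds trivial as a correlation becomes roughly a 40% higher “risk” of low life satisfaction once both variables are dichotomized at their tails. Nothing about causation, confounding, or selection has changed; only the packaging has.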
From your book’s measurement framework, several issues stand out. The social media measure is a single item asking about “browsing social networks” on a “typical weekday,” which is essentially asking adolescents to estimate their own screen time, precisely the kind of self-estimate the chapter’s own literature review acknowledges adolescents are poor at (line 30 flags this limitation for the field; the chapter then uses exactly such a measure). The life satisfaction measure is a single 0–10 item. Both are self-reported by the same person at the same time. Your Chapter 3 telescope metaphor applies: we’re looking through a fairly blurry instrument here, and the chapter never discusses the validity limitations of these specific measures.
The response style point the chapter raises almost in passing (line 438 — some respondents may routinely choose extreme responses, linking heavy use and 10/10 satisfaction artifactually) is actually a serious methodological concern that deserves much more than a sentence. If extreme responding is a confound, it could explain the elevated very-high-satisfaction rates among heavy users — which is one of the chapter’s most interesting findings. The chapter identifies the problem and then moves on without grappling with it.
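To show how serious this could be, here is a toy simulation (all parameters hypothetical) in which usage and life satisfaction are truly independent, but 10% of respondents habitually pick the scale endpoints on both items:

```python
import numpy as np

# Toy illustration of the extreme-response-style confound flagged at
# line 438. All parameters are hypothetical; no true relationship
# between usage and life satisfaction exists in this simulation.
rng = np.random.default_rng(3)
n = 100_000

extreme = rng.random(n) < 0.10  # 10% habitual extreme responders

# Usage category (0 = none ... 4 = very heavy), independent of
# satisfaction; extreme responders pick an endpoint.
usage = rng.integers(1, 4, size=n)            # moderate answers: 1-3
usage[extreme] = rng.choice([0, 4], size=extreme.sum())

# Life satisfaction 0-10, also independent of usage; extreme
# responders again pick an endpoint.
ls = rng.integers(3, 9, size=n)               # moderate answers: 3-8
ls[extreme] = rng.choice([0, 10], size=extreme.sum())

for label, mask in [("non-users", usage == 0),
                    ("moderate ", (usage > 0) & (usage < 4)),
                    ("heavy    ", usage == 4)]:
    print(f"P(10/10 | {label}) = {(ls[mask] == 10).mean():.2f}")
```

Response style alone, with zero true relationship, reproduces elevated 10/10 rates among both non-users and heavy users. That is the chapter’s most interesting descriptive finding, and it could be manufactured by a measurement artifact the chapter mentions in a single sentence.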
The absence of any control variables is glaring. No personality. No family income (the chapter acknowledges PISA lacks this). No in-person social interaction (also acknowledged). No school belonging — which is ironic given that the WHR’s own Chapter 3 found school belonging effects 4–6 times larger than social media effects. The chapter is essentially reporting raw bivariate associations between two self-report variables measured at a single time point, with no covariates, and then framing them in relative risk language that implies practical importance.
There’s also a notable asymmetry in how the chapter handles regional variation. When the association is significant (girls in Western Europe, English-speaking countries), it gets highlighted. When it’s null (boys in Asia, Latin America, Middle East/North Africa), it gets reported but with less interpretive weight. The null findings are actually the majority pattern for boys — in most of the world, the association between social media use and boys’ life satisfaction is essentially zero. A more balanced reading would lead with that finding: for most adolescent boys globally, there is no meaningful association between social media use and life satisfaction.
The self-citation pattern is worth noting. Twenge cites herself (Twenge & Hamilton 2022, Twenge & Farley 2021, Twenge & Martin 2020, Twenge et al. 2018) repeatedly — four of the roughly 20 references are her own work, including the paper that introduced the relative risk framing. This isn’t disqualifying, but given that Chapter 4 just flagged the APA’s 20% self-citation rate as a concern, it’s notable.
Bottom line: this chapter is more honest about the data than the Haidt chapter — it reports null findings for boys in most regions, it shows the curvilinear pattern, and it flags the variance issue among non-users and heavy users. But the rhetorical packaging oversells the findings. The relative risk framing makes cross-sectional associations sound like established health risks, the causal limitations are acknowledged but not given proportionate weight, and the absence of any covariate adjustment means we have no idea how much of these associations would survive basic controls for personality, socioeconomic status, or social engagement. Chapter 4’s own standards — citation accuracy, engaging with complexity, calibrating certainty to evidence strength — would give this chapter a middling grade: better than the APA and OSG reports, but not meeting the NASEM standard.
Personal note: If young men’s happiness is decreasing nearly as much as young women’s, while the social media effect is strongly gendered and often absent for males, that points directly to other factors that are decreasing happiness for young people. The same line of reasoning helped rule out bad air quality as the cause of the mid-century rise in lung cancer: men and women breathed the same air, yet lung cancer struck mostly men, and we now know the reason was that men smoked and women, at the time, largely did not.