Measures are necessary because unmeasured work is easily ignored. If universities want to reward merit, they need some way to identify it. The problem is that every measure can be gamed. A recent example comes from prediction markets: French authorities investigated possible tampering with a weather sensor at Charles de Gaulle Airport after unusual temperature spikes coincided with profitable bets on Paris temperatures on Polymarket. The case illustrates a general principle: once a number has consequences, people have incentives to influence the number rather than the underlying reality. (WSKG)
The same problem exists in academia. Professors’ work is difficult to evaluate. Research quality, originality, theoretical importance, mentorship, service, and long-term influence are not easily reduced to a single number. Yet universities still need to make decisions about hiring, promotion, salaries, awards, and prestige. This need has encouraged the use of quantitative performance indicators, such as the number of publications, total citations, and the H-index.
These measures capture something real, but each creates distortions. Publication counts reward volume, even when many papers have little influence. Citation counts reward visibility and cumulative attention, but can be inflated by a few highly cited papers, large collaborations, review articles, field size, and self-citation. The H-index improves on both by requiring a body of cited work, but it has its own blind spot: it ignores how many low-impact papers were produced alongside the influential ones.
It is well known that academic metrics are biased because researchers can influence both citation counts and publication counts. Self-citations are relatively easy to detect, and can be excluded if necessary. Citations from close peer networks are harder to evaluate. Mutual citation practices, honorary coauthorship, strategic review writing, conference visibility, social media promotion, and aggressive self-marketing can all increase citation counts without necessarily reflecting greater intellectual merit.
Publication counts are even easier to inflate. The simplest strategy is to divide research into many small papers, submit weak papers repeatedly until they are accepted somewhere, or publish in journals with low rejection rates. In some cases, this includes pay-to-publish outlets that rely more on publication fees than on rigorous peer review. These practices do not imply that all highly productive scholars are gaming the system, but they show why raw publication counts are poor indicators of quality.
In principle, the problem could be solved by independent evaluations of scientific quality. In practice, this is difficult. Quality is multidimensional: a paper may be technically rigorous but unimportant, original but wrong, influential but misleading, or methodologically imperfect but theoretically generative. Expert judgment is necessary, but it is also subjective, costly, and vulnerable to reputation, ideology, personal networks, and disciplinary fashions.
As a result, universities rely on imperfect quantitative proxies. These proxies are attractive because they are easy to count, but they are incomplete. They measure visibility and productivity more easily than they measure quality. The challenge is not to find a perfect metric, but to design metrics that are harder to game and that capture dimensions of merit ignored by existing indicators.
This is where the low-impact tail becomes relevant. A publication record with many highly cited papers and few low-cited papers conveys something different from a publication record with the same H-index but hundreds of additional papers that attracted little attention. The conventional H-index ignores this distinction. A quality-adjusted H-index makes it visible.
A long tail of low-impact publications has negative effects on science. It crowds out potentially better work by other researchers. It also consumes resources, especially when publications are supported by publicly funded grants or paid publication fees. It may even hurt the authors themselves. Time spent producing many low-impact articles is time not spent developing fewer, more substantial contributions. Rewarding efficiency may therefore benefit science by shifting incentives away from maximizing publication counts and toward producing work that has durable influence.
The proposed index is simple. It requires only two pieces of information: the total number of publications, N, and the H-index, H. The H-index rewards a sustained body of impactful work. It does not solve the problem that citations are only a proxy for quality, but that is not the purpose of the new index. The purpose is to adjust the H-index for publication efficiency.
Efficiency can be defined as the proportion of publications that belong to the H-core:
Efficiency = H / N
A researcher with an H-index of 100 and 400 publications is more efficient than a researcher with the same H-index and 1,000 publications. Both have the same citation core, but the second author needed many more publications to achieve it.
John P. A. Ioannidis is a prominent Stanford scientist and meta-scientist whose work has focused on improving scientific credibility and reducing false findings. He has an impressive H-index of 190 and an even more impressive total of 1,396 publications. Based on traditional metrics, this is an extraordinary record.
However, the record looks different when efficiency is taken into account. To achieve an H-index of 190, Ioannidis produced 1,396 publications. His efficiency is therefore:
190 / 1,396 = .136
Thus, 13.6% of his publications are in the H-core. His quality-adjusted H-index is:
190² / 1,396 = 25.9
Ed Diener was one of the most influential social and personality psychologists and helped establish the scientific study of subjective well-being. His H-index is 126, which is lower than Ioannidis’s H-index of 190. However, Diener produced 493 publications. His efficiency is therefore:
126 / 493 = .256
Thus, 25.6% of his publications are in the H-core, nearly twice Ioannidis’s efficiency. His quality-adjusted H-index is:
126² / 493 = 32.2
The conventional H-index ranks Ioannidis higher. The QH-index ranks Diener higher because Diener achieved a large citation core with a much smaller publication record.
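For readers who want to check these numbers, here is a minimal R sketch of the two formulas; the function name qh_index is just an illustration and is not part of any package:

qh_index <- function(H, N) {
  efficiency <- H / N   # proportion of publications in the H-core
  QH <- H^2 / N         # quality-adjusted H-index (H times efficiency)
  c(efficiency = efficiency, QH = QH)
}
qh_index(H = 190, N = 1396)  # Ioannidis: efficiency = .136, QH = 25.9
qh_index(H = 126, N = 493)   # Diener:    efficiency = .256, QH = 32.2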
Conclusion
No simple quantitative indicator is a perfect measure of merit. Still, universities and funding agencies need some measures to allocate limited resources. The H-index was designed to avoid rewarding researchers merely for producing many low-impact publications. It improved on simple publication counts by requiring a body of cited work. Yet the H-index still has a blind spot: once the H-core is established, additional low-impact publications carry no penalty.
The QH-index addresses this problem. It preserves the central virtue of the H-index by rewarding sustained impact, but it discounts this impact when it is accompanied by a large number of low-impact publications. Publishing more articles is beneficial only when it increases the citation core. Producing a large long tail of low-impact work lowers the score. This corrective may help reduce incentives to publish as much as possible without regard to the quality or influence of the work.
Every self-interested entity in power wants to control public opinion. Billionaires buy newspapers, not to make more money, but to use their money to push their personal agenda. Totalitarian governments control access to free information to keep their citizens uninformed. The same human behavior is also visible in science, but it is often ignored.
British lords invented the “peer” (not you and me, but other lords) review system when they engaged in scientific debates as a hobby. Today, science is a billion dollar industry and scientists are self-interested actors in this system. Closed peer-review is still used to sell the public the impression that scientists control themselves to ensure that published articles meet the highest standards of scientific research. In reality, the closed peer-review system is used to control information and repress criticism.
The ability to influence the information that gets the stamp of peer-review approval is also the main motivation to take on the thankless job of editor. The only reward is the power to decide which of the many submissions get published. High rejection rates are used to claim rigorous quality control, but in reality, they give editors power to influence the narrative.
The problem is amplified at journals that focus on a specific narrow topic. These journals were often created by scientists who were not able to publish their work in other journals because their work was not considered important by the editors of those journals. For example, Cognition and Emotion was created in 1991 because psychology shunned research on emotions; even after the affective revolution in the 1980s, it was difficult to publish emotion research in mainstream psychology journals.
Creating a journal to publish important work is itself a positive response to censorship. Rickard Carlsson and I also used this approach to make it easier to publish research on meta-psychological topics that were difficult to publish elsewhere. However, the danger is that oppressed groups become oppressors when they gain power. And closed peer-review gives editors at these new journals the power to control the narrative, except that it is now their narrative and their self-interests that decide what gets published. The only way to avoid this trap is to dismantle the power structure. That is what Rickard did with Meta-Psychology. First, articles are not rejected. They are improved until they meet basic scientific standards. Thus, there is no tool to suppress work because it is "not novel enough," "only a small increment," or "outside of the scope of this journal," and no desk rejections with a note that the journal just cannot publish all of the important work that is done. The real reason is often that the editor did not like a paper.
In short, closed peer-review is not what the general public thinks it is. Rather than ensuring that research meets basic scientific standards, it is used to reward people for following the party line and to punish people who want to publish critical work.
Open Science Reforms
In psychology, the academic discipline I know because I have worked in it for over 30 years, the problem of censorship became apparent during the replication crisis in the 2010s. Peer-review had failed to ensure that published results are scientifically valid. Lack of training and understanding of science itself was partly to blame, but the bigger reason was that peer-reviewers were happy to publish bad research because they were doing the same bad research and were interested in publishing results that benefited their own work. Yes, I am talking about the implicit revolution (Greenwald's words, not mine) that seemed to show that much of human behavior is caused by mindless responses to situational cues without people even noticing it. Call it implicit, automatic, or unconscious: experiment after experiment seemed to support these claims. In reality, research on the unconscious worked very much like Freud's model of unconscious processes. Undesirable results were repressed, and only results that supported researchers' claims were published. This became apparent after Bem even showed time-reversed unconscious processes, which nobody was willing to believe. When other studies were replicated, they also failed to support the original claims, and the implicit revolution imploded. Peer-review had failed as a quality control mechanism. Rather, censorship had created a bubble of false findings. It doesn't take a psychoanalyst to realize that this realization was painful and that many older researchers resorted to defense mechanisms to avoid the emotional consequences of realizing that their achievements were illusory.
Open science requires open sharing of all findings and arguments. It also requires that conclusions are consistent with the evidence and logically coherent. This open exchange cannot happen in a closed peer-review system where editors control the narrative. The new quality assurance is not "peer-reviewed," but "open peer reviews" and publication of all arguments on both sides. It is also important to get rid of journal rankings as a way to evaluate the quality of research. Journal rankings only ensure that editors of prestigious journals have even more power to control the narrative. I experienced this first hand. When I submitted my first critique of the Implicit Association Test to the prestigious journal "Perspectives on Psychological Science," the editor rejected it. When I tried again several years later, a new editor accepted it. Neither decision was based on the quality of the work or the argument; it was just a personal preference.
A Scientific Utopia
Most editors also do not read the articles they handle or provide their own comments. The bias is often introduced by picking reviewers who will like or dislike a paper (I know, I was Ed Diener's henchman, his words, not mine). So, editors really do not add anything of value. Even current AI (large language models) is better able to evaluate the scientific merits of a paper, and we can replace human editors with AI: a faster, more cost-effective, and less biased way to make decisions about publications that are essential for young scientists' careers.
Scientific progress has been slow because humans are not disinterested processors of information. Once they have concluded that some belief is true, their information processing is biased towards verifying that truth rather than looking for disconfirming evidence.
Willful ignorance is the selective processing of confirmatory information and the avoidance of sources that may expose the believer to contradictory information. However, sometimes challenging information is unavoidable. Scientists who want to publish their work are constantly exposed to negative comments. When confronted with criticism, there are a number of strategies that serve different purposes. A constructive response examines the validity of the criticism, responds to valid concerns, adjusts claims accordingly, and may still make a useful contribution. A defensive response to valid criticism engages in pseudo-scientific arguments that avoid the key concern and leads to an unproductive exchange that cannot have a resolution because the goal is to maintain a false belief.
While critics initiate a discussion about potential errors, the roles are not fixed. Once the criticism is made, the person criticized responds to it and may find errors in the critic's arguments. Now the roles are reversed, and the critic may respond to this criticism in defensive ways, accusing the person being criticized of being defensive. Such an exchange quickly deteriorates into childish shouting of "I am right. You are wrong" at each other. A more mature response is to allow for errors on both sides and carefully examine the arguments. This is the aim of my response to Erik van Zwet's second blog post about z-curve, "More concerns about z-curve."
The Substance
In this second post, Erik reports one new simulation scenario. In that scenario, he points to two problems. The main criticism is that the confidence interval for the Expected Discovery Rate (EDR) does not achieve its nominal 95% coverage. The second concern is that the confidence interval for the null-component weight can collapse to zero width, which he interprets as a sign of instability or misspecification in the internal mixture fit.
The second point is the less important one. Z-curve is a finite-mixture model that approximates the distribution of test statistics using weights on several discrete components. It is well understood that these component weights are not themselves substantively meaningful parameters when the true data-generating process is continuous. Different mixtures can yield nearly identical estimates of the quantities z-curve is designed to recover. For that reason, poor coverage of confidence intervals for individual component weights is not, by itself, a serious problem. In particular, the weight of the zero component is not used in z-curve the way a null-component weight is used in models that directly estimate false positive rates. These intervals appear in the output, but they are not the primary inferential target.
What matters is coverage for the main estimands: the Expected Replication Rate (ERR) and the Expected Discovery Rate (EDR). Erik does not mention that the ERR interval appears to perform adequately in this scenario. Thus, the central substantive criticism is narrower: in this particular simulation setting, the EDR confidence interval appears to undercover.
The Response
The specific scenario assumed that all studies had the same power, which implies not only the same sample size, but also the same population effect size. Brunner and Schimmack (2020) already noted that z-curve can have problems in this situation when the true noncentrality parameter falls between two default components. That is exactly Erik’s scenario: mean power is 32%, corresponding to z = 1.5, midway between the default components at z = 1 and z = 2.
Brunner and Schimmack (2020) did not emphasize this problem because most real datasets show substantial heterogeneity in sample sizes and effect sizes (van Erp et al., 2017). Even direct replications of the same paradigm across labs vary in effect size (Klein et al., 2017). Thus, Erik's critique is based on a known difficult case for z-curve.2.0, but not one that resembles most real applications.
To address this valid concern, z-curve 3.0 was revised to first test for very low heterogeneity. When the data appear unusually homogeneous, the model estimates where a single component would best fit the distribution and then shifts the default grid so that one component is centered near that value. In Erik’s scenario, this places a component near z = 1.5 instead of forcing the fit to choose between z = 1 and z = 2.
The new results are therefore limited to Erik’s specific concern: whether z-curve.2.0 provides adequate coverage for homogeneous data when the true noncentrality parameter falls between two default components.
I validated z-curve 3.0 with the standard simulation code that was used to validate z-curve.2.0 in the Uli simulation design. These simulations across 192 scenarios produced coverage over 95% in most scenarios with just 50 significant results. To simulate a non-centrality parameter of z = 1.5, I used a standardized mean difference of d = .30 and a total sample size of N = 100 (.30 / (2 / sqrt(100)) = 1.5). Figure 1 shows the results for 50,000 significant results. Z-curve predicts the distribution of the non-significant results from the model fitted to the significant results well, the estimates of EDR and ERR are accurate, and the confidence intervals are tight.
Coverage for the ERR and EDR estimates was tested with k = 50, 500, 5,000, and 50,000. All simulations showed coverage over 95% (Results). In short, z-curve.3.0 now also performs well with homogenous data and can do so quickly with the density method.
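To make the simulation setup concrete, here is a minimal R sketch of the data-generating step for this homogeneous scenario (d = .30, N = 100, noncentrality 1.5). It treats the test statistics as normally distributed, which is the approximation used in the calculation above, and it only illustrates how the z-statistics and the true discovery rate arise, not the z-curve fitting itself:

d <- 0.30
N <- 100                                # total sample size of a two-group study
ncp <- d / (2 / sqrt(N))                # noncentrality parameter: .30 / .20 = 1.5
true.power <- pnorm(ncp - 1.96) + pnorm(-ncp - 1.96)  # about .32, the true EDR

set.seed(20240101)
k <- 50000
z <- rnorm(k, mean = ncp, sd = 1)       # homogeneous z-statistics (all studies have the same power)
z.sig <- abs(z)[abs(z) > 1.96]          # only significant results enter the z-curve model
length(z.sig) / k                       # observed discovery rate, close to true.power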
In sum, Erik noted that the default method of z-curve.2.0 fails to produce adequate confidence intervals for the EDR estimate in one simulation with homogenous data and a non-centrality parameter between two default components. I responded to this valid criticism by improving z-curve. Z-curve.3.0 now handles homogeneity and heterogeneity in power well and provides credible confidence intervals.
In the comment section, Erik writes: "Indeed, as I wrote: 'Note that I'm violating the assumption of the z-curve method, but in a way that would be difficult to detect from limited data. That's the point: You can fix this by changing the default "mu grid", but you wouldn't know that.'"
As I showed here, this statement is an error. It is very easy to diagnose the problem by estimating the heterogeneity of the data and then adjusting the grid according to a preliminary model that is more consistent with the data. The ability of z-curve.3.0 to work in this scenario shows that the problem is fixable. Thus, Erik's criticism is invalidated by the evidence. Any new evaluations of the z-curve method need to examine the performance of z-curve.3.0.
It’s sooooo frustrating when people get things wrong, the mistake is explained to them, and they still don’t make the correction or take the opportunity to learn from their mistakes.
This could have been written by me or many other people who are in the business of calling out other people's mistakes. In theory, that would be all scientists because science is supposed to progress by correcting mistakes. However, academia is not science, and many academics don't like to face their own mistakes. The more their status and reputation depend on some claim they made in the past, the more reluctant people are to admit that they were wrong. Max Planck famously declared that science only progresses when pig-headed prominent scientists die and the field can move on. But humans are human, and public admission of mistakes is not a virtue in modern capitalist science that rewards self-promotion and sexed-up research findings.
While it is true that the incentives are against public admission of mistakes, there are notable exceptions. Daniel Kahneman, after he won a Nobel Prize, was able to admit some mistakes. Maybe it takes a Nobel to overcome nagging feelings of self-doubt and defensiveness. I hope not. I have corrected some of my mistakes, but I have to admit that it sometimes took a long time to admit them. At the same time, I have also pushed back against critics who were wrong. The real problem is of course to know the difference. Accepting valid criticism and rejecting invalid criticism requires knowing what is valid and what is invalid. Thus, the question for all actors (critic, person being criticized, and observers) is "Who is right?"
The content of the blog post, however, conflates responding to criticism with responding to an error in one’s work.
Consider the following range of responses to an outsider pointing out an error in your published work:
Look into the issue and, if you find there really was an error, fix it publicly and thank the person who told you about it.
Look into the issue and, if you find there really was an error, quietly fix it without acknowledging you’ve ever made a mistake.
Look into the issue and, if you find there really was an error, don’t ever acknowledge or fix it, but be careful to avoid this error in your future work.
Avoid looking into the question, ignore the possible error, act as if it had never happened, and keep making the same mistake over and over.
If forced to acknowledge the potential error, actively minimize its importance, perhaps throwing in an “everybody does it” defense.
Attempt to patch the error by misrepresenting what you’ve written, introducing additional errors in an attempt to protect your original claim.
Attack the messenger: attempt to smear the people who pointed out the error in your work, lie about them, and enlist your friends in the attack.
As you can see, there is no option to look into the issue, find a mistake in the criticism, point out the mistake, and have the critic apologize and thank the person being criticized for engaging constructively and taking the time to address their concern.
A Case Study
Take Erik van Zwet's post "Concerns about z-curve" as an example. The post contains several mistakes about z-curve. Some mistakes are glaring, like being a reviewer of z-curve and then claiming it was not vetted by experts.
1. The strange fact, not mentioned by van Zwet in his blog post, is that he wrote a favorable review of z-curve when he was a reviewer of z-curve.2.0. Claiming that z-curve was not reviewed by experts implies that he is not an expert, but if he is not an expert, that undermines his critique of z-curve.
2. van Zwet then claims that the z-curve method is based on the assumption that the absolute values of the SNRs have a discrete distribution supported on 0,1,2,…, 6. That statement confuses the default settings of the z-curve package with the z-curve method. Criticizing these defaults is fine, but confusing default settings and a method is not. Especially Bayesian statisticians like Gelman and van Zwet should know the difference.
If somebody uses Gelman's statistical tool, Stan, with bad priors, it leads to bad results. The problem is not the tool, but the prior. I have made this point clear in the comment section and pointed out that z-curve handles specific edge cases where the defaults fail by changing the defaults.
3. In the conclusion, van Zwet generalizes from a single scenario in which z-curve underestimates uncertainty to imply that z-curve is always unreliable: "In my opinion, statistical methods should be reliable when their assumptions are met. I don't think unreliable methods should be used because no better methods are available."
Once again, this is like saying nobody should use Gelman's Stan program to analyze data because one application resulted in a false conclusion. Nonsensical, unscientific, and clearly a mistake that only Reviewer B would make, because the goal is not to advance science but to be a nasty reviewer for reasons that remain unknown (e.g., sexual frustration, grant application failed, realizing that academia is a waste of time, no hobby, etc.).
How I respond to valid criticism
Let me show how I respond to valid concerns. Yes, in the specific scenario picked by van Zwet, z-curve.2.0 was overconfident and produced confidence intervals that were too narrow and missed the true value more often than a 95% confidence interval should, namely more than 5 out of 100 times. That is a valid criticism of z-curve.2.0.
I was already working on improving z-curve. Using van Zwet's scenario, I was able to use information in the data to alert z-curve to scenarios that provide little information about the expected discovery rate (in van Zwet's own simulation, 40% of the data contained absolutely no information). I tested z-curve.3.0 with van Zwet's scenario, and in 99 out of 100 simulations the confidence interval contained the true value. Thus, the new confidence intervals provide accurate information about the lack of information about the EDR in the data.
Of course, z-curve is not magic. As the plot shows, the EDR is an estimate of the distribution of non-significant results based on only the significant results. When there are few informative z-values just below significance (z = 1.96 to 2.96), the EDR cannot be estimated. Z-curve.3.0 realizes this and gives a wide confidence interval that ranges from 15% to 98%. This is informative because it tells users that the EDR cannot be estimated and the point estimate cannot be trusted. However, the confidence interval will be smaller and more informative in other situations and with larger sets of studies.
In short: z-curve.2.0 is dead. Long live z-curve.3.0
Now, this is how you respond to valid concerns and demonstration of errors. You learn from them and fix them. That is how real science advances and z-curve has been developed, evaluated, and improved for over 10 years now.
Waiting for Gelman and van Zwet’s Response to this Criticism
It will be interesting to see how van Zwet and Gelman respond to this criticism of their criticism. The ladder of responses is clear and now also includes pointing out errors in my response or in z-curve.3.0. In the age of preregistration, let me preregister my prediction.
4. Avoid looking into the question, ignore the possible error, act as if it had never happened, and keep making the same mistake over and over.
I hope this prediction is a mistake, and I will be happy to correct it when proven wrong.
Andrew Gelman is a statistician at Columbia University. He also maintains a blog where he shares his opinions about many topics, including the replication crisis in psychology and related fields like behavioral economics. He is not an expert in either field, but that does not prevent him from evaluating the research in these areas. You do not have to read a specific blog post by him because the result is often the same: the research is not credible, sample sizes are too small, studies are selected for significance, and meta-analyses are not trustworthy. In his favorite area of statistics, which uses prior assumptions to make sense of actual data, this is known as a dogmatic prior. No amount of data will reverse the conclusion that is already implied by a dogmatic prior. So, you really do not need data.
As you may have guessed, I don't like the guy. I think he is a jerk, and that may cloud my evaluation of him. However, I do have data to support my claim that Gelman's statements often reflect his prior assumptions and are immune to data. He says so himself on his blog.
After discussing some problems with a meta-analysis of nudging studies (a Nobel prize winning idea in behavioral economics), Gelman writes:
Just to be clear: I would not believe the results of this meta-analysis even if it did not include any of the above 12 papers, as I don’t see any good reason to trust the individual studies that went into the meta-analysis. It’s a whole literature of noisy data, small sample sizes, and selection on statistical significance, hence massive overestimates of effect sizes.
What are small sample sizes (some of these studies have hundreds of participants)? Where is the evidence that selection leads to MASSIVE overestimation? Gelman has no answers to such scientific questions about the evidence because he does not care about the data. His prior is sufficient to dismiss an entire literature, not just a few bad studies.
Did I cherry-pick this example? Should you trust me? To find an answer to these questions, you can use AI that can read Gelman's blog within seconds. Ask it to share one of his blog posts where he reversed a prior belief in response to empirical data. I am waiting.
The problem is not that Gelman is opinionated and shares his opinions on a blog (some people may say that is also true of myself). The problem is that he has blind followers who seem to confuse believing Gelman's opinions with meta-science. Actual understanding of problems in science requires investigating these problems with empirical methods and drawing conclusions from data, not believing in conclusions that rest on unproven assumptions.
I am all in favor of open science and a critic of closed pre-publication peer-review. The downside of open communication is that there is no quality control and internet searches will amplify misinformation. This is the case with Erik van Zwet's critique of z-curve. Even though I addressed his criticisms in the comment section, search engines – like humans – do not scroll to the end and process all information. I have even addressed concerns about z-curve.2.0 by improving z-curve.3.0 to handle edge cases like the one used by van Zwet to cast doubt on z-curve's performance in general. In science, facts trump visibility. Z-curve has been validated with many simulations across a wide range of scenarios and works well even with just 50 significant z-values. For more information, check out the Replication Index blog or the FAQ about z-curve page.
The bias in the Bing (AI) summary is evident when we compare it to the Google search summary. The Google summary still makes a false claim about assumptions based on Erik van Zwet's blog post, but it avoids dismissing the method based on a single edge case that was easy to address and is no longer a concern in the new z-curve.3.0. In short, don't trust the first generic response of AI. Use AI to probe arguments.
The latest World Happiness Report gives Jonathan Haidt a megaphone to continue his narrative that decreasing wellbeing among young people can be blamed nearly entirely on social media use (Chapter 3). Chapter 4 shows how assessments of the evidence are biased; the American Psychological Association (APA) is the most biased, but the US Surgeon General report is not much better. Policy is made based on biased readings of the evidence (fortunately, 16-year-olds will find ways, just like they were watching R-rated movies in the old days).
Chapter 3
Chapter 3 is written by Haidt and the website gives a helpful warning that it is a 61-minute read. That is like asking somebody to listen to 24 hours of Fox News to find out how they misrepresent everything to support a criminal president of the United States, where young people are getting less happy. I do not have time for that. Rather, I asked Claude (not war-supporting ChatGPT) to summarize and evaluate the chapter. Importantly, this is not generic Claude. This is a Claude project that knows everything about SWB that I have written in my textbook on this topic. Yes, unlike Haidt, I have studied SWB for 30 years. Is it unbiased? No. But it is an antidote to Haidt's noise machine.
My favorite quote from Claude's review: "The chapter is extraordinarily effective as persuasion and deeply problematic as science." I don't have the scientific answer, but I know that social media are at best only a small piece of the puzzle of why young people in Canada and the USA are suffering. Blaming social media distracts from these problems, and we need to find real answers to create real solutions.
The Structure
The chapter is organized as a legal prosecution — they literally say “we are making the case for the prosecution” and ask readers to use “preponderance of the evidence” as the standard. They present seven “evidence lines” organized as 15 “exhibits”:
Lines 1–3 (Testimony): Surveys of young people showing regret, surveys of parents/teachers/clinicians expressing concern, and leaked internal corporate documents showing companies knew about harms.
Lines 4–7 (Academic evidence): Cross-sectional studies showing heavy users have elevated depression risk, longitudinal studies showing social media at Time 1 predicts depression at Time 2, RCTs showing reduction in social media improves mental health, and natural experiments showing broadband rollout worsened mental health.
They then extrapolate to population-level harm estimates (millions of US teens affected) and argue this is sufficient to answer the “historical trends question” — that social media caused the youth mental health decline.
The Rhetorical Strategy
The chapter is extraordinarily effective as persuasion and deeply problematic as science. The courtroom framing is the tell. In a courtroom, the prosecution presents one side of the case. The defense presents the other. Haidt has cast himself as prosecutor and explicitly asks readers to lower their evidentiary standard. He then preemptively addresses critics not by engaging with their arguments substantively, but by saying their findings support his case once “unblended.”
The “unblending” argument is his central methodological move: whenever a study finds null or small effects, Haidt argues this is because the researchers combined too many outcomes, populations, or technologies. When you restrict to girls, to internalising symptoms, and to social media specifically, the effects get larger. This is a legitimate analytical point — but it’s also a form of specification searching. You can always find larger effects by narrowing the sample and outcome to where the signal is strongest. The question is whether those specifications were preregistered or selected post hoc.
Critical Problems from Your SWB Framework
1. No personality controls anywhere. Not a single study Haidt cites controls for Neuroticism or Depressiveness. Your Chapter 7 work shows these facets explain ~50% of SWB variance. A high-Neuroticism adolescent girl is simultaneously more likely to use social media heavily (rumination, reassurance-seeking), report depression, report body image problems, and perceive social media as harmful. Without personality controls, every “line of evidence” is confounded by the same omitted variable.
2. The testimony evidence is circular. Lines 1–3 amount to: people believe social media is harmful. But people’s causal attributions about their own mental health are unreliable — that’s one of the core lessons of your measurement chapter. If you asked depressed people in the 1990s what caused their depression, many would have blamed television, or music, or whatever was culturally salient. The fact that Meta’s own employees believed their products were harmful is concerning, but it’s still testimony about beliefs, not evidence of causation. Haidt himself acknowledges victims “could be mistaken” but then proceeds to treat the testimony as though it establishes the case.
3. The RCT evidence is weaker than presented. The headline finding is g = 0.19 for depression from social media reduction experiments. But these are short-term studies (often 1–3 weeks) with self-selected participants who know they’re in a social media reduction study. Demand effects are enormous. If you tell someone “we’re testing whether social media is bad for you, please reduce your usage,” the treatment group has a strong expectation of feeling better. There’s no placebo control for that expectation. Compare this to the evidence standards your textbook applies: lottery studies for money effects, migration quasi-experiments for cross-national differences. The RCT evidence here doesn’t come close.
4. The population extrapolations are absurd. Haidt takes an effect size from a 2-week reduction experiment on college students and extrapolates to “a decline in the prevalence of clinical depression by roughly one-third.” This assumes the experimental effect scales linearly to the population, that it would persist beyond two weeks, that college students are representative of adolescents, and that the experimental context (individual reduction while peers stay online) maps onto population-wide changes. None of these assumptions are justified.
5. The “historical trends question” remains unanswered by his own admission — then he answers it anyway. He repeatedly says the product safety question is easier than the historical trends question, and that his evidence is mainly about the former. But then he pivots and says the scale of individual harm “makes it more plausible” that social media caused the population trends. This is a non sequitur. Alcohol causes individual harm at massive scale too, but alcohol consumption hasn’t changed enough to explain any recent population trend in youth mental health.
6. He never mentions housing, employment, economic prospects, or any structural factor. In a 61-minute chapter organized as an exhaustive prosecution of social media, there is zero engagement with the alternative suspects you identified. No mention of housing affordability. No mention of the gig economy. No mention of declining real wages for young people. No mention of student debt. The word “economic” appears only in passing references to other studies’ control variables. This is exactly the bad criminal investigation you described — single suspect, no alibi checking.
7. The asymmetric framing of evidence standards is self-serving. Haidt argues that the cost of wrongly concluding social media is unsafe is low (kids just wait until 16), while the cost of wrongly concluding it’s safe is “catastrophically high.” This is a policy argument dressed up as an epistemological one. By the same logic, the cost of wrongly attributing the youth mental health crisis to social media — while ignoring structural economic causes — is also catastrophically high, because it diverts political attention from housing, labor market, and inequality reforms that could actually help.
Bottom Line
The chapter is a 61-minute advocacy document by someone who wrote a bestselling book with the same thesis and has enormous reputational stakes in being right. The evidence he presents is real but selectively curated, never contextualized against alternative explanations, and extrapolated well beyond what the data support. From the standards your textbook establishes — multi-method designs, halo bias correction, causal evidence hierarchies, personality controls — this chapter wouldn’t pass peer review in the form it’s presented.
Chapter 4
Night and day compared to the Haidt chapter.
This is a proper piece of meta-science rather than advocacy. The authors (this reads like Przybylski’s group) do exactly what a rigorous analysis should do: they ask why three organizations looking at the same literature reached different conclusions, and then systematically show the answer isn’t what evidence was selected but how it was synthesized and communicated.
The key findings are damning for the APA and Surgeon General but in a measured, empirically grounded way. Only 4 out of 617 unique papers appeared in all three reports — yet the methodological and thematic profiles of the cited evidence were statistically indistinguishable. So the divergence is entirely in the synthesis layer: citation accuracy, engagement with disconfirmatory evidence, acknowledgment of limitations, and calibration of certainty to what the evidence actually supports. NASEM comes out well; the APA comes out badly (20% self-citation by advisory panel members, citation clumping, outright mismatches between cited studies and the claims they’re supposed to support); the Surgeon General falls in between but leans toward rhetorical urgency over epistemic humility.
From your book’s perspective, several things stand out.
The chapter’s core argument — that the same evidence base can yield opposite policy conclusions depending on synthesis practices — is essentially a measurement-and-method-variance story applied to the policy level. It’s the same logic as your Chapter 3 telescope metaphor: the instrument (here, the synthesis process) determines what you see, and if the instrument is distorted, the conclusions are distorted. The chapter makes this point more carefully and with better evidence than anything in the Haidt chapter, which was itself an example of the very problem this chapter diagnoses.
The six quality indicators they identify (citation accuracy, contextualisation, limitation acknowledgment, engagement with disconfirmatory evidence, calibrated language, process transparency) are essentially a checklist for distinguishing science from advocacy — which maps directly onto your book’s concern about distinguishing empirical claims from conceptual positions. The Weiss “evidence as ammunition” framing is particularly apt.
The 17% causal-inference figure is striking and worth remembering. Only 59 of 355 on-topic studies used designs that could plausibly support causal claims. That’s the evidence base on which population-level policy is being demanded. Compare that with the causal evidence hierarchy in your Chapter 5 (migration quasi-experiments) and Chapter 6 (lottery studies) — those chapters insist on strong designs before drawing strong conclusions. The social media field is drawing strong conclusions from weak designs, and this chapter documents that quantitatively.
The self-critical limitations section is genuine rather than perfunctory — they flag that their single-coder qualitative analysis is a limitation, that they can’t access internal deliberative processes, and that their classification system produced some inconclusive codes. This is what intellectual honesty looks like in contrast to the other chapters.
Two observations that are more critical. First, the chapter is diplomatically careful to a fault. The APA findings — 20% self-citation, citation mismatches where the cited study literally doesn’t address the claim being made — are described as “concerning” and “worrying, even if rare.” In your blog voice, you’d call that what it is: shoddy scholarship that wouldn’t pass peer review*. The diplomacy is understandable given the WHR context, but it does somewhat blunt the force of findings that are genuinely alarming.
Second, the chapter doesn’t connect its findings back to the WHR’s own social media chapters. It evaluates external organizations’ synthesis practices but doesn’t turn the lens on whether the WHR itself — including the Haidt chapter we just read — meets these same six standards. That’s a missed opportunity, or perhaps a diplomatic impossibility given they’re in the same volume.
But overall: yes, much more balanced. This is what evidence-based analysis looks like when it’s done by people who take methodology seriously.
* Correction: This is not correct. I would not assume that peer-review is quality control. Biased work constantly passes pre-publication peer review. This is the reason why we need open post-publication reviews like this one, just like we have thousands of book reviews that range in ratings from 1 to 5.
Chapter 5
This is Twenge’s chapter, and it’s a mixed bag — competent descriptive empirics wrapped in a rhetorical frame that does some of the very things Chapter 4 just criticized.
The strength is the data. PISA gives you nationally representative samples of 15–16-year-olds across 47 countries with the same measures, which is a genuine advantage over the US/Canada/UK-dominated literature the chapter itself flags. The regional breakdowns are useful, and the finding that the social media–life satisfaction association is essentially null for boys outside of English-speaking countries and Western Europe is important — it’s the kind of finding that complicates the “social media is harming youth” narrative rather than confirming it.
The curvilinearity point is well taken and the observation about greater variance among non-users and heavy users is genuinely interesting. Both non-users and heavy users show elevated rates of both very low and very high life satisfaction, which suggests these are heterogeneous groups — some non-users are thriving, some are isolated; some heavy users are socially engaged, some are compulsively scrolling. That’s a finding that resists simple policy prescriptions, and the chapter deserves credit for reporting it.
Now the problems.
The relative risk versus linear r argument is the chapter’s rhetorical centerpiece, and it’s doing a lot of work that isn’t fully warranted. Yes, it’s true that linear r is poorly suited for curvilinear associations, and the polio/aspirin/seatbelt analogies are vivid. But those analogies are misleading in a fundamental way: polio vaccination has a known causal mechanism, aspirin has RCT evidence, and seatbelts have physics. Social media use and life satisfaction have a cross-sectional correlation from a single time point. Relative risk sounds more impressive than r = .10, but repackaging a cross-sectional association as a relative risk doesn’t make it causal. A 50% increase in “risk” of low life satisfaction among heavy users is still a 50% increase in a cross-sectional association that cannot distinguish cause from selection. The chapter acknowledges this in one sentence near the end (“this research is correlational and, thus, cannot rule out reverse causation or third variables”) but spends several paragraphs building the rhetorical frame that makes the effects sound large and practically important before that caveat appears.
This is exactly the “calibrating certainty to conclusion strength” problem that Chapter 4 just documented. The chapter front-loads the impressive-sounding relative risk statistics and buries the causal limitations.
From your book’s measurement framework, several issues stand out. The social media measure is a single item asking about “browsing social networks” on a “typical weekday,” which is essentially asking adolescents to estimate their own screen time — precisely the measure the chapter’s own literature review acknowledges adolescents are poor at estimating (line 30 cites this limitation for the field, then proceeds to use exactly such a measure). The life satisfaction measure is a single 0–10 item. Both are self-reported by the same person at the same time. Your Chapter 3 telescope metaphor applies: we’re looking through a fairly blurry instrument here, and the chapter never discusses the validity limitations of these specific measures.
The response style point the chapter raises almost in passing (line 438 — some respondents may routinely choose extreme responses, linking heavy use and 10/10 satisfaction artifactually) is actually a serious methodological concern that deserves much more than a sentence. If extreme responding is a confound, it could explain the elevated very-high-satisfaction rates among heavy users — which is one of the chapter’s most interesting findings. The chapter identifies the problem and then moves on without grappling with it.
The absence of any control variables is glaring. No personality. No family income (the chapter acknowledges PISA lacks this). No in-person social interaction (also acknowledged). No school belonging — which is ironic given that the WHR’s own Chapter 3 found school belonging effects 4–6 times larger than social media effects. The chapter is essentially reporting raw bivariate associations between two self-report variables measured at a single time point, with no covariates, and then framing them in relative risk language that implies practical importance.
There’s also a notable asymmetry in how the chapter handles regional variation. When the association is significant (girls in Western Europe, English-speaking countries), it gets highlighted. When it’s null (boys in Asia, Latin America, Middle East/North Africa), it gets reported but with less interpretive weight. The null findings are actually the majority pattern for boys — in most of the world, the association between social media use and boys’ life satisfaction is essentially zero. A more balanced reading would lead with that finding: for most adolescent boys globally, there is no meaningful association between social media use and life satisfaction.
The self-citation pattern is worth noting. Twenge cites herself (Twenge & Hamilton 2022, Twenge & Farley 2021, Twenge & Martin 2020, Twenge et al. 2018) repeatedly — four of the roughly 20 references are her own work, including the paper that introduced the relative risk framing. This isn’t disqualifying, but given that Chapter 4 just flagged the APA’s 20% self-citation rate as a concern, it’s notable.
Bottom line: this chapter is more honest about the data than the Haidt chapter — it reports null findings for boys in most regions, it shows the curvilinear pattern, and it flags the variance issue among non-users and heavy users. But the rhetorical packaging oversells the findings. The relative risk framing makes cross-sectional associations sound like established health risks, the causal limitations are acknowledged but not given proportionate weight, and the absence of any covariate adjustment means we have no idea how much of these associations would survive basic controls for personality, socioeconomic status, or social engagement. Chapter 4’s own standards — citation accuracy, engaging with complexity, calibrating certainty to evidence strength — would give this chapter a middling grade: better than the APA and OSG reports, but not meeting the NASEM standard.
Personal note: If men decrease in happiness nearly as much as women, and the social media effect is more gendered and often does not show up for males, it directly points to other factors that decrease happiness for young people. The same line of reasoning was used to find out that bad air quality was not the cause of lung cancer: men got lung cancer, women did not, and now we know that the reason was that men were smoking and women were not.
Chapter 6
Sunstein’s chapter is the most intellectually interesting in the social media section. The “product trap” concept — people who would demand money to quit TikTok individually but would pay to have everyone quit simultaneously — is a genuine insight about coordination failures in network goods. The party analogy is effective and the preference reversal is well-documented.
But three problems undermine the conclusions.
First, the entire chapter rests on three studies, two involving the author himself. That’s an essay, not an evidence review. The “Key Insights” box presents sweeping conclusions (“if social media platforms did not exist, many users would be better off”) that outrun a three-study base.
Second, Sunstein acknowledges that both his WTP and WTA measures are unreliable — low WTP may be “protest answers,” high WTA reflects the standard endowment effect — and then draws welfare conclusions from them anyway. If your thermometer is broken in both directions, you can’t read the temperature.
Third, and most fundamentally: there’s no baseline. The entire argument — people use it compulsively, wouldn’t pay for it, recognize it as time-wasting, and are modestly better off without it — describes television in 1975. Americans watched 6+ hours daily, wished they watched less, and the few reduction studies showed small wellbeing gains. Nobody concluded TV should be abolished. Sunstein never demonstrates that social media is uniquely trapping compared to every previous generation’s Wasting Time Good. The coordination failure he documents is a feature of any network good — you could run the Bursztyn experiment on email or mobile phones and probably get similar results. The question isn’t whether network effects create traps; it’s whether this trap is worse than its predecessors. The chapter never asks.
Finally, my question. The Economist published a figure based on the WHR results showing that Anglo nations are decreasing in happiness and diverging from happy Scandinavia. That is the real story in the data. So, why is the report about social media and not about the real trend in the data that needs to be examined? Is social media a cover-up to distract from real problems in Angloland?
Publish or perish. I heard this in the 1990s, but it is even more true today. Submitting manuscripts has gotten easier, too. It cost me real money to mail three copies of a manuscript from Germany to the United States (Schimmack, 1996). Now, you just need to check all the boxes on a submission portal. Not an easy task, but virtually cost-free.
This system is like a lottery, where tickets are cheap and winnings can be rewarding. No wonder authors are playing the lottery and submitting manuscripts in large numbers, even if the chances of rejection are high. Maybe journals should charge for submissions rather than for publications.
Anyhow, I just reviewed a manuscript in 30 minutes. It was conceptually flawed. More importantly, my own AI – trained on this area of research – also spotted the conceptual problem, along with several other problems that I did not even bother to check because it would take too long for a human reader to do so (life is short at age 60). It also wrote a nice and detailed review, much better than most human reviews. Of course, it had the advantage of being trained on this research area, but I also submitted the manuscript to a generic AI with no special knowledge. It also spotted the fatal conceptual mistake. This brings me to the main point of this rant.
Dear authors, do yourself and others a favor. Use AI to review your paper before you submit it. Even better, ask it to evaluate the paper from the perspective of legendary Reviewer 2 and address critical issues before you submit it to a journal. You save yourself time and effort, but more importantly, you are a good citizen and do not clog the peer-review system with flawed manuscripts in the hope that they pass peer-review despite major problems.
No polite ChatGPT edits. Unfiltered raw Schimmack. Love it or hate it.
It was supposed to be the American Psychological Society (APS), but international researchers complained – especially those who want to publish in prestigious American journals – and APS became the Association for Psychological Science.
Psychological Science is now a brand name and many departments have been renamed to be Departments of Psychological Science. However, you do not become a science just because you call yourself one; you actually have to behave like a science. And that seems to be something that many psychologists do not want to do because it would mean letting data decide about the truth. Just like William James, many psychologists like their theories more than the truth. So, they continue to conduct silly statistical rituals (Gigerenzer) that are biased to show either evidence for their beliefs (p < .05) or no evidence against them (p > .05) and justify another biased test.
Every generation there have been a few psychologists who were frustrated by the futility of this and made suggestions to improve things (Meehl, Cohen, Gigerenzer) or who just faked the data (Stapel). You have to give it to Stapel. Why collect data if their only purpose is to add p < .05 to any claim one wants to make?
Since the early 2010s, thanks to Bargh and Bem, more people are calling for change, but progress is slow and stalling. Meanwhile, most published articles continue to report claims with p-values below .05.
A cynical approach to this sad state of affairs would be to say “fuck it”, “burn it all down,” and enjoy life. However, some people just can’t let go. We (Brunner, Bartos, Schimmack) developed a statistical method that helps readers to distinguish between good and bad significant results. Good ones come from studies with high statistical power that are likely to replicate. Bad ones are studies with low power or even false positive results that will not replicate. Of course, there is no hard line, but we can identify subsets of good studies, if they exist.
You would think an aspirational science would welcome a tool that can salvage good results from decades of research with mostly significant results. Which ones are trustworthy? Which ones are like pornception (Bem, 2011)?
But being a science would mean that we have to expose the fact that some results were made up – not like Stapel on his laptop – but by collecting and analyzing data, year after year, painstaking work to get significant results – and many unpublished failures. No, we cannot have this. Therefore, we have to fight the method that can distinguish good and bad research.
To fight this method, we need to get a peer-reviewed article that claims "the method does not work." To do so, the article does not have to be evaluated by statisticians or present good arguments. All we need is a quotable peer-reviewed article, because peer-reviewed equals truth, which is also why extrasensory perception is true (Bem, 2011, JPSP).
Now reviewers can quote the criticism – and not cite evidence that contradicts these claims – and editors can use the peer-review to reject the article. The key feature of science is to fight motivational biases. If a system just amplifies misinformation and glorifies misinformation that passed peer-review, it is not a science. Maybe APS really means Anti-Psychological Science.
The question is how long this game of self- and other-deception can continue. At what point will public interest in psychology wane because it never produces any useful results that advance society, health, and wellbeing? Science is worth defending against the attacks by Trumpians, but I am not sure psychological science is part of this.
All options are set as global variables when the functions are installed with source(zcurve3). Afterwards, they can be changed like any other R object.
1. Curve Type: Default z-values; option to fit t-distributions with a fixed df
CURVE.TYPE <- "z"  # Set to "t" for t-distribution
df <- c()          # Set to the df of the t-distribution
2. Speed Control Parameters
parallel <- FALSE      # Placeholder - parallel functionality not yet implemented
max_iter <- 1e6        # Max iterations for model estimation
max_iter_boot <- 1e5   # Max iterations for bootstrapped estimates
EM.criterion <- 1e-3   # Convergence threshold for EM algorithm
EM.max.iter <- 1000    # Max iterations for EM
Plot.Fitting <- FALSE  # Plot fitting curve (only for Est.Method = "OF" or "EXT")
PLOT SETTINGS
Title <- ""                   # Optional plot title
letter.size <- 1              # Text size in plots
letter.size.1 <- letter.size  # Used for version labels in plot
y.line.factor <- 3            # Controls spacing of plot text
Show.Histogram <- TRUE        # Toggle histogram in plot
Show.Text <- TRUE             # Toggle model results in plot
Show.Curve.All <- TRUE        # Show predicted z-curve
Show.Curve.Sig <- FALSE       # Option: show z-curve only for significant values
Show.Significance <- TRUE     # Show z = critical value line
Show.KD <- FALSE              # Toggle kernel density overlay (density method only)
sig.levels <- c() # Optional: mark additional p-value thresholds on plot
int.loc <- 0.5         # Plot local power intervals below x-axis (set 0 to disable)
hist.bar.width <- 0.2  # Width of histogram bars
bw.draw <- 0.10        # Smoothing for kernel density display
CONSOLE OUTPUT
Show.Iterations <- TRUE # Show iterations for slow procedures (e.g., EXT, TEST4HETEROGENEITY)
Est.Method <- "OF"         # Estimation method: "OF", "EM", or "EXT"; clustered data: "CLU-W" (weighted), "CLU-B" (bootstrap)
Int.Beg <- 1.96            # Default: critical value for alpha = .05
Int.End <- 6               # End of modeling interval (z > 6 = power = 1)
ncp <- 0:6                 # Component locations (z-values at which densities are centered)
components <- length(ncp)  # Number of components
zsd <- 1                   # SD of standard normal z-distribution
zsds <- rep(zsd, components)  # One SD for each component
just <- 0.8 # Cutoff for “just significant” z-values (used in optional bias test)
ZSDS.FIXED <- FALSE  # Fix SD values for EXT method
NCP.FIXED <- FALSE   # Fix non-centrality parameter (NCP) means for EXT method
W.FIXED <- FALSE     # Fix weights for EXT method
fixed.false.positives <- 0 # If > 0, constrains proportion of false positives (e.g., weight for z = 0 component)
DENSITY-BASED SETTINGS (Only used with Est.Method = "OF")
n.bars <- 512 # Number of bars in histogram
Augment <- TRUE              # Apply correction for bias at lower bound
Augment.Regression <- FALSE  # Use slope for augmentation
Augment.Factor <- 1          # Amount of augmentation
bw.est <- 0.05               # Bandwidth for kernel density (lower = less smoothing, higher = more smoothing)
bw.aug <- .20                # Width of augmentation interval
INPUT RESTRICTIONS
MAX.INP.Z <- Inf # Optionally restrict very large z-values (set Inf to disable)
CONFIDENCE INTERVALS / BOOTSTRAPS
boot.iter <- 0          # Number of bootstrap iterations (suggest 500+ for final models)
ERR.CI.adjust <- 0.03   # Conservative widening of confidence intervals for ERR
EDR.CI.adjust <- 0.05   # Conservative widening for EDR
CI.ALPHA <- 0.05 # CI level (default = 95%)
CI levels for Heterogeneity Test
fit.ci <- c(.01, .025, .05, .10, .17, .20, .50, .80, .83, .90, .95, .975, .99) # CI levels for model fit test
TEST4BIAS <- FALSE       # Enable optional bias test
TEST4HETEROGENEITY <- 0  # Optional heterogeneity test (slow): set number of bootstrap iterations
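Because all of these options are plain global variables, they can be overridden after sourcing the script and before running an analysis. A minimal usage sketch, assuming the script file is named zcurve3.R (adjust the file name to your local copy):

source("zcurve3.R")  # installs the functions and sets the defaults listed above
Est.Method <- "EM"   # switch from the default "OF" estimation method
boot.iter <- 500     # request bootstrapped confidence intervals for a final model
Show.KD <- TRUE      # overlay the kernel density in the plot (density method only)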