Blogging about statistical power, replicability, and the credibility of statistical results in psychology journals since 2014. Home of z-curve, a method to examine the credibility of published statistical results.
Show your support for open, independent, and trustworthy examination of psychological science by getting a free subscription. Register here.
“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).
DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the first study using the same sample size and significance criterion (Schimmack, 2017).
See Reference List at the end for peer-reviewed publications.
Mission Statement
The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.
To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools, such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0: Brunner & Schimmack, 2020; Version 2.0: Bartos & Schimmack, 2021).
I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science.
Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects by John Bargh also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).
Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I have also started providing information about the replicability of individual researchers, along with guidelines on how to evaluate their published findings (Schimmack, 2021).
Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).
If you are interested in my story how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).
References
Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1–22. https://doi.org/10.15626/MP.2018.874
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. http://dx.doi.org/10.1037/a0029487
Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246
Andrew Gelman is well known for strong opinions about psychological science, including its methods and research culture (Fiske, 2017). For the most part, he writes as if psychologists are still following a statistical ritual that cannot produce meaningful results. This criticism is not new. It was already made by influential psychologists and methodologists, including Cohen (1990, 1994) and Gigerenzer (2004). The problem with Gelman’s critique is that it is outdated and largely ignores the discussion of null-hypothesis significance testing that took place in psychology during the 1990s. As evidence for this claim, one can simply inspect the reference list of Gelman and Carlin (2014). This article, published in Perspectives on Psychological Science, does not cite Cohen (1990, 1994), Gigerenzer (2004), or Tukey’s directional reformulation of significance testing (Tukey, 1991; Jones & Tukey, 2000). Although an outsider perspective can be useful for challenging untested assumptions, a commentary that ignores key insights produced by eminent statisticians and methodologists within psychology is unlikely to do so.
The Null-Hypothesis Significance Testing Strawman
As Gigerenzer (2004) pointed out, statistics is often taught as a ritual to be followed rather than as a principled approach to drawing conclusions from data. Rituals are not necessarily bad, but in science it is usually better to understand the rationale and assumptions underlying routine practices.
Null-hypothesis significance testing (NHST) has been described and criticized for decades (Tukey, 1991; Cohen, 1994). Most students of psychology will recognize the following brief description of it. First, researchers collect data that relate one variable to another. Ideally, this is an experiment in which one variable is experimentally manipulated (the independent variable) and the other is observed (the dependent variable). In experiments, a relationship between the independent and dependent variable may justify causal claims, but NHST itself is indifferent to causality. It can be applied to both experimental and correlational data. The main information produced by statistical analyses is the p-value. P-values below a conventional threshold are called statistically significant; those above the threshold are treated as not significant (ns). Significant results are easier to publish. As a result, data analysis often becomes a series of statistical tests searching for statistically significant results (Bem, 2010).
This approach to data analysis has been criticized for several reasons. First, statistical significance by itself does not provide information about effect size. For this reason, psychologists have increasingly reported effect-size estimates in addition to tests of statistical significance, in large part due to Cohen’s (1990) emphasis on effect sizes. Second, NHST has been criticized for its focus on statistically significant findings. Psychology journals have long reported rates of over 90% statistically significant results (Sterling, 1959; Sterling et al., 1995). Publication bias in favor of significant results then leads to inflated effect-size estimates (Rosenthal, 1979).
Most importantly, NHST has been criticized because it appears to reject a null hypothesis that is known to be false before any data are collected. Cohen (1994) called this the nil hypothesis. The nil hypothesis assumes that the population effect size is exactly zero. Statistical significance is then taken to imply that this hypothesis is unlikely to be true and can be rejected. The problem is that rejecting one specific possible effect size tells us very little about the data. It would be equally uninformative to test the hypothesis that the effect size equals any other single value, such as Cohen’s d = .20. So what if the effect size can be said not to be 0 or .20? It could still be 0.01 or 1.99. In short, hypothesis testing with a single point as the null hypothesis is meaningless. Yet that is exactly what psychological articles seem to be reporting when they state p < .05.
What Psychological Scientists Are Implicitly Doing
In reality, however, psychological scientists are doing something different. It may look as if they are testing the nil hypothesis, but in practice they are often testing two directional hypotheses at the same time (Kaiser, 1960; Lakens et al., 2025; Tukey, 1991; Jones & Tukey, 2000). When the nil hypothesis is rejected, researchers do not merely conclude that there is a difference. They also inspect the sign of the effect size estimate and infer that the experimental manipulation increased or decreased behavior.
Some authors have argued that drawing directional conclusions from a two-sided test is conceptually problematic (e.g., Rubin, 2020). However, Jones and Tukey (2000) explain the rationale for doing so. The easiest way to see this is to reinterpret the standard nil-hypothesis test as two directional tests with two complementary null hypotheses. One null hypothesis states that the effect size is zero or negative. The other states that the effect size is zero or positive. Rejecting the first leads to the inference that the effect is probably positive. Rejecting the second leads to the inference that the effect is probably negative. Viewed this way, zero is simply the boundary between two rejection regions.
Because NHST can be understood in this way as involving two directional possibilities, alpha must be allocated across both tails to maintain the long-run error rate. No psychology student would be surprised to see a t distribution with 2.5% of the area in each tail. Each tail represents the error rate for one directional rejection, and together they produce the familiar two-sided alpha level of 5%.
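To make this concrete, here is a minimal sketch (my own illustration, not from the articles cited above) of how a two-sided t-test at alpha = .05 can be read as two directional tests at alpha/2 = .025 that share the same critical value; the degrees of freedom and the observed t-value are hypothetical.

```python
from scipy import stats

alpha, df = 0.05, 98                        # e.g., two groups of n = 50
t_crit = stats.t.ppf(1 - alpha / 2, df)     # ~1.98; shared boundary of both directional tests

t_obs = 2.30                                # hypothetical observed t-value
infer_positive = t_obs >= t_crit            # reject "effect is zero or negative"
infer_negative = t_obs <= -t_crit           # reject "effect is zero or positive"
significant_two_sided = abs(t_obs) >= t_crit

# The usual two-sided decision is the disjunction of the two directional decisions.
assert significant_two_sided == (infer_positive or infer_negative)
print(infer_positive, infer_negative)       # True False -> infer a positive effect
```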
Most psychology students are not taught that they are implicitly conducting directional tests when they interpret significant p values, but their actual practice shows that this is what they are doing. They routinely draw directional inferences from NHST, and this is a legitimate use of the procedure. It also makes NHST more meaningful than the strawman version in which researchers merely reject an exact value of zero that is often known in advance to be false.
Using NHST to infer the direction of population effects is meaningful because researchers often do not know that direction before data are collected. Empirical data can therefore provide genuinely new information. This is not a full defense of NHST, because effect size and practical importance can still be ignored, but it does show that psychologists have not spent decades and millions of dollars merely to establish that effect sizes are not exactly zero.
Gelman’s Type-S Error
Gelman and Tuerlinckx (2000) criticized NHST because “the significance of comparisons … is calibrated using the Type 1 error rate, relying on the assumption that the true difference is zero, which makes no sense in many applications.” To replace this framework, they proposed focusing on Type S error, where S stands for sign. A Type S error occurs when a researcher makes a confident directional claim even though the true effect has the opposite sign.
The label Type S error is potentially confusing because it suggests a replacement for the Type I error framework rather than a refinement of it. A Type I error is the unconditional long-run probability of falsely rejecting a null hypothesis across all tests that are conducted. For example, suppose a researcher conducts 100 tests with a significance criterion (alpha) of 5%. This criterion ensures that in the long run no more than 5% of all tests will be false positives. Testing at least some real effects will reduce the probability of a false positive. For example, if all studies have high power to detect a true effect, the probability of a false positive is zero (Soric, 1989). Thus, alpha sets a range of the relative frequency of false positives between 0 and alpha.
This unconditional probability must be distinguished from the conditional probability of error among the subset of studies that produced statistically significant results. In the previous example, if only 5 results were significant, it is likely that all 5 rejections were errors and that the conditional probability of a false positive given a significant result is 5 / 5 = 100% (Sorić, 1989). The proportion of false rejections among statistically significant results is called the false discovery rate (FDR), and the estimation and control of FDRs has become a large literature in statistics (Benjamini & Hochberg, 1995).
Applying Jones and Tukey’s interpretation of NHST to false discovery rates, a false discovery occurs not only when the true effect size is zero but also when it is in the opposite direction of the significant result. Gelman’s Type S error rate, also called the false sign rate (Stephens, 2017), assumes that effect sizes are never zero and counts only false rejections with the opposite sign. False sign rates are necessarily smaller than false discovery rates because wrong-sign rejections are only a subset of all false rejections. Exact-zero effects can produce significant results in either direction, whereas nonzero effects make correct-sign rejections more likely and wrong-sign rejections less likely.
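A small simulation can make the difference between the two error rates concrete. This is my own sketch, not an analysis from the cited papers; the mix of null and true effects, the effect size of d = .4, and the sample size are arbitrary choices.

```python
import numpy as np
rng = np.random.default_rng(1)

n_tests, n_per_group, z_crit = 100_000, 20, 1.96

# Half of the hypotheses are exact nulls (d = 0), half are small true effects (d = .4).
d = np.where(rng.random(n_tests) < 0.5, 0.0, 0.4)
ncp = d * np.sqrt(n_per_group / 2)          # noncentrality of the two-sample z statistic
z = rng.normal(ncp, 1.0)                    # observed z-values

sig = np.abs(z) > z_crit
false_null = sig & (d == 0)                 # significant rejections of a true nil hypothesis
wrong_sign = sig & (d > 0) & (z < 0)        # significant, but with the wrong sign
fdr = (false_null | wrong_sign)[sig].mean()       # false discovery rate (Jones & Tukey sense)
false_sign_rate = wrong_sign[sig].mean()          # Gelman's Type S / false sign rate
print(round(sig.mean(), 3), round(fdr, 3), round(false_sign_rate, 4))
```

In this scenario the false discovery rate is substantial, while the false sign rate is close to zero, which is the asymmetry described above.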
The key source of confusion is that Gelman’s criticism of NHST and FDR estimation rests on a misunderstanding of NHST (Gelman, 2021). He maintains that FDR estimates are limited to the unlikely scenario that an effect is exactly zero and ignore sign errors. However, as Jones and Tukey (2000) pointed out, psychological researchers routinely use NHST as a directional sign test. Once NHST is understood in this way, Type S errors are no longer a fundamentally new kind of inferential problem and are already included in conditional and unconditional error rates. Moreover, NHST provides researchers with concrete statistical tools to estimate and control error rates, whereas Gelman’s Type S error is not something that can be estimated and was introduced as a rhetorical tool without practical use (Gelman, 2025; Lakens et al., 2025). In contrast, estimation of false discovery rates and false sign rates is an active area of research in statistics that builds on the foundations of NHST (Benjamini & Hochberg, 1995; Stephens, 2017) and has been largely ignored in psychology.
Statistical Power
So far, the distinction between Type I and (unconditional) Type S errors is mostly harmless. It may even help clarify that NHST is really used as a test of the sign of the population effect size rather than as a literal test of the nil hypothesis (Jones & Tukey, 2000). However, the wheels come off when Gelman and Carlin (2014) extend this critique from Type I error to Type II error and statistical power.
The distinction between Type I and Type II errors was introduced by Neyman and Pearson. A Type II error is the probability of failing to reject a false null hypothesis. Neyman and Pearson were cautious and avoided framing results as inferences about a true effect or as acceptance of a true hypothesis. In practice, however, failure to reject a false hypothesis means that either the population effect is positive and the study failed to produce a statistically significant result with a positive sign, or the population effect is negative and the study failed to produce a statistically significant result with a negative sign.
Statistical power is simply the complementary probability of obtaining a statistically significant result with the correct sign. Unlike the discussion of Type I errors, there is no important distinction here between a point null and an opposite-sign error. Power calculations are inherently directional. Researchers assume either a positive or a negative effect and then choose a design and sample size that reduce sampling error while controlling the Type I error rate. For example, a comparison of two groups with n = 50 per group, a population effect size of half a standard deviation (Cohen’s d = .50), and alpha = .05 has about a 70% probability of producing a statistically significant result with the correct sign.
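This 70% figure is easy to check with a normal approximation (my own sketch of the calculation; the exact noncentral-t result is very similar):

```python
import numpy as np
from scipy import stats

n, d, alpha = 50, 0.50, 0.05
ncp = d * np.sqrt(n / 2)                          # noncentrality of the test statistic
z_crit = stats.norm.ppf(1 - alpha / 2)            # ~1.96
power_correct_sign = 1 - stats.norm.cdf(z_crit - ncp)   # significant AND positive sign
print(round(power_correct_sign, 2))               # ~0.70
```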
By definition, then, power already concerns rejections with the correct sign. At this point, there is no meaningful difference between standard NHST and Gelman’s Type S framework (Stephens, 2017). The only minor difference arises in hypothetical scenarios with extremely low power. For two-sided (non-directional) power calculations, low power can produce significant results with sign errors. To use NHST as a sign test in Jones and Tukey’s framework of two simultaneous one-sided tests, power should be estimated for one-sided directional tests with alpha/2. In practice, however, this distinction is irrelevant because Gelman and Carlin already showed that even modest power of 50% renders sign errors practically impossible.
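A quick calculation, again my own sketch under a normal approximation, shows why sign errors become negligible at 50% power: the noncentrality then equals the critical value, so a wrong-sign significant result requires the test statistic to fall almost four standard errors below its expected value.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)                    # ~1.96
ncp = z_crit                                      # noncentrality that yields ~50% power
p_correct_sign = 1 - stats.norm.cdf(z_crit - ncp) # ~0.50: significant with the correct sign
p_wrong_sign = stats.norm.cdf(-z_crit - ncp)      # ~0.00004: significant with the wrong sign
print(p_correct_sign, p_wrong_sign)
```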
Thus, the main concern about Gelman and Carlin’s (2014) article is the false implication that power calculations ignore sign errors and that researchers must move “beyond power” to control them. Grounding NHST in Jones and Tukey’s (2000) framework of two simultaneous directional tests shows that power calculations are not flawed. High power prevents both false negatives and sign errors. Gelman’s critique rests on a false premise: the assumption that NHST is nil-hypothesis testing. Under that assumption, power appears disconnected from sign errors. But once NHST is understood as directional inference, the criticism is invalid. Power analysis is not only useful but essential for controlling sign errors and the false sign rate.
Implications
Gelman positions the Type S error as a new concept that requires moving “beyond power” because “power analysis is flawed” (p. 641). On closer inspection, power analysis is necessary and sufficient to control Type S error rates. Studies with high power ensure that most significant results have the correct sign, and high power also ensures a high discovery rate, which limits the proportion of false discoveries (Sorić, 1989). Power delivers everything needed to make significant results credible. It is paradoxical to criticize psychology for relying on small samples while also criticizing the tool that tells researchers how to avoid them. Cohen’s lasting contribution was precisely this: demonstrating that many studies lack power to detect plausible but small effect sizes and providing the tools to do better (Cohen, 1962).
Gelman and Carlin’s (2014) framing of power as flawed may have added to misunderstandings about the role of power in ensuring credible results. NHST and power analysis are not flawed. They are statistical tools for drawing conclusions about the direction of population effect sizes (Maxwell, Kelley, & Rausch, 2008). It would be desirable to conduct all studies with enough precision to provide informative effect size estimates, but limited resources often make this impossible. Meta-analysis of smaller studies can yield precise estimates, provided results are reported without selection bias. Reporting outcomes regardless of statistical significance is the most effective way to address selection bias, which remains the biggest threat to the credibility of NHST in practice (Sterling, 1959).
The real problem of NHST is not solved by a focus on Type S errors. The real problem is that non-significant results are inconclusive because failure to provide evidence for a positive or negative effect does not allow inferring the absence of an effect (Altman & Bland, 1995). The solution is to distinguish three hypotheses (Rice & Krakauer, 2023): (a) the effect is positive and larger than a smallest effect size of interest, (b) the effect is negative and larger in magnitude than a smallest effect size of interest, and (c) the effect falls within a region of practical equivalence around zero. Evidence for absence is established if the confidence interval falls entirely within the middle region. Replacing the point nil hypothesis with a range of practically equivalent values is an important addition to statistics for psychologists (Lakens, 2017; Lakens, Scheel, & Isager, 2018). It helps distinguish between statistical and practical significance, and it can turn non-significant results into significant evidence for the absence of a meaningful effect. However, providing evidence for absence often requires large samples because precise confidence intervals are needed to fit within a narrow region around zero. Power analysis remains essential for planning studies with this goal.
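As a toy illustration of this three-way decision, the sketch below assumes a smallest effect size of interest (SESOI) of d = .20 and a 95% confidence interval for a standardized effect. The function name, the threshold, and the example intervals are all hypothetical.

```python
def classify(ci_lower, ci_upper, sesoi=0.20):
    """Three-way decision based on a confidence interval and a region of practical equivalence."""
    if ci_lower > sesoi:
        return "positive effect larger than the SESOI"
    if ci_upper < -sesoi:
        return "negative effect larger in magnitude than the SESOI"
    if -sesoi < ci_lower and ci_upper < sesoi:
        return "practically equivalent to zero (evidence of absence)"
    return "inconclusive"

print(classify(0.05, 0.15))    # practically equivalent to zero
print(classify(-0.10, 0.45))   # inconclusive
print(classify(0.25, 0.60))    # positive effect larger than the SESOI
```

As the second example shows, wide confidence intervals remain inconclusive, which is why establishing evidence of absence typically requires large samples.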
Conclusion
Continued controversy about NHST shows that better education about its underlying logic is needed. Jones and Tukey (2000) provided a clear explanation that deserves to be foundational for the teaching of NHST. Understanding NHST as two simultaneous directional tests avoids the confusion created by decades of criticism directed at a strawman version of the procedure. NHST has persisted for nearly a century despite harsh criticism because it provides a minimal but useful inference: determining the likely sign of a population effect size. Students need to learn about the real limitations of NHST and how they can be addressed. Changing statistical methods does not solve the problem that researchers need to publish and that precise effect size estimates are often out of reach. Even power to infer the sign of an effect is often low. Honest reporting of a single well-powered study is more important than reporting multiple underpowered studies that are p-hacked or selected for significance (Schimmack, 2012). With good data, different statistical approaches lead to the same conclusion. Open science reforms that improve the quality of data are more important than new statistical methods. The main reason NHST continues to attract criticism is that criticism is easy, but finding a better solution is harder. Real progress requires a real analysis of the problem. NHST has many problems, but ignoring sign errors is not one of them.
It’s sooooo frustrating when people get things wrong, the mistake is explained to them, and they still don’t make the correction or take the opportunity to learn from their mistakes.
This could have been written by me or many other people who are in the business of calling out other people’s mistakes. In theory, that would be all scientists because science is supposed to progress by correcting mistakes. However, academia is not science, and many academics don’t like to face their own mistakes. The more their status and reputation depend on some claim they made in the past, the more reluctant people are to admit that they were wrong. Max Planck famously declared that science only progresses when pig-headed prominent scientists die and the field can move on. But humans are human, and public admission of mistakes is not a virtue in modern capitalist science that rewards self-promotion and sexed-up research findings.
While it is true that the incentives are against public admission of mistakes, there are notable exceptions. Daniel Kahneman, after he won a Nobel Prize, was able to admit some mistakes. Maybe it takes a Nobel to overcome nagging feelings of self-doubt and defensiveness. I hope not. I have corrected some of my mistakes, but I have to admit that it sometimes took a long time to admit them. At the same time, I have also pushed back against critics who were wrong. The real problem, of course, is to know the difference. Accepting valid criticism and rejecting invalid criticism requires knowing what is valid and what is invalid. Thus, the question for all actors (critic, person being criticized, and observers) is “Who is right?”
The content of the blog post, however, conflates responding to criticism with responding to an error in one’s work.
Consider the following range of responses to an outsider pointing out an error in your published work:
Look into the issue and, if you find there really was an error, fix it publicly and thank the person who told you about it.
Look into the issue and, if you find there really was an error, quietly fix it without acknowledging you’ve ever made a mistake.
Look into the issue and, if you find there really was an error, don’t ever acknowledge or fix it, but be careful to avoid this error in your future work.
Avoid looking into the question, ignore the possible error, act as if it had never happened, and keep making the same mistake over and over.
If forced to acknowledge the potential error, actively minimize its importance, perhaps throwing in an “everybody does it” defense.
Attempt to patch the error by misrepresenting what you’ve written, introducing additional errors in an attempt to protect your original claim.
Attack the messenger: attempt to smear the people who pointed out the error in your work, lie about them, and enlist your friends in the attack.
As you can see, there is no option in which the person looks into the issue, finds a mistake in the criticism, points out the mistake, and the critic apologizes and thanks the person being criticized for engaging constructively and taking the time to address their concern.
A Case Study
Take Erik van Zwet’s post “Concerns about z-curve” as an example. The post contains several mistakes about z-curve. Some mistakes are glaring, like being a reviewer of z-curve and then claiming it was not vetted by experts.
1. The strange fact, not mentioned in van Zwet’s blog post, is that he wrote a favorable review of z-curve when he was a reviewer of z-curve.2.0. Claiming that z-curve was not reviewed by experts implies that he is not an expert, but if he is not an expert, that undermines his critique of z-curve.
2. van Zwet then claims that the z-curve method is based on the assumption that the absolute values of the SNRs have a discrete distribution supported on 0, 1, 2, …, 6. That statement confuses the default settings of the z-curve package with the z-curve method. Criticizing these defaults is fine, but confusing default settings with a method is not. Bayesian statisticians like Gelman and van Zwet, of all people, should know the difference.
If somebody uses Gelman’s statistical tool, Stan, with bad priors, it leads to bad results. The problem is not the tool but the prior. I made this point clear in the comment section and pointed out that z-curve handles specific edge cases where the defaults fail by changing the defaults.
3. In the conclusion, van Zwet generalizes from a single scenario in which z-curve underestimates uncertainty to imply that z-curve is always unreliable: “In my opinion, statistical methods should be reliable when their assumptions are met. I don’t think unreliable methods should be used because no better methods are available.”
Once again, this is like saying nobody should use Gelman’s Stan program to analyze data because one application resulted in a false conclusion. Nonsensical, unscientific, and clearly a mistake that only Reviewer B would make, because the goal is not to advance science but to be a nasty reviewer for reasons that remain unknown (e.g., sexual frustration, a failed grant application, realizing that academia is a waste of time, no hobby, etc.).
How I respond to valid criticism
Let me show how I respond to valid concerns. Yes, in the specific scenario picked by van Zwet, z-curve.2.0 was overconfident and produced confidence intervals that were too narrow and missed the true value more often than a 95% confidence interval should, namely more than 5 out of 100 times. That is a valid criticism of z-curve.2.0.
I was already working on improving z-curve. Using van Zwet’s scenario, I was able to use information in the data to alert z-curve to scenarios that provide little information about the expected discovery rate (in van Zwet’s own simulation, 40% of the data contained absolutely no information). I tested z-curve.3.0 with van Zwet’s scenario, and the confidence interval contained the true value in 99 out of 100 simulations. Thus, the new confidence intervals provide accurate information about the lack of information about the EDR in the data.
Of course, z-curve is not magic. As the plot shows, the EDR is an estimate of the distribution of non-significant results based on only the significant results. When there are few informative z-values just above the significance criterion (z = 1.96 to 2.96), the EDR cannot be estimated. Z-curve.3.0 recognizes this and gives a wide confidence interval that ranges from 15% to 98%. This is informative because it tells users that the EDR cannot be estimated and the point estimate cannot be trusted. In other situations and with larger sets of studies, the confidence interval will be narrower and more informative.
In short: z-curve.2.0 is dead. Long live z-curve.3.0.
Now, this is how you respond to valid concerns and demonstrations of errors. You learn from them and fix them. That is how real science advances, and it is how z-curve has been developed, evaluated, and improved for over 10 years.
Waiting for Gelman and van Zwet’s Response to this Criticism
It will be interesting to see how van Zwet and Gelman respond to this criticism of their criticism. The ladder of responses is clear and now also includes pointing out errors in my response or in z-curve.3.0. In the age of preregistration, let me preregister my prediction.
4. Avoid looking into the question, ignore the possible error, act as if it had never happened, and keep making the same mistake over and over.
I hope this prediction turns out to be a mistake that I will be happy to correct when proven wrong.
Andrew Gelman is a statistician at Columbia University. He maintains a blog where he shares his opinions about many topics, including the replication crisis in psychology and related fields like behavioral economics. He is not an expert in either field, but that does not prevent him from evaluating the research in these areas. You do not need to read any specific blog post by him, because the result is often the same: the research is not credible, sample sizes are too small, studies are selected for significance, and meta-analyses are not trustworthy. In his favorite area of statistics, which uses prior assumptions to make sense of actual data, this is known as a dogmatic prior. No amount of data will reverse the conclusion that is already implied by a dogmatic prior. So, you really do not need data.
As you may have guessed, I don’t like the guy. I think he is a jerk, and that may cloud my evaluation of him. However, I do have data to support my claim that Gelman’s statements often reflect his prior assumptions and are immune to data. He says so himself on his blog.
After discussing some problems with a meta-analysis of nudging studies (a Nobel prize winning idea in behavioral economics), Gelman writes:
Just to be clear: I would not believe the results of this meta-analysis even if it did not include any of the above 12 papers, as I don’t see any good reason to trust the individual studies that went into the meta-analysis. It’s a whole literature of noisy data, small sample sizes, and selection on statistical significance, hence massive overestimates of effect sizes.
What are small sample sizes (some of these studies have hundreds of participants)? Where is the evidence that selection leads to MASSIVE overestimation? Gelman has no answers to such scientific questions about the evidence because he does not care about the data. His prior is sufficient to dismiss an entire literature, not just a few bad studies.
Did I cherry-pick this example? Should you trust me? To answer these questions, you can use AI that can read Gelman’s blog within seconds. Ask it to share one of his blog posts where he reversed a prior belief in response to empirical data. I am waiting.
The problem is not that Gelman is opinionated and shares his opinions on a blog (some people may say that is also true of myself). The problem is that he has blind followers who seem to confuse believing Gelman’s opinions with meta-science. Actual understanding of problems in science requires investigating these problems with empirical methods and drawing conclusions from data, not believing in conclusions that rest on unproven assumptions.
I am all in favor of open science and a critic of closed pre-publication peer review. The downside of open communication is that there is no quality control, and internet searches will amplify misinformation. This is the case with Erik van Zwet’s critique of z-curve. Even though I addressed his criticisms in the comment section, search engines, like humans, do not scroll to the end and process all information. I have also addressed concerns about z-curve.2.0 by improving z-curve.3.0 to handle edge cases like the one van Zwet used to cast doubt on z-curve’s performance in general. In science, facts should trump visibility. Z-curve has been validated with many simulations across a wide range of scenarios and works well even with just 50 significant z-values. For more information, check out the Replication Index blog or the FAQ about z-curve page.
The bias in the Bing (AI) summary is evident when we compare it to the Google search summary. The Google summary still makes a false claim about assumptions based on Erik van Zwet’s blog post, but it avoids dismissing the method based on a single edge case that was easy to address and is no longer a concern in the new z-curve.3.0. In short, don’t trust the first generic response of AI. Use AI to probe arguments.
Power Failure, False Positives, and The Replication Crisis
Scientists have become increasingly skeptical about the credibility of published results (Baker, 2016). The main concern is that scientists were presenting results as objective facts, while the results were often influenced by undisclosed subjective decisions that increased the chances of presenting a desirable result. These degrees of freedom in analyses are now called questionable research practices or p-hacking.
Ioannidis (2005) showed with hypothetical scenarios that questionable research practices combined with low statistical power and testing of many false hypotheses could lead to more false than true discoveries of statistical regularities (i.e., a statistically significant result).
Awareness of this problem has produced thousands of new articles discussing it. It has even created a new science called meta-science: the scientific study of science. Some articles have gained prominent status and are foundational to meta-science.
For example, the Reproducibility Project in psychology replicated 100 studies. While 97 of these studies reported a statistically significant result, only 36% of the replication studies showed a significant result. The drop in the success rate can be attributed to questionable research practices that inflated effect size estimates to achieve significance. Honest replications did not have this advantage, and the true population effect sizes were often too small to produce significant results.
The true probability of obtaining a statistically significant result is called statistical power (Cohen, 1988; Neyman & Pearson, 1933). In the long run, a set of studies with average true power of 50% is expected to produce 50% significant results, even if all studies test different hypotheses (Brunner & Schimmack, 2020). Thus, the success rate of the Reproducibility Project implies that the replication studies had about 40% average power. As these studies replicated original studies as closely as possible (with similar sample sizes), this suggests that the average power of the original studies was also around 40%.
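The claim that the expected percentage of significant results equals mean true power, even under heterogeneity, can be checked with a short simulation (my own sketch with arbitrary numbers, not the Reproducibility Project data); the same logic underlies the 40% estimate discussed next.

```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(7)

powers = rng.uniform(0.10, 0.90, size=10_000)      # heterogeneous true power, mean ~ .50
z_crit = stats.norm.ppf(0.975)
ncp = z_crit - stats.norm.ppf(1 - powers)          # noncentrality implied by each power value
z = rng.normal(ncp, 1.0)                           # one observed z-value per study
print(round(powers.mean(), 2), round((np.abs(z) > z_crit).mean(), 2))   # both ~0.50
```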
This estimate is in line with Cohen’s (1962) seminal estimate of power. Average power around 40% has two implications. First, many attempts to demonstrate an effect in a single study will fail to reject a false null hypothesis that there is no relationship; a false negative result (Cohen, 1988). Concerns about false negatives were the focus of meta-scientific discussions about significance testing in the 1990s (Cohen, 1994).
This shifted when meta-scientists pointed out the consequences of selection for significance and low power (Ioannidis, 2005; Rosenthal, 1979; Sterling et al., 1995). Low statistical power combined with questionable research practices could result in many false discoveries (i.e., statistically significant results without a real effect). In some scenarios, literatures could be entirely made up of false discoveries (Rosenthal, 1979) or at least contain more false than true discoveries (Ioannidis, 2005).
Theoretical articles and simulation studies suggested that false positive rates might be uncomfortably high and replication failures seemed to support this suspicion, although replication failures could also just be false negative results (Maxwell, 2016). Thus, actual replication studies often do not settle conflicting interpretations of the evidence. While some researchers see replication failures as evidence that original results cannot be trusted, others point towards the difficulty of replicating actual studies and false negatives as reasons why original results could not be replicated (Gilbert et al., 2016).
An alternative approach examines false positives for sets of studies rather than a single study. The statistical results in original articles are used to estimate the average power of studies, and power is then used to evaluate the risk of false positive results. One of the first attempts to do so was Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson, and Munafò’s (2013) article “Power failure: why small sample size undermines the reliability of neuroscience.” The key empirical finding was that the median power of 730 studies from 49 meta-analyses was 21%. The article did not provide an empirical estimate of the false positive rate, but it did illustrate implications of the power estimate for false positive rates in various scenarios. The authors suggested that “a major implication is that the likelihood that any nominally significant finding actually reflects a true effect is small” (p. 371). This claim has contributed to concerns that many published significant results are unreliable.
Reexamining The Power Failure
More than ten years later, it is possible to revisit the seminal article with the benefit of hindsight. Advances in the estimation of true power have revealed important conceptual differences between estimating the true power of completed studies and computing hypothetical power for the purpose of sample size planning (Brunner & Schimmack, 2020; Soto & Schimmack, 2026).
Cohen (1988) defined statistical power as the probability of obtaining a significant result. In the context of sample size planning, however, power is defined as the probability of obtaining a significant result given a hypothetical population effect size greater than zero. This conditional definition of power given a true hypothesis is widely used in the power literature and was also used by Ioannidis (2005) in his calculations of false positive rates.
Assuming only true hypotheses to compute power is reasonable for hypothetical scenarios, but not for the estimation of the true power of completed studies. As the population effect size remains unknown after a study has produced an effect size estimate, it is not possible to assume an effect size greater than zero. Thus, the true probability that a completed study produces a significant result is unconditional and independent of the distinction between H0 and H1. Any estimate of average true power is therefore an estimate of the unconditional probability of producing a significant result. This average can include tests of true null hypotheses.
The distinction between conditional and unconditional probabilities has important implications for Button et al.’s calculations of false positive rates. The median power of 21% is unconditional, but the false positive calculations assume conditional power. This can lead to inflated estimates of false positive rates. For example, mean power of 20% could be made up of 50% tests of true H0 with a 5% probability of producing a (false) significant result and 50% tests of H1 with 35% power. In this scenario, the false positive rate is 2.5% / (2.5% + 17.5%) = 12.5%. Increasing the ratio of false to true hypotheses (H0:H1) to 4:1 would require conditional power of 80% for tests of H1 to maintain 20% average power, and the false positive rate would increase to .04 / (.04 + .16) = 20%. As noted by Soric (1989), we can even compute the maximum false positive rate that is consistent with a given unconditional mean power by assuming conditional power of 1. With unconditional mean power of about 20%, the maximum ratio of H0 to H1 is 5.25:1 and the maximum false discovery rate is about 21% (Table 1).
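The two mixture scenarios can be verified with a few lines of code (my own restatement of the arithmetic above, not part of Button et al.’s analysis):

```python
def fdr_and_discovery_rate(prop_h0, power_h1, alpha=0.05):
    false_pos = prop_h0 * alpha                 # significant results from true null hypotheses
    true_pos = (1 - prop_h0) * power_h1         # significant results from true effects
    return false_pos / (false_pos + true_pos), false_pos + true_pos

print(fdr_and_discovery_rate(0.5, 0.35))   # ~ (0.125, 0.20): 12.5% FDR at 20% mean power
print(fdr_and_discovery_rate(0.8, 0.80))   # ~ (0.20, 0.20): 20% FDR at 20% mean power
```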
Table 1
Maximum False Discovery Rate for 20% Unconditional Power (Soric, 1989)
                     Not Significant    Significant    Total
H₁ True                   .000              .160        .160
H₀ True                   .798              .042        .840
Total                     .798              .202       1.000

H₀ : H₁ Ratio: 5.25 : 1
False Discovery Rate: .208
Note. The table shows the maximum false discovery rate when average unconditional power equals 20%. This maximum occurs when conditional power for true hypotheses (H₁) equals 100%. The false discovery rate equals the proportion of significant results that are false positives: .042 / .202 = .208. Any lower conditional power with the same unconditional power of 20% produces a lower false discovery rate.
This maximum false discovery rate of about 21% overestimates the true false positive rate for two reasons. First, Soric’s formula assumes that true hypotheses (H1) are tested with 100% power. Because many tests of small true effect sizes in small samples have low conditional power, the true false positive rate is below this maximum. Second, unconditional power has a skewed distribution with many low-power studies and a few high-power studies. As a result, mean power will be higher than median power. Button et al. provide information about mean power based on their analysis of publication bias, which relies on mean power. This analysis suggested that 254 of the 730 studies were expected to produce a significant result, and the expected percentage of significant results is equivalent to mean power (Brunner & Schimmack, 2020). Thus, mean power was estimated to be 254 / 730 = 35%. Based on Soric’s formula, the maximum false discovery rate with 35% significant results is 10%.
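For reference, Soric’s upper bound on the false discovery rate depends only on the proportion of significant results (the discovery rate) and alpha. The sketch below is my restatement of that formula and reproduces the values used in the text and in Table 1.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate for a given discovery rate."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

print(round(soric_max_fdr(0.202), 3))   # ~0.208, the maximum shown in Table 1
print(round(soric_max_fdr(0.35), 2))    # ~0.10, the bound implied by 35% mean power
```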
In conclusion, Button et al.’s estimate of unconditional mean power can be used to draw inferences about false positives in the meta-analyses they examined without relying on unknown ratios of true and false hypotheses being tested in neuroscience. Using their data and Soric’s formula suggests that the false positive risk is fairly small.
A Z-Curve Analysis of Button et al.’s Data
Button et al.’s article contributed to a culture of open data sharing, but open sharing was not yet the norm when the article was published. Fortunately, Nord et al. (2017) conducted further analyses of the data and shared power estimates for the 730 studies in an Open Science Framework (OSF) project. The power estimates do not use the effect sizes of individual studies. Rather, they use the sample sizes and the meta-analytic effect size to estimate power. This approach corrects for effect size inflation in smaller studies and reduces bias in power estimates; power estimates based on individual studies would likely be inflated by publication bias. The following analyses used these data.
Based on these data, 28% of the studies were statistically significant. Mean power was 35%, matching Button et al.’s estimate of mean power, suggesting that Nord et al.’s power values are based on meta-analytic effect sizes.
I converted power values into z-values and analyzed the z-values with z-curve.3.0 using the default model (Figure 1).
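For readers who want to see what such a conversion can look like, here is one plausible mapping from a power value to the implied (absolute) noncentral z-value, assuming a two-sided test with alpha = .05. This is an illustrative assumption on my part; the exact conversion used for Figure 1 may differ.

```python
from scipy import stats

def power_to_z(power, alpha=0.05):
    z_crit = stats.norm.ppf(1 - alpha / 2)          # ~1.96
    return z_crit - stats.norm.ppf(1 - power)       # noncentral z implied by the power value

print(round(power_to_z(0.35), 2))   # ~1.57
print(round(power_to_z(0.80), 2))   # ~2.80
```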
The observed discovery rate (ODR) is simply the percentage of significant results. More important is the bias-corrected estimate of unconditional mean power for all 730 z-values. Z-curve uses the observed distribution of significant z-values and projects the fitted model into the range of non-significant results. As shown in Figure 1, the model predicts the actual distribution of non-significant results fairly well. This suggests that the use of meta-analytic effect sizes corrected inflated effect size estimates and removed publication bias. The estimated mean power for all studies is called the expected discovery rate (EDR). The EDR estimate is close to the ODR, further suggesting that the data are unbiased.
A key problem of estimating the EDR based on the significant results alone is that the confidence interval around the point estimate is very wide. When the data show no major bias, more precise estimates can be obtained by fitting the model to all 730 data points (Figure 2).
The key finding is that the point estimate of the false positive risk, FDR = 13%, is in line with calculations based on Button et al.’s estimate of mean power. The upper limit of the confidence interval puts the FDR at 20%. This is an upper limit because conditional power of studies with significant results is likely to be less than 100%.
In fact, z-curve makes it possible to estimate the conditional power of significant studies. First, z-curve estimates the unconditional average power of significant studies. This parameter is called the expected replication rate (ERR) because it predicts how many studies would produce a significant result again in a hypothetical replication project that reproduces the original studies exactly with new samples. The ERR is 54% with an upper limit of 60% for the 95% confidence interval. We also know that no more than 20% of these studies are false positives. Assuming 80% true hypotheses, the average conditional power cannot be higher than (.60 − .20 × .05) / .80 = 74%. Thus, Soric’s assumption of 100% power is conservative, and the false positive rate is likely to be lower.
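The bound can be checked with a few lines (my own restatement of the arithmetic above):

```python
err_upper, prop_false, alpha = 0.60, 0.20, 0.05
max_conditional_power = (err_upper - prop_false * alpha) / (1 - prop_false)
print(round(max_conditional_power, 2))   # ~0.74
```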
In conclusion, a z-curve analysis of Nord et al.’s power estimates for Button et al.’s meta-analyses confirms estimates that could have been obtained by applying Soric’s formula to Button et al.’s estimate of mean power. The true rate of false positive results remains unknown, but it is unlikely to be more than 20%.
Heterogeneity Across Research Areas
Nord et al. (2017) demonstrated that power varies across the different research areas included in Button et al.’s sample of meta-analyses. Some of these areas had enough studies to conduct separate z-curve analyses. The most interesting area is candidate-gene studies, which relate genotypic variation in single genes to phenotypes across participants. With the benefit of hindsight, it is known that variation in a single gene has trivial effects on complex traits and that many of the significant results in these studies were practically false positive results (Duncan & Keller, 2011). 234 of the 730 studies were from this research area. Figure 3 shows the results. Interestingly, only 11% of the results were statistically significant. Thus, the low average power can be explained by the many studies that reported non-significant results. There is no evidence of publication bias in these meta-analyses.
Using Soric’s formula, the low EDR translates into a high false positive risk of 42%, and the upper limit of the 95% confidence interval includes 100%. Thus, z-curve confirms that the rare significant results in this literature could be false positive results. Most significant results are also just barely significant. There are hardly any results that show strong evidence (z > 4) against the null hypothesis.
In short, a large portion of the 730 studies came from a research area that is known to have produced few significant results. This finding implies that other research areas are producing more credible significant results (Nord et al., 2017).
A second set of meta-analyses were clinical trials. Clinical trials have received considerable attention in analyses of Cochrane meta-analyses and of abstracts in original articles that often report the key statistical result (Jager & Leek, 2013; Schimmack & Bartos, 2023; van Zwet et al., 2024). The results suggest that unconditional mean power is around 30% and the false positive risk is between 10% and 20%. These results serve as benchmarks for the z-curve analysis of the 145 clinical trials in Button et al.’s study (Figure 4).
The EDR is somewhat lower, 21%, but the 95% confidence interval includes 30%. The FDR is 19%, but the lower limit of the confidence interval includes 13%. Thus, the results are a bit lower, but mostly consistent with evidence from estimates based on thousands of results. These estimates of the FDR are notably lower than the false positive rates that were predicted by Ioannidis’s scenarios that assumed high rates of true null-hypotheses.
The third domain was studies from psychology. Psychological scientists have examined the credibility of their research in the wake of replication failures (Open Science Collaboration, 2015). Suddenly, exclusively significant results across multiple studies within a single article were no longer attributed to reliable effects but were seen as signs of selection for significance (Schimmack, 2012). Francis (2014) found that over 80% of these multi-study articles showed statistically significant evidence of bias. Large-scale multi-lab replication studies also showed that effect size estimates in original studies could be inflated by a factor of 10, shrinking effect sizes from d = .6 to d = .06 (Vohs et al., 2019). A z-curve analysis of a representative sample of studies in social psychology estimated average unconditional power before selection for significance as EDR = 19%, implying FDR = 22%. Cohen (1962) already reported similarly low power estimates. Estimates are also similar for focal and non-focal results, as shown in a survey of emotion research (Soto & Schimmack, 2024), which reported an EDR of 30% and a corresponding FDR of 12% (k sig = 21,628) for all automatically extracted tests, and an EDR of 27% with FDR = 14% for hand-coded focal tests (k sig = 227). These results serve as a comparison standard for the z-curve analysis of the 145 studies classified as psychological research by Nord et al. (2017). The EDR is 49%, FDR = 5%. Even the lower limit of the EDR confidence interval, 39%, implies only 8% false positives among the significant results.
There are several reasons why these results differ from other findings. First, the focus on meta-analyses leads to an unrepresentative sample of the entire literature. Meta-analyses often include a lot more non-significant results and have less bias than original articles. Second, the specific set of meta-analyses was not representative of the broader literature in psychology. Thus, the results cannot be generalized from the specific studies in Button et al.’s sample to psychology or neuroscience. That would require representative sampling or collecting data from all studies using automatic extraction of test statistics.
Discussion
Button et al.’s (2013) article was a first attempt to assess the credibility of empirical results with empirical estimates of power based on meta-analytic effect sizes and sample sizes. The median power estimate was low (21%). The key implication of this finding was that researchers often fail to reject null hypotheses and may use questionable research practices to report significant results in published articles. Low power and bias could lead to many false positive results. The article added to other concerns about the reliability of findings in neuroscience (Vul et al., 2009).
Most citations took Button et al.’s findings and implications at face value. Nord et al. (2017) pointed out that power and false positive rates varied across research areas. Most notably, candidate gene studies have lower power and a much higher false positive risk. Including these studies in the calculation of median power may have led to false perceptions of other research areas.
Here I presented the first serious critical examination of Button et al.’s methodology and inferences and found several problems that undermine their pessimistic assessment of neuroscience. First, they estimated unconditional power, but their false positive calculations require estimates of conditional power. Second, false positive rates depend on mean power, not median power. Mean power was 35%, which is close to the estimate for psychology based on actual replication studies (OSC, 2015). Third, they made unnecessary assumptions about the ratio of true and false hypotheses being tested, when unconditional power alone is sufficient to estimate false positive rates (Soric, 1989). Fourth, they relied on meta-analyses to correct for publication bias, but meta-analyses are not representative of the broader literature.
Meta-science is like other sciences. Ideally, critical analyses reveal problems and new innovations address these problems. Power estimation started in the 1960s with Cohen’s seminal article. Cohen (1962) worked with plausible effect sizes but did not aim to estimate studies’ true power. Moreover, his work and statistical power were largely ignored (Cohen, 1990; Sedlmeier & Gigerenzer, 1989).
Conclusion
The replication crisis stimulated renewed interest in methods that use observed results to draw inferences about the power of actual studies (Ioannidis & Trikalinos, 2007; Francis, 2014; Schimmack, 2012; Simonsohn, Nelson, & Simmons, 2014). This work shifted attention from prospective power calculations to the retrospective assessment of evidential strength in published literatures. Two challenges emerged as central. First, selection bias inflates the observed rate of significant results, requiring methods that correct for selection. Second, power varies across studies, requiring models that allow for heterogeneity rather than assuming a single common effect size or power level. Early approaches addressed selection under simplifying assumptions, typically treating power as homogeneous across studies. As a result, their inferences become unreliable when studies differ in sample size, effect size, or both (Brunner & Schimmack, 2020; Schimmack, 2026).
Z-curve extends this line of work by explicitly modeling both selection and heterogeneity, estimating a distribution of power across studies rather than a single average. This provides a framework for quantifying key properties of the literature, including expected discovery and replication rates, and for linking these quantities to false discovery risk (Sorić, 1989). In this sense, z-curve represents a substantive advance in the empirical assessment of the credibility of published findings. Like earlier contributions such as Button et al., it is unlikely to be the final word, but it is currently the most advanced method to estimate true power for sets of studies with heterogeneity in power and selection bias.
The latest World Happiness Report gives Jonathan Haidt a megaphone to continue his narrative that decreasing wellbeing among young people can be blamed nearly entirely on social media use (Chapter 3). Chapter 4 shows how assessments of the evidence are biased; the (US) American Psychological Association (APA) is the most biased, but the (US) Surgeon General’s report is not much better. Policy is made based on biased readings of the evidence (fortunately, 16-year-olds will find ways around it, just as they watched R-rated movies in the old days).
Chapter 3
Chapter 3 is written by Haidt, and the website gives a helpful warning that it is a 61-minute read. That is like asking somebody to listen to 24 hours of Fox News to find out how they misrepresent everything to support a criminal president of the United States, where young people are getting less happy. I do not have time for that. Instead, I asked Claude (not war-supporting ChatGPT) to summarize and evaluate the chapter. Importantly, this is not generic Claude. This is a Claude project that knows everything about SWB that I have written in my textbook on this topic. Yes, unlike Haidt, I have studied SWB for 30 years. Is it unbiased? No. But it is an antidote to Haidt’s noise machine.
My favorite quote from Claude’s review: “The chapter is extraordinarily effective as persuasion and deeply problematic as science.” I do not have the scientific answer, but I know that social media are at best only a small piece of the puzzle of why young people in Canada and the USA are suffering. Distracting from these problems by blaming social media is problematic, and we need to find real answers to create real solutions to this problem.
The Structure
The chapter is organized as a legal prosecution — they literally say “we are making the case for the prosecution” and ask readers to use “preponderance of the evidence” as the standard. They present seven “evidence lines” organized as 15 “exhibits”:
Lines 1–3 (Testimony): Surveys of young people showing regret, surveys of parents/teachers/clinicians expressing concern, and leaked internal corporate documents showing companies knew about harms.
Lines 4–7 (Academic evidence): Cross-sectional studies showing heavy users have elevated depression risk, longitudinal studies showing social media at Time 1 predicts depression at Time 2, RCTs showing reduction in social media improves mental health, and natural experiments showing broadband rollout worsened mental health.
They then extrapolate to population-level harm estimates (millions of US teens affected) and argue this is sufficient to answer the “historical trends question” — that social media caused the youth mental health decline.
The Rhetorical Strategy
The chapter is extraordinarily effective as persuasion and deeply problematic as science. The courtroom framing is the tell. In a courtroom, the prosecution presents one side of the case. The defense presents the other. Haidt has cast himself as prosecutor and explicitly asks readers to lower their evidentiary standard. He then preemptively addresses critics not by engaging with their arguments substantively, but by saying their findings support his case once “unblended.”
The “unblending” argument is his central methodological move: whenever a study finds null or small effects, Haidt argues this is because the researchers combined too many outcomes, populations, or technologies. When you restrict to girls, to internalising symptoms, and to social media specifically, the effects get larger. This is a legitimate analytical point — but it’s also a form of specification searching. You can always find larger effects by narrowing the sample and outcome to where the signal is strongest. The question is whether those specifications were preregistered or selected post hoc.
Critical Problems from Your SWB Framework
1. No personality controls anywhere. Not a single study Haidt cites controls for Neuroticism or Depressiveness. Your Chapter 7 work shows these facets explain ~50% of SWB variance. A high-Neuroticism adolescent girl is simultaneously more likely to use social media heavily (rumination, reassurance-seeking), report depression, report body image problems, and perceive social media as harmful. Without personality controls, every “line of evidence” is confounded by the same omitted variable.
2. The testimony evidence is circular. Lines 1–3 amount to: people believe social media is harmful. But people’s causal attributions about their own mental health are unreliable — that’s one of the core lessons of your measurement chapter. If you asked depressed people in the 1990s what caused their depression, many would have blamed television, or music, or whatever was culturally salient. The fact that Meta’s own employees believed their products were harmful is concerning, but it’s still testimony about beliefs, not evidence of causation. Haidt himself acknowledges victims “could be mistaken” but then proceeds to treat the testimony as though it establishes the case.
3. The RCT evidence is weaker than presented. The headline finding is g = 0.19 for depression from social media reduction experiments. But these are short-term studies (often 1–3 weeks) with self-selected participants who know they’re in a social media reduction study. Demand effects are enormous. If you tell someone “we’re testing whether social media is bad for you, please reduce your usage,” the treatment group has a strong expectation of feeling better. There’s no placebo control for that expectation. Compare this to the evidence standards your textbook applies: lottery studies for money effects, migration quasi-experiments for cross-national differences. The RCT evidence here doesn’t come close.
4. The population extrapolations are absurd. Haidt takes an effect size from a 2-week reduction experiment on college students and extrapolates to “a decline in the prevalence of clinical depression by roughly one-third.” This assumes the experimental effect scales linearly to the population, that it would persist beyond two weeks, that college students are representative of adolescents, and that the experimental context (individual reduction while peers stay online) maps onto population-wide changes. None of these assumptions are justified.
5. The “historical trends question” remains unanswered by his own admission — then he answers it anyway. He repeatedly says the product safety question is easier than the historical trends question, and that his evidence is mainly about the former. But then he pivots and says the scale of individual harm “makes it more plausible” that social media caused the population trends. This is a non sequitur. Alcohol causes individual harm at massive scale too, but alcohol consumption hasn’t changed enough to explain any recent population trend in youth mental health.
6. He never mentions housing, employment, economic prospects, or any structural factor. In a 61-minute chapter organized as an exhaustive prosecution of social media, there is zero engagement with the alternative suspects you identified. No mention of housing affordability. No mention of the gig economy. No mention of declining real wages for young people. No mention of student debt. The word “economic” appears only in passing references to other studies’ control variables. This is exactly the bad criminal investigation you described — single suspect, no alibi checking.
7. The asymmetric framing of evidence standards is self-serving. Haidt argues that the cost of wrongly concluding social media is unsafe is low (kids just wait until 16), while the cost of wrongly concluding it’s safe is “catastrophically high.” This is a policy argument dressed up as an epistemological one. By the same logic, the cost of wrongly attributing the youth mental health crisis to social media — while ignoring structural economic causes — is also catastrophically high, because it diverts political attention from housing, labor market, and inequality reforms that could actually help.
Bottom Line
The chapter is a 61-minute advocacy document by someone who wrote a bestselling book with the same thesis and has enormous reputational stakes in being right. The evidence he presents is real but selectively curated, never contextualized against alternative explanations, and extrapolated well beyond what the data support. From the standards your textbook establishes — multi-method designs, halo bias correction, causal evidence hierarchies, personality controls — this chapter wouldn’t pass peer review in the form it’s presented.
Chapter 4
Alright, let me read the WHR26 Chapter 4 and evaluate it against the standards we’ve been applying to the other chapters. Night and day compared to the Haidt chapter.
This is a proper piece of meta-science rather than advocacy. The authors (this reads like Przybylski’s group) do exactly what a rigorous analysis should do: they ask why three organizations looking at the same literature reached different conclusions, and then systematically show the answer isn’t what evidence was selected but how it was synthesized and communicated.
The key findings are damning for the APA and Surgeon General but in a measured, empirically grounded way. Only 4 out of 617 unique papers appeared in all three reports — yet the methodological and thematic profiles of the cited evidence were statistically indistinguishable. So the divergence is entirely in the synthesis layer: citation accuracy, engagement with disconfirmatory evidence, acknowledgment of limitations, and calibration of certainty to what the evidence actually supports. NASEM comes out well; the APA comes out badly (20% self-citation by advisory panel members, citation clumping, outright mismatches between cited studies and the claims they’re supposed to support); the Surgeon General falls in between but leans toward rhetorical urgency over epistemic humility.
From your book’s perspective, several things stand out.
The chapter’s core argument — that the same evidence base can yield opposite policy conclusions depending on synthesis practices — is essentially a measurement-and-method-variance story applied to the policy level. It’s the same logic as your Chapter 3 telescope metaphor: the instrument (here, the synthesis process) determines what you see, and if the instrument is distorted, the conclusions are distorted. The chapter makes this point more carefully and with better evidence than anything in the Haidt chapter, which was itself an example of the very problem this chapter diagnoses.
The six quality indicators they identify (citation accuracy, contextualisation, limitation acknowledgment, engagement with disconfirmatory evidence, calibrated language, process transparency) are essentially a checklist for distinguishing science from advocacy — which maps directly onto your book’s concern about distinguishing empirical claims from conceptual positions. The Weiss “evidence as ammunition” framing is particularly apt.
The 17% causal-inference figure is striking and worth remembering. Only 59 of 355 on-topic studies used designs that could plausibly support causal claims. That’s the evidence base on which population-level policy is being demanded. Compare that with the causal evidence hierarchy in your Chapter 5 (migration quasi-experiments) and Chapter 6 (lottery studies) — those chapters insist on strong designs before drawing strong conclusions. The social media field is drawing strong conclusions from weak designs, and this chapter documents that quantitatively.
The self-critical limitations section is genuine rather than perfunctory — they flag that their single-coder qualitative analysis is a limitation, that they can’t access internal deliberative processes, and that their classification system produced some inconclusive codes. This is what intellectual honesty looks like in contrast to the other chapters.
Two observations that are more critical. First, the chapter is diplomatically careful to a fault. The APA findings — 20% self-citation, citation mismatches where the cited study literally doesn’t address the claim being made — are described as “concerning” and “worrying, even if rare.” In your blog voice, you’d call that what it is: shoddy scholarship that wouldn’t pass peer review*. The diplomacy is understandable given the WHR context, but it does somewhat blunt the force of findings that are genuinely alarming.
Second, the chapter doesn’t connect its findings back to the WHR’s own social media chapters. It evaluates external organizations’ synthesis practices but doesn’t turn the lens on whether the WHR itself — including the Haidt chapter we just read — meets these same six standards. That’s a missed opportunity, or perhaps a diplomatic impossibility given they’re in the same volume.
But overall: yes, much more balanced. This is what evidence-based analysis looks like when it’s done by people who take methodology seriously.
* Correction: This is not correct. I would not assume that peer review is quality control. Biased work constantly passes pre-publication peer review. This is the reason why we need open post-publication reviews like this one, just like we have thousands of book reviews that range in ratings from 1 to 5.
Chapter 5
This is Twenge’s chapter, and it’s a mixed bag — competent descriptive empirics wrapped in a rhetorical frame that does some of the very things Chapter 4 just criticized.
The strength is the data. PISA gives you nationally representative samples of 15–16-year-olds across 47 countries with the same measures, which is a genuine advantage over the US/Canada/UK-dominated literature the chapter itself flags. The regional breakdowns are useful, and the finding that the social media–life satisfaction association is essentially null for boys outside of English-speaking countries and Western Europe is important — it’s the kind of finding that complicates the “social media is harming youth” narrative rather than confirming it.
The curvilinearity point is well taken and the observation about greater variance among non-users and heavy users is genuinely interesting. Both non-users and heavy users show elevated rates of both very low and very high life satisfaction, which suggests these are heterogeneous groups — some non-users are thriving, some are isolated; some heavy users are socially engaged, some are compulsively scrolling. That’s a finding that resists simple policy prescriptions, and the chapter deserves credit for reporting it.
Now the problems.
The relative risk versus linear r argument is the chapter’s rhetorical centerpiece, and it’s doing a lot of work that isn’t fully warranted. Yes, it’s true that linear r is poorly suited for curvilinear associations, and the polio/aspirin/seatbelt analogies are vivid. But those analogies are misleading in a fundamental way: polio vaccination has a known causal mechanism, aspirin has RCT evidence, and seatbelts have physics. Social media use and life satisfaction have a cross-sectional correlation from a single time point. Relative risk sounds more impressive than r = .10, but repackaging a cross-sectional association as a relative risk doesn’t make it causal. A 50% increase in “risk” of low life satisfaction among heavy users is still a 50% increase in a cross-sectional association that cannot distinguish cause from selection. The chapter acknowledges this in one sentence near the end (“this research is correlational and, thus, cannot rule out reverse causation or third variables”) but spends several paragraphs building the rhetorical frame that makes the effects sound large and practically important before that caveat appears.
This is exactly the “calibrating certainty to conclusion strength” problem that Chapter 4 just documented. The chapter front-loads the impressive-sounding relative risk statistics and buries the causal limitations.
From your book’s measurement framework, several issues stand out. The social media measure is a single item asking about “browsing social networks” on a “typical weekday,” which is essentially asking adolescents to estimate their own screen time — precisely the measure the chapter’s own literature review acknowledges adolescents are poor at estimating (line 30 cites this limitation for the field, then proceeds to use exactly such a measure). The life satisfaction measure is a single 0–10 item. Both are self-reported by the same person at the same time. Your Chapter 3 telescope metaphor applies: we’re looking through a fairly blurry instrument here, and the chapter never discusses the validity limitations of these specific measures.
The response style point the chapter raises almost in passing (line 438 — some respondents may routinely choose extreme responses, linking heavy use and 10/10 satisfaction artifactually) is actually a serious methodological concern that deserves much more than a sentence. If extreme responding is a confound, it could explain the elevated very-high-satisfaction rates among heavy users — which is one of the chapter’s most interesting findings. The chapter identifies the problem and then moves on without grappling with it.
The absence of any control variables is glaring. No personality. No family income (the chapter acknowledges PISA lacks this). No in-person social interaction (also acknowledged). No school belonging — which is ironic given that the WHR’s own Chapter 3 found school belonging effects 4–6 times larger than social media effects. The chapter is essentially reporting raw bivariate associations between two self-report variables measured at a single time point, with no covariates, and then framing them in relative risk language that implies practical importance.
There’s also a notable asymmetry in how the chapter handles regional variation. When the association is significant (girls in Western Europe, English-speaking countries), it gets highlighted. When it’s null (boys in Asia, Latin America, Middle East/North Africa), it gets reported but with less interpretive weight. The null findings are actually the majority pattern for boys — in most of the world, the association between social media use and boys’ life satisfaction is essentially zero. A more balanced reading would lead with that finding: for most adolescent boys globally, there is no meaningful association between social media use and life satisfaction.
The self-citation pattern is worth noting. Twenge cites herself (Twenge & Hamilton 2022, Twenge & Farley 2021, Twenge & Martin 2020, Twenge et al. 2018) repeatedly — four of the roughly 20 references are her own work, including the paper that introduced the relative risk framing. This isn’t disqualifying, but given that Chapter 4 just flagged the APA’s 20% self-citation rate as a concern, it’s notable.
Bottom line: this chapter is more honest about the data than the Haidt chapter — it reports null findings for boys in most regions, it shows the curvilinear pattern, and it flags the variance issue among non-users and heavy users. But the rhetorical packaging oversells the findings. The relative risk framing makes cross-sectional associations sound like established health risks, the causal limitations are acknowledged but not given proportionate weight, and the absence of any covariate adjustment means we have no idea how much of these associations would survive basic controls for personality, socioeconomic status, or social engagement. Chapter 4’s own standards — citation accuracy, engaging with complexity, calibrating certainty to evidence strength — would give this chapter a middling grade: better than the APA and OSG reports, but not meeting the NASEM standard.
Personal note: If men’s happiness is decreasing nearly as much as women’s, while the social media effect is strongly gendered and often does not show up for males, this directly points to other factors that decrease happiness for young people. The same line of reasoning was used to find out that bad air quality was not the cause of lung cancer: men got lung cancer, women did not, and now we know that the reason was that men were smoking and women were not.
Chapter 6
Sunstein’s chapter is the most intellectually interesting in the social media section. The “product trap” concept — people who would demand money to quit TikTok individually but would pay to have everyone quit simultaneously — is a genuine insight about coordination failures in network goods. The party analogy is effective and the preference reversal is well-documented.
But three problems undermine the conclusions.
First, the entire chapter rests on three studies, two involving the author himself. That’s an essay, not an evidence review. The “Key Insights” box presents sweeping conclusions (“if social media platforms did not exist, many users would be better off”) that outrun a three-study base.
Second, Sunstein acknowledges that both his WTP and WTA measures are unreliable — low WTP may be “protest answers,” high WTA reflects the standard endowment effect — and then draws welfare conclusions from them anyway. If your thermometer is broken in both directions, you can’t read the temperature.
Third, and most fundamentally: there’s no baseline. The entire argument — people use it compulsively, wouldn’t pay for it, recognize it as time-wasting, and are modestly better off without it — describes television in 1975. Americans watched 6+ hours daily, wished they watched less, and the few reduction studies showed small wellbeing gains. Nobody concluded TV should be abolished. Sunstein never demonstrates that social media is uniquely trapping compared to every previous generation’s Wasting Time Good. The coordination failure he documents is a feature of any network good — you could run the Bursztyn experiment on email or mobile phones and probably get similar results. The question isn’t whether network effects create traps; it’s whether this trap is worse than its predecessors. The chapter never asks.
Finally, my question. The Economist published a figure based on the WHR results showing that the Anglo nations are declining in happiness and diverging from happy Scandinavia. That is the real story in the data. So why is the report about social media and not about the real trend in the data that needs to be examined? Is social media a cover-up to distract from real problems in Angloland?
P-curve is a statistical tool that was designed to evaluate the statistical credibility of significant results. When only significant results are published, it is unclear how much selection for significance contributed to the results. In the worst case scenario, all published results are false positives. P-curve uses a variety of approaches to test this worst case scenario. If the null-hypothesis can be rejected, the data are said to have evidential value; that is, at least some of the studies rejected a false null-hypothesis.
P-curve was published without extensive validation research. Critical examination of the method has focused on the estimate of average power (Brunner, 2018; Brunner & Schimmack, 2020). Average power quantifies the strength of evidence against the null hypothesis rather than merely rejecting the null hypothesis of no evidential value. For example, a set of studies could have 18% average power, suggesting that some significant results were true positives, but also showing that this literature has many studies with low power.
The problem with p-curve is that, contrary to claims by its developers, it produces inflated estimates of power when studies vary in power. For example, it predicts that 91% of replications should have been successful in the reproducibility project (Open Science Collaboration, 2015), when only 36% of the actual replications were successful. This bias is expected given the large heterogeneity in power across these studies (Schimmack & Soto, 2026). A solution to this problem is to use z-curve (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). Z-curve is explicitly designed for heterogeneous data and performs well with low and high heterogeneity (Schimmack & Soto, 2026).
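The reason a homogeneous model inflates power estimates can be seen in a few lines of R. The sketch below is my own toy illustration, not the actual p-curve algorithm: it fits a single noncentrality parameter to significant z-scores drawn from a heterogeneous mixture and compares the implied power with the true mean power of the selected studies.

```r
set.seed(456)

# Heterogeneous literature: 60% low-powered studies, 40% high-powered studies
ncp <- sample(c(1, 3.5), size = 50000, replace = TRUE, prob = c(.6, .4))
z   <- rnorm(length(ncp), mean = ncp, sd = 1)
sel <- z > 1.96                      # only significant results are observed
zs  <- z[sel]

# True mean power of the selected studies (what exact replications would show)
pow      <- pnorm(ncp - 1.96)        # one-sided for simplicity
true_err <- mean(pow[sel])

# Homogeneous fit: maximum likelihood for a single truncated normal,
# i.e., the assumption that all selected studies share one power value
negll <- function(mu) {
  p_sig <- 1 - pnorm(1.96, mean = mu, sd = 1)   # P(significant | mu)
  -sum(dnorm(zs, mean = mu, sd = 1, log = TRUE) - log(p_sig))
}
mu_hat   <- optimize(negll, interval = c(0, 6))$minimum
homo_err <- pnorm(mu_hat - 1.96)

round(c(true = true_err, homogeneous = homo_err), 2)
# The single-power estimate comes out noticeably higher than the true value.
```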
Morey and Davis-Stober (2025) raised further concerns about the statistical properties of p-curve. Given the similar aims of p-curve and z-curve, it is reasonable to wonder whether z-curve suffers from some of the same problems as p-curve, despite its ability to handle heterogeneity well. I asked Claude AI to examine this question and it concluded that z-curve is built on a fundamentally different approach than p-curve that avoids many of p-curve’s pitfalls. Here is a summary of the evaluation.
Not applicable — z-curve targets power, not effect size
P-curve’s problems go beyond heterogeneity
The most fundamental problem is inadmissibility of the core test of evidential value (EV). The core test — the version currently in the p-curve app — uses a probit transformation that produces a concave acceptance region in the test statistic space. By results from Birnbaum (1954) and Marden (1982), this makes the test inadmissible: its power is dominated by other tests for every possible alternative, including the homogeneous case. The 2015 switch from the log to the probit transformation was motivated by wanting robustness to extreme values, but admissibility requires exactly the property that was engineered out — sensitivity to large individual test statistics.
The compound half p-curve rule introduces nonmonotonicity: increasing the evidence in a single study can flip the procedure from rejection to acceptance and back, multiple times, along a monotonically increasing path. This is a purely structural consequence of the hard boundary at αpc/2 combined with the probit transform, and has nothing to do with whether effect sizes are heterogeneous.
Test LEV, which is supposed to detect “lack of evidential value,” has an additional pathology: arbitrarily large test statistics contribute zero weight to the sum, because they map to log(1) = 0. A single study with a p value just below 0.05 can dominate the test and force rejection regardless of how large every other test statistic is. Six studies with Z = ∞ plus one study at Z = 1.97 yields the same test statistic as six studies at Z = 1.97.
None of these problems affect z-curve. Z-curve uses EM estimation on a mixture of truncated normal distributions, fitting the full shape of the observed z-score distribution above the significance threshold. Large z-scores contribute information proportional to their posterior weight on high-NCP components. The EM likelihood surface is smooth and does not blow up near the truncation boundary. There is no compound decision rule. And because z-curve’s target quantities are replicability (ERR) and discovery rate (EDR) — both functions of noncentrality parameters — there is no conflation of power with effect size.
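As a concrete illustration of that last point, here is a small R sketch with a hypothetical grid of component means and made-up weights (my own arithmetic in the spirit of z-curve, not output of the z-curve software): once the mixture weights are fixed, each component's power follows from its noncentrality parameter, the discovery rate is the weighted mean power before selection, and the replication rate re-weights each component by its probability of producing a significant result in the first place.

```r
crit <- qnorm(1 - .05 / 2)                   # 1.96 for two-sided alpha = .05

mu <- 0:6                                    # hypothetical component means (NCPs)
w  <- c(.30, .20, .20, .15, .10, .03, .02)   # made-up weights, sum to 1

pow <- pnorm(mu - crit) + pnorm(-mu - crit)  # power of each component

edr <- sum(w * pow)                   # expected discovery rate (before selection)
err <- sum(w * pow^2) / sum(w * pow)  # mean power after selection for significance

round(c(EDR = edr, ERR = err), 2)     # ~0.43 and ~0.73 with these weights
```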
The Morey and Davis-Stober paper does not mention z-curve. It does not need to. Their formal results simply confirm, from a different direction and with different tools, what simulation studies have shown for years: p-curve’s statistical machinery is not up to the job it advertises. Z-curve was designed from the start to avoid exactly these pitfalls.
In short, z-curve is not just another p-curve. While the aims are similar, the statistical approach and the ability to handle realistic amounts of heterogeneity are very different. Morey and Davis-Stober’s critique is limited to p-curve and does not generalize to z-curve.
Publish or perish. I heard this in the 1990s, but it is even more true today. Submitting manuscripts for publication has gotten easier, too. It cost me real money to mail three copies of a manuscript from Germany to the United States (Schimmack, 1996). Now you just need to check all the boxes on a submission portal. Not an easy task, but virtually cost-free.
This system is like a lottery where tickets are cheap and winnings can be rewarding. No wonder authors are playing the lottery and submitting manuscripts in large numbers, even if the chances of rejection are high. Maybe journals should charge for submissions rather than for publications.
Anyhow, I just reviewed a manuscript in 30 minutes. It was conceptually flawed. More importantly, my own AI – trained on this area of research – also spotted the conceptual problem, along with several other issues that I did not even bother to check because it would take too long for a human reader to do so (life is short at age 60). It also wrote a nice and detailed review, much better than most human reviews. Of course, it had the advantage of being trained on this research area, but I also submitted the manuscript to a generic AI with no special knowledge. It, too, spotted the fatal conceptual mistake. This brings me to the main point of this rant.
Dear authors, do yourself and others a favor. Use AI to review your paper before you submit it. Even better, ask it to evaluate the manuscript from the perspective of the legendary Reviewer 2 and address critical issues before you submit it to a journal. You save yourself time and effort, but more importantly, you are a good citizen and do not clog the peer-review system with flawed manuscripts in the hope that they pass peer review despite major problems.
Full citation: Soto, M. D., & Schimmack, U. (2024). Credibility of results in emotion science: A Z-curve analysis of results in the journals Cognition & Emotion and Emotion. Cognition and Emotion. https://doi.org/10.1080/02699931.2024.2443016
Purpose of this document: This is a detailed analytical summary written entirely in the summarizer’s own words. It is intended to make the paper’s methods, results, and arguments accessible for discussion and analysis without reproducing copyrighted text. Readers should consult the original article for exact language and figures.
Structured Summary
1. Motivation and Research Question
The paper addresses whether the replication crisis — documented most prominently by the Open Science Collaboration (2015), which found only 36% of psychology results replicated — extends to the emotion research literature specifically. The authors note that the OSC findings were limited to articles from 2008 and may not generalize to emotion research, which has its own dedicated journals and traditions.
The two journals examined are Cognition & Emotion (established 1987) and Emotion (established 2001 by APA). The authors aimed to assess: (a) how much selection bias exists in these journals, (b) what proportion of published results might be false positives, (c) what the expected replication rate is, and (d) whether these indicators have improved over time in response to the replication crisis.
2. Z-Curve Method: How It Works
The paper uses Z-curve 2.0 (Bartoš & Schimmack, 2022), which takes a set of test statistics, converts them to absolute z-scores, and fits a finite mixture model to the distribution of statistically significant z-values (those exceeding 1.96). The method produces four key estimates:
Expected Discovery Rate (EDR): An estimate of the average true power of studies before selection for significance. This represents what proportion of all conducted tests (including unpublished ones) would be expected to reach significance. It is conceptually the mean power across the full population of tests.
Expected Replication Rate (ERR): An estimate of mean power after selection for significance — that is, among published significant results. Because significance selection favors higher-powered studies, ERR is always higher than EDR. The authors frame ERR as an optimistic upper bound on expected replication success.
Observed Discovery Rate (ODR): Simply the proportion of extracted test statistics that were statistically significant at p < .05. Comparing ODR to EDR quantifies selection bias: a large gap indicates that many non-significant results went unreported.
False Discovery Risk (FDR): Computed from the EDR using Sorić’s (1989) formula, which gives the maximum proportion of significant results that could be false positives given a particular discovery rate.
The authors explicitly note that ERR overestimates actual replication success (comparing z-curve’s ERR for the OSC dataset to the actual 36% rate), and they recommend interpreting the true replication rate as falling somewhere between EDR and ERR, citing Sotola (2023) for empirical support.
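For readers who want the arithmetic, Sorić’s bound is a one-line formula. Plugging in an EDR of 30% (the value reported for both journals below) returns roughly the 12% FDR reported in the results. A minimal R version:

```r
# Soric's (1989) upper bound on the false discovery rate, given the
# discovery rate (EDR) and the significance criterion alpha
soric_fdr <- function(edr, alpha = .05) {
  (1 / edr - 1) * alpha / (1 - alpha)
}

soric_fdr(edr = .30)   # ~0.12: at most ~12% of significant results are false positives
```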
3. Methods
3.1 Test Statistic Extraction
The authors collected the complete set of published articles from both journals (3,831 from C&E covering 1987–2023; 2,323 from Emotion covering 2001–2023). Using custom R code built on the pdftools package (Ooms, 2024), they automatically extracted reported test statistics: F-tests, t-tests, chi-square tests (with df between 1 and 6 only, to exclude SEM model-fit tests), z-tests, and 95% confidence intervals of odds ratios and regression coefficients.
Chi-square tests with df > 6 were excluded because these typically come from structural equation modeling, where rejecting the null indicates poor model fit rather than a substantive finding. Confidence intervals were excluded when reported alongside test statistics to avoid double-counting. Meta-analysis articles were excluded entirely.
The extraction code was designed to handle various notation formats across journals and was iteratively refined. However, the authors acknowledge that the automated process cannot extract statistics from tables or figures, and cannot distinguish between focal and non-focal hypothesis tests.
After exclusions (including test statistics with N < 30, since t-to-z conversion is unreliable at very low df), the final samples were 30,513 z-scores from 1,902 C&E articles and 35,457 z-scores from 1,953 Emotion articles. The majority were F-tests (62% C&E, 53% Emotion) and t-tests (26% C&E, 28% Emotion).
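The paper's extraction code is not reproduced here, but the basic idea can be sketched in a few lines of R. This is a simplified illustration using pdftools and a single regular expression for t-tests; the actual code handled many more formats, journals, and edge cases, and the file name below is a placeholder.

```r
library(pdftools)   # Ooms' package for reading PDF text

# Simplified sketch: pull t-tests like "t(34) = 2.51" from one article
# and convert them to absolute z-scores
txt     <- paste(pdf_text("article.pdf"), collapse = " ")
matches <- regmatches(txt,
                      gregexpr("t\\(\\s*\\d+\\s*\\)\\s*=\\s*-?\\d+\\.?\\d*", txt))[[1]]

df <- as.numeric(sub("t\\(\\s*(\\d+)\\s*\\).*", "\\1", matches))
tv <- as.numeric(sub(".*=\\s*", "", matches))

# t-to-z conversion via the two-sided p-value (unreliable at very low df,
# which is why the paper excluded tests from very small samples)
p <- 2 * pt(-abs(tv), df)
z <- qnorm(1 - p / 2)
```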
3.2 Statistical Analysis — The Clustering Approach
This is a critical methodological detail. The authors used the zcurve_clustered function with the “b” method, which samples a single test statistic from each article during model fitting. This directly addresses the independence violations that arise when multiple test statistics are extracted from the same paper.
The EM algorithm was applied to significant z-values between 1.96 and 6 (values above 6 are treated as having essentially 100% power). The fitted mixture model uses seven discrete components (z = 0 through 6), and the estimated weights are used to compute EDR and ERR. The model then extrapolates the full distribution to estimate what the non-significant portion would look like without selection.
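The clustering logic can be illustrated with the zcurve R package. The sketch below mimics the spirit of the “b” method by drawing one significant z-score per article and fitting an ordinary z-curve to that subsample; the published analyses used the package's zcurve_clustered function, which performs this sampling as part of model fitting, and its exact interface may differ from this simplified call.

```r
library(zcurve)

# Hypothetical data: several test statistics per article, converted to |z|
dat <- data.frame(
  article = rep(1:500, each = 4),
  z       = abs(rnorm(2000, mean = rep(runif(500, 0, 3), each = 4), sd = 1))
)

# Keep significant results and draw one z-score per article to remove
# within-article dependence (the idea behind the "b" method)
sig <- dat[dat$z > 1.96, ]
one_per_article <- do.call(rbind, lapply(split(sig, sig$article),
                                         function(d) d[sample(nrow(d), 1), ]))

fit <- zcurve(one_per_article$z)  # z-curve fit to the selected z-scores
summary(fit)                      # reports the EDR and ERR estimates
```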
3.3 Time Trend Analysis
Annual z-curve estimates were computed for each publication year and regressed on linear and quadratic predictors of year. The quadratic term tested whether improvements accelerated after 2011 (when the replication crisis became prominent).
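In code, the trend analysis amounts to a small regression per indicator. A sketch with a hypothetical data frame of annual estimates (the column names and placeholder values are mine):

```r
# Hypothetical annual z-curve estimates, one row per publication year
annual <- data.frame(year = 1987:2023,
                     edr  = runif(37, .2, .4))     # placeholder values

annual$year_c <- annual$year - mean(annual$year)   # center the predictor
fit <- lm(edr ~ year_c + I(year_c^2), data = annual)
summary(fit)  # linear term = overall trend, quadratic term = acceleration
```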
3.4 Hand-Coded Focal Tests
To address the limitation that automatic extraction conflates focal and non-focal tests, the authors also present results from 241 hand-coded articles from 2010 and 2020, drawn from an ongoing project covering 30+ journals and 4,000+ studies (Schimmack, 2020). This sample contained 227 significant tests out of 241 total.
4. Results
4.1 Main Z-Curve Estimates
The two journals produced remarkably similar results:
Parameter   Cognition & Emotion   Emotion
ODR         71% [70%, 71%]        70% [70%, 70%]
EDR         30% [14%, 53%]        31% [15%, 53%]
ERR         66% [59%, 73%]        65% [59%, 71%]
FDR         12% [5%, 32%]         12% [5%, 30%]
The ODR-EDR gap (approximately 40 percentage points) provides clear evidence of selection bias in both journals, confirmed visually by a sharp drop in observed z-scores just below the significance threshold of 1.96.
The ERR of approximately 65% suggests that the majority of published significant results should replicate with the same sample size, though the authors stress this is an optimistic estimate. The FDR point estimate of 12% is comparable to medical clinical trial journals (14% per Schimmack & Bartoš, 2023) and substantially lower than the most pessimistic predictions (Ioannidis, 2005). However, the upper bound of the FDR confidence interval (~30%) is high enough to warrant concern.
4.2 Time Trends
Sample sizes (degrees of freedom): Both journals showed significant linear increases over time, with some acceleration (significant quadratic trends). Median within-group df increased from roughly 50 in the early years to over 100 in recent years for Emotion, and showed a particularly sharp increase in C&E’s most recent years.
ODR: Both journals showed significant linear decreases in ODR over time (approximately 0.45 percentage points per year), suggesting that non-significant results are being reported more frequently. However, the quadratic terms were non-significant, meaning this trend preceded the replication crisis rather than being a response to it.
EDR: Both journals showed significant increases in EDR over time, consistent with increasing sample sizes leading to higher power. The combination of decreasing ODR and increasing EDR indicates that selection bias has diminished, though it remains present.
ERR: Increased over time for both journals, with C&E showing a significant acceleration (quadratic trend) suggesting the replication crisis may have prompted improvements.
FDR: Decreased over time as a direct consequence of the increasing EDR.
4.3 Hand-Coded Focal Test Results
The 241 hand-coded focal tests from 2010 and 2020 yielded:
Parameter   Estimate   95% CI
ODR         94%        [91%, 97%]
EDR         27%        [10%, 67%]
ERR         65%        [53%, 75%]
FDR         14%        [3%, 50%]
The ODR for focal tests (94%) is substantially higher than the 70–71% from automatic extraction, confirming that automatic extraction captures many non-focal, non-significant tests that dilute the ODR. However, the EDR, ERR, and FDR estimates are comparable to the automatically extracted results and fall within their confidence intervals. This is an important robustness check: the key z-curve parameters are not substantially altered by the inclusion of non-focal tests.
4.4 Alpha Adjustment Analysis
The authors examined the effect of lowering the significance threshold on discovery rates and false positive risk. Lowering alpha from .05 to .01 retains approximately half of all significant results while reducing FDR to below 5% for most publication years. Further reductions to .005 or .001 have diminishing returns for FDR reduction but increasingly sacrifice power.
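The alpha adjustment can be checked on any set of extracted z-scores. A minimal sketch with hypothetical values shows how to count the p < .05 results that survive the stricter threshold:

```r
# Hypothetical absolute z-scores of significant results (p < .05)
z_sig <- c(1.98, 2.05, 2.20, 2.40, 2.90, 3.40, 4.10, 5.00)

crit_01 <- qnorm(1 - .01 / 2)   # ~2.576, two-sided cutoff for alpha = .01
mean(z_sig > crit_01)           # share of significant results surviving alpha = .01
```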
5. Discussion and Interpretation
The authors frame their results as relatively encouraging for emotion research compared to worst-case scenarios. Key interpretive points:
The FDR of approximately 12% (though with wide CIs) suggests that most published significant results in emotion journals are not false positives. However, the upper bound of the CI leaves open the possibility of rates up to 30%.
The ERR of 65% predicts that most significant results should replicate with the same sample size, but this is optimistic. Adjusting for the estimated FDR, power for true effects may be approximately 72%, close to the conventional 80% benchmark but with substantial heterogeneity — half of studies have less power than this average.
The authors recommend treating results with p-values between .05 and .01 with skepticism, and suggest that alpha = .01 provides a better balance between false positive risk and power loss for the emotion literature specifically. They emphasize this recommendation is for evaluating existing literature, not as a new publication standard.
On effect sizes, the authors warn that selection bias inflates point estimates, making even meta-analytic effect sizes unreliable unless bias correction is applied. They advocate for honest reporting of all results, including non-significant ones, as essential for accurate meta-analysis.
6. Limitations Acknowledged by the Authors
The authors explicitly discuss several limitations:
Z-curve’s selection model assumes that publication probability is a function of power. In reality, questionable research practices (QRPs) can produce significance without real effects, potentially inflating EDR estimates and underestimating selection bias.
Simulation studies of z-curve performance under QRP-generated data are lacking.
The N > 30 exclusion removes some studies, though supplementary analyses with the full sample show similar results.
Automated extraction cannot distinguish focal from non-focal tests (addressed by the hand-coded analysis).
The automated extraction cannot reliably capture statistics from tables or figures.
7. Key Methodological Features Relevant to the Pek et al. Debate
Several aspects of this paper are directly relevant to criticisms raised by Pek et al.:
Independence assumption: Soto & Schimmack explicitly used zcurve_clustered with the “b” method, which samples one test statistic per article during bootstrapping. This directly addresses the concern about within-article dependence. The method section states this clearly.
Focal vs. non-focal tests: The paper includes both automatic extraction (all tests) and hand-coded focal tests, and shows that the z-curve parameters (EDR, ERR, FDR) are comparable across both approaches. This addresses the concern that including non-focal tests distorts results.
Appropriate caveats: The authors consistently describe ERR as optimistic, characterize the true replication rate as lying between EDR and ERR, acknowledge the wide confidence intervals on EDR and FDR, and explicitly discuss the limitations of the selection model assumption.
Asymmetric interpretation: The paper notes that z-curve evaluations of credibility are asymmetric — low values raise concerns about a literature, but high values do not guarantee credibility.
8. Summary Table of All Z-Curve Estimates
Analysis              N tests   N sig    ODR   EDR [95% CI]      ERR [95% CI]      FDR [95% CI]
C&E (auto)            30,513    21,628   71%   30% [14%, 53%]    66% [59%, 73%]    12% [5%, 32%]
Emotion (auto)        35,457    24,824   70%   31% [15%, 53%]    65% [59%, 71%]    12% [5%, 30%]
Focal (hand-coded)    241       227      94%   27% [10%, 67%]    65% [53%, 75%]    14% [3%, 50%]
Summary prepared for analytical discussion purposes. All descriptions reflect the summarizer’s interpretation of the original work. For exact language, figures, and supplementary analyses, consult the published article.
No polite ChatGPT edits. Unfiltered raw Schimmack. Love it or hate it.
It was supposed to be the American Psychological Society (APS), but international researchers complained – especially those who want to publish in prestigious American journals – and APS became the Association for Psychological Science.
Psychological Science is now a brand name, and many departments have been renamed Departments of Psychological Science. However, you do not become a science just because you call yourself one; you actually have to behave like a science. And that seems to be something many psychologists do not want to do, because it would mean letting the data decide what is true. Just like William James, many psychologists like their theories more than the truth. So they continue to perform silly statistical rituals (Gigerenzer) that are biased to show either evidence for their beliefs (p < .05) or no evidence against them (p > .05) and justify another biased test.
Every generation, there have been a few psychologists who were frustrated by the futility of this and made suggestions to improve things (Meehl, Cohen, Gigerenzer) or simply faked the data (Stapel). You have to give it to Stapel. Why collect data if their only purpose is to add p < .05 to any claim one wants to make?
Since the early 2010s, thanks to Bargh and Bem, more people are calling for change, but progress is slow and stalling. Meanwhile, most published articles continue to report claims with p-values below .05.
A cynical approach to this sad state of affairs would be to say “fuck it”, “burn it all down,” and enjoy life. However, some people just can’t let go. We (Brunner, Bartos, Schimmack) developed a statistical method that helps readers to distinguish between good and bad significant results. Good ones come from studies with high statistical power that are likely to replicate. Bad ones are studies with low power or even false positive results that will not replicate. Of course, there is no hard line, but we can identify subsets of good studies, if they exist.
You would think an aspirational science would welcome a tool that can salvage good results from decades of research with mostly significant results. Which ones are trustworthy? Which ones are like pornception (Bem, 2011)?
But being a science would mean that we have to expose the fact that some results were made up – not like Stapel on his laptop – but by collecting and analyzing data, year after year, painstaking work to get significant results – and many unpublished failures. No, we cannot have this. Therefore, we have to fight the method that can distinguish good and bad research.
To fight this method, we need to get a peer-reviewed article that claims “the method does not work.” To do so, the article does not have to be evaluated by statisticians or present good arguments. All we need is a quotable peer-reviewed article, because peer-reviewed equals truth, which is also why extrasensory perception is true (Bem, 2011, JPSP).
Now reviewers can quote the criticism – without citing the evidence that contradicts these claims – and editors can use the peer reviews to reject the article. The key feature of science is to fight motivational biases. If a system just amplifies and glorifies misinformation that passed peer review, it is not a science. Maybe APS really means Anti-Psychological Science.
The question is how long this game of self- and other-deception can continue. At what point will public interest in psychology wane because it never produces useful results that advance society, health, and wellbeing? Science is worth defending against the attacks by Trumpians, but I am not sure psychological science is part of it.