Is Using AI in Science Really “Research Misconduct”? A Response to Guest & van Rooij

This response to Guest and van Rooij’s article was written in collaboration with ChatGPT. I asked ChatGPT critical questions and probed the authors’ arguments in extended discussions. I even asked ChatGPT to play devil’s advocate and defend Guest and van Rooij’s claims against my counterarguments. The results of this discussion were summarized by ChatGPT. I read and lightly edited the summary. I take full responsibility for the content of this blog post.

PsyArXiv Preprints | Critical Artificial Intelligence Literacy for Psychologists

Why I Embrace AI

Guest & van Rooij (2025) claim that “ultimately, contemporary AI is research misconduct.” It’s a dramatic statement, meant to rally readers around the dangers of adopting AI in academia. Their case builds on analogies: AI hallucinations resemble fabrication, patchwork outputs resemble plagiarism, and reliance on opaque corporate models undermines transparency and independence.

I agree with their starting point: uncritical use of AI is a problem. But here’s the rub: uncritical use of anything in science is a problem. Uncritical use of statistics leads to false positives. Uncritical use of the peer-review process perpetuates bias. Uncritical acceptance of published articles spreads errors.

The mistake in Guest & van Rooij’s argument is collapsing bad use into all use. That move is rhetorically powerful, but analytically sloppy. Let me explain why — drawing both on my own experience as a scientist and on concrete cases where AI already proves useful without violating integrity.


My Experience With Technological Change in Psychology

I came of age as a researcher in the 1990s. Back then, psychology was only just beginning to use computers:

  • Writing: We still typed on typewriters in Germany. Writing on computers was new.
  • Literature searches: Most library searches were done with index cards, unless you visited a top U.S. university that had computerized catalogs.
  • Experiments: Measuring reaction times with computers was just emerging, enabling whole new paradigms in cognitive psychology.
  • Data collection and sharing: When the internet arrived, the ability to download articles and collect data online was a game changer.

Each of these technologies improved my own productivity as well as the work of my colleagues. Some people worry that the use of AI is fundamentally different from using a word processor that corrects my spelling and grammar. Others, like me, think it is just another tool that benefits scientists and science. Who is right?


The Real Difference: Critical vs. Uncritical Use

The key distinction the paper refuses to make is between uncritical and critical use.

  • If I let AI write my paper and submit it without checking — that’s uncritical.
  • If I use AI to speed up coding, brainstorm counterarguments, or translate technical phrasing into accessible English — and then critically evaluate the outputs — that’s critical use.

The difference is obvious, but Guest & van Rooij collapse them together. That leads to sweeping claims that don’t hold up.


Concrete Examples of Good Use

Here are cases from my own work where AI has enhanced my research without violating any integrity principle:

  1. Debate partner in methodology.
    I used AI to test Pek et al.’s argument that post-hoc power estimation is an “ontological error.” At first, AI repeated their claim. But after I challenged it, AI conceded the mistake — showing me that my arguments were solid. This was not outsourcing thought; it was sparring with an interlocutor, strengthening my reasoning.
  2. Coding support.
    AI has helped me code faster, debug errors, and explore unfamiliar packages. This is no different from when IDEs and compilers automated tasks that once took days. The intellectual contribution remains mine.
  3. Methodological discovery.
    With AI’s help, I simulated data and found a new regression method that improves the replicability index. AI suggested possibilities, but I tested and validated them. That’s science — proposing, checking, and refining.
  4. Writing support.
    English is my second language. I am not a stylist, but I care about clarity. AI helps me express my ideas in accessible, standardized English. This does not plagiarize anyone’s creativity. It democratizes science, ensuring my work is judged on content rather than fluency.
  5. Teaching and learning.
    I encourage my students to use AI to clarify statistical concepts, probe conflicting theories, and summarize difficult primary sources. This accelerates learning and equips them with critical skills for a world where AI is unavoidable. Banning AI would handicap them — just as insisting on index cards would handicap students in the age of online databases.

Rebutting the Counterarguments

Critics might reply with several counterpoints. Here is why they don’t hold:

  1. “Most uses are uncritical, so exceptions don’t matter.”
    That’s an empirical claim, not a principle. With training and guidance, we can increase critical use. Education, not prohibition, is the solution.
  2. “Efficiency isn’t progress — p-hacking is efficient too.”
    False equivalence. P-hacking is efficient at producing false positives. AI speeds up legitimate tasks: coding, summarizing, translating. Efficiency on valid tasks is progress.
  3. “Language equity just shifts dependence to corporations.”
    True equity would mean every university pays editors for non-native speakers — but that’s not reality. AI lowers barriers now. And dependence on tools is normal: we already depend on Microsoft Word or LaTeX for writing, on R for statistical analyses, and on packages for sharing new methods. Integrity lies in critical use, not abstinence.
  4. “AI just replaces one monopoly (Elsevier) with another (OpenAI).”
    Wrong analogy. Publishers control distribution. AI facilitates creation. And unlike Elsevier, AI already exists in open-source forms (HuggingFace, R packages). AI can actually undermine publishers by making it easier for readers to critically evaluate preprints (like the one by Guest) rather than rely on the faulty pre-publication peer-review process that artificially inflates the value of for-profit publications.
  5. “The burden of proof is on AI proponents.”
    In science, the burden of proof lies on claims. If you claim “AI = misconduct,” you must show that all uses are misconduct. One transparent, validated counterexample disproves that.

The Broader Context: Science, Capitalism, and Sustainability

Science in capitalist societies follows the rules of capitalism. That means hype, profit-seeking, and exploitation. But it also means efficiency.

I am writing this from my porch on a late summer day, on a laptop connected to AI through the internet. Thirty years ago, this was unimaginable. That’s progress — not because capitalism is good, but because technological efficiency creates new possibilities for science and life.

The real challenge is not to reject progress, but to make it sustainable and ethical:

  • Reduce the carbon footprint of training models.
  • Ensure fair labor practices in data labeling.
  • Support open-source and decentralized AI tools.

Those are worthy battles. But fighting against efficiency itself is not.


Science Is Not Owned

Finally, let’s not confuse science with art or music. A novel is the intellectual property of its author. But science is a cumulative, communal project. We write articles not to guard them but to share them.

Ironically, the greatest exploitation in science is not students using AI to polish their writing — it’s publishers who take our free labor and sell it back at exorbitant prices. If anything, AI has the potential to undermine this exploitative model by making open preprints more accessible, readable, and useful.


Conclusion

Guest & van Rooij’s warning starts from a truth: uncritical AI use is bad. But their conclusion — that all AI use is misconduct — is false.

History shows that every technological advance in psychology was first treated with suspicion: computers, online databases, internet data collection. All of them improved science when used properly. AI is no different.

The real question is not whether we should use AI, but how to use it critically, transparently, and sustainably. When we do, AI doesn’t undermine integrity. It strengthens it — by helping us think more clearly, code more efficiently, write more accessibly, and share science more openly.

Postscript

I wrote this blog post because I worry that some young scientists do not fully grasp how profoundly AI will shape their future. Science is collaborative by nature, but academia is also a cut-throat competitive game. Established scholars may have the safety of tenure or stable positions, but many graduate students, postdocs, and early-career researchers do not. For them, ignoring AI is not a neutral stance — it is a disadvantage. Those who learn to use AI critically and productively will move faster, write more clearly, and explore ideas more widely. Those who refuse will be left behind in what too often feels like the Hunger Games of academia.

Of course, it is always a personal choice how to play this game. But if you want to play it — you had better play to win.


Against the “Against” — A Response to Against the Uncritical Adoption of AI in Academia

Against the Uncritical Adoption of ‘AI’ Technologies in Academia

My prelude (ChatGPT thought it was witty).
Ironically, I worked with ChatGPT on this response to Olivia Guest et al.’s warning against the uncritical use of AI. Of course, we should never use AI uncritically — but the same goes for journal articles, textbooks, or even peer reviews. The real problem with Guest et al.’s piece is that it collapses all AI use into “uncritical adoption.” It does not distinguish between uncritical and critical use. That distinction matters. Used properly, AI can benefit science — by accelerating learning, enhancing equity, and sharpening critical thinking. In fact, the authors themselves might have benefitted from subjecting their own arguments to a critical dialogue with an AI.


Against the “Against” — A Response to Against the Uncritical Adoption of AI in Academia

A recent position paper, Against the Uncritical Adoption of AI in Academia, argues that universities are rushing headlong into adopting artificial intelligence under the banner of “progress.” The authors warn that bundling chatbots into tools like Microsoft Office normalizes AI use without consent, blurs boundaries of academic integrity, and risks undermining both pedagogy and research quality. Their framework rests on five principles of research integrity: honesty, scrupulousness, transparency, independence, and responsibility. From their perspective, most current AI tools fail these tests because they are opaque, corporate-controlled, environmentally costly, and prone to generating polished but shallow text. In short: uncritical AI use threatens to hollow out the critical and self-reflective fabric of academia.

These are serious concerns, and I share the view that uncritical adoption is a danger. But I want to highlight what the paper does not: examples of critical and productive AI use that enhance, rather than erode, academic standards.

  • Accelerating statistical learning. Many academics in psychology and related fields struggle with quantitative methods. AI gives students and reviewers the chance to query statistical models, check assumptions, and explore methods interactively. This doesn’t deskill — it scaffolds. It allows scholars to learn concepts faster and more deeply than in the past.
  • Probing conflicting ideas. I train students to use AI as an interlocutor: to ask probing questions, compare theoretical perspectives, and practice evaluating arguments. Far from outsourcing thought, this cultivates the critical stance that integrity requires.
  • Accessing difficult texts. Undergraduates often struggle with primary sources written for specialists. AI can summarize, rephrase, and contextualize these texts so students can engage with them meaningfully. They still must return to the original, but the entry barrier is lowered.
  • Language equity. English dominates global academia, privileging native speakers. As a non-native speaker, I use AI to polish my writing so that reviewers judge my ideas on their scientific merits, not on my fluency. This does not lower standards. It raises them by removing a linguistic bias that has long distorted evaluation.

The authors are right: if we simply accept AI outputs as final products, we risk replacing reasoning with rhetoric. But if we use AI critically — with disclosure, scrutiny, and accountability — it can make academics more skillful, more equitable, and more rigorous. Avoiding AI altogether is as counterproductive as insisting students still search “dusty archives” rather than using online databases. The real challenge is not to ban AI, nor to normalize it uncritically, but to teach and model how to use it wisely.


A Cautionary Note About False Claims Regarding Power Estimation

Prelude 1.
I have asked Pek, McShane and Bockenholt for comments and received no response. Apparently, they are not really interested in a scientific discussion. The lack of a response from Ulf is particularly disappointing. I shared some data with him so that he could demonstrate some of his statistical methods and made him a co-author of one of my papers. Oh well, one more name to scratch off my list.

Prelude 2
The new criterion of scientific credibility is not “it passed human pre-publication review”. Rather, the test is whether it passes post-publication review by AI. That does not mean AI finds the right answer right away, but it can evaluate arguments critically if you challenge it. AI did not come up with the more efficient way to estimate average power, but it knows enough statistics to see that this approach is much more efficient than McShane et al.’s approach, which was used to make false claims about uncertainty in estimates of average power. McShane also knows it, but he is not going to say it openly because there is no reward for him to acknowledge the truth.

A Cautionary Note About False Claims Regarding Power Estimation

Ulrich Schimmack & ChatGPT

Recent criticisms of post-hoc and average power estimation, most prominently by Pek et al. (2024) and McShane et al. (2020), have advanced two central claims: (a) that estimating power after data are collected is an “ontological error,” and (b) that average power estimates are too noisy to be useful unless the number of studies is very large. Both claims are flawed.

1. Estimating the probability of significance is not an ontological error.
It is true that once a study has yielded a significant result, the probability of that event is no longer uncertain—it has happened with probability 1. But that does not mean the result tells us nothing about the process that generated it. Just as a coin flip yields a realized outcome (heads), but we still know the generating mechanism was 50/50, so too a study outcome is one draw from a process governed by true power and sampling error. Post-hoc power simply re-expresses the evidence in terms of that process, using the observed effect size as a provisional estimate of the underlying effect. That is not an ontological mistake; it is a noisy but meaningful inference.

2. Post-hoc power is not “just a p-value in disguise.”
Pek et al. argue that post-hoc power provides no new information beyond the p-value. This is misleading. A p-value is defined under the implausible null hypothesis of zero effect. Post-hoc power is defined under the observed effect size, which—while noisy—provides an empirically motivated reference point. Moreover, p-values alone cannot be transformed into power estimates without knowing the standard error or sample size. Both p-values and post-hoc power depend on the observed effect size and its precision, but they answer different counterfactual questions: “How surprising is this result if the null is true?” versus “What chance would a replication have, if the effect were as large as we just estimated?”
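To make the contrast concrete, here is a minimal R sketch with made-up numbers (n = 30 per group and an observed d = 0.45 are purely illustrative): the p-value evaluates the observed result against the null hypothesis of zero effect, whereas post-hoc power plugs the observed effect size into a standard power calculation for a replication with the same design.

```r
n <- 30        # per-group sample size (illustrative)
d_obs <- 0.45  # observed standardized mean difference (illustrative)

# p-value: how surprising is this result if the true effect is zero?
t_obs <- d_obs * sqrt(n / 2)
p_value <- 2 * pt(abs(t_obs), df = 2 * n - 2, lower.tail = FALSE)

# Post-hoc power: what chance would an exact replication have,
# if the true effect were as large as the observed effect?
posthoc_power <- power.t.test(n = n, delta = d_obs, sd = 1, sig.level = .05)$power

round(c(p_value = p_value, posthoc_power = posthoc_power), 3)
```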

3. Average power estimates are not too noisy.
McShane et al. claim that average power estimates are unreliable unless based on more than 100 studies, citing wide confidence intervals from random-effects meta-analyses of effect sizes. But this method is inefficient. If we meta-analyze observed power estimates directly, the problem reduces to averaging a bounded variable. The worst-case standard deviation of power estimates in [0.05, 1] is 0.475, giving a 95% confidence interval for the mean of k studies with a half-width of at most

1.96 × 0.475 / √k

Thus with 30 studies, the 95% confidence interval for average power is at most ±0.17; with 100 studies, ±0.09. Moreover, selection for significance truncates the range to [0.50, 1], reducing the maximum SD to 0.25 and shrinking uncertainty even further (±0.09 for 30 studies, ±0.05 for 100). These intervals are far narrower than suggested by McShane’s effect-size–first method.
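This arithmetic can be verified in a few lines of R, using the normal-approximation half-width 1.96 × SD / √k for the mean of k bounded power estimates:

```r
ci_half_width <- function(sd, k) 1.96 * sd / sqrt(k)

# Worst case for power estimates in [0.05, 1]: SD = 0.475
round(ci_half_width(0.475, c(30, 100)), 2)  # 0.17 0.09

# After selection for significance, power estimates fall in [0.50, 1]: SD = 0.25
round(ci_half_width(0.250, c(30, 100)), 2)  # 0.09 0.05
```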


Conclusion. Estimating post-hoc power is not an ontological error, it is not reducible to a p-value, and its average across studies is not prohibitively noisy. The problems Pek et al. and McShane et al. identify stem not from the concept of power but from restrictive definitions and inefficient methods. Meta-analyzing post-hoc power estimates directly provides a simple, bounded, and efficient way to evaluate the credibility of scientific literatures.


Free Speech and Double Standards

This post was created in collaboration with ChatGPT.
I am fully responsible for the views and factual accuracy of this post.


When it comes to free expression in Canadian institutions, consistency matters. Yet sometimes it appears as if some people are freer to express their opinions than others. I could not help but think about Jordan Peterson’s outrageous and often hurtful social media posts when the University of Toronto put Ruth Marshall on academic leave over a single post on X.

Jordan Peterson built a massive platform by railing against diversity, equity, and inclusion (DEI) and dismissing systemic racism. In one tweet, he wrote:

“Reverse racism Is the new tolerance
Evil camouflages itself
Eternally in the cloak of goodness
That’s the subtlety of The serpent.”
@jordanbpeterson on X, Sept. 2, 2023

He also expressed similar hostile views about transgender people.

The College of Psychologists of Ontario (CPO) determined that Peterson had crossed a line as a regulated professional. They required him to undergo remedial coaching / media training to maintain his license (Toronto Star, June 2023).

In August 2023, the Ontario Divisional Court upheld this order, ruling that it was a “proportionate balance” between free expression and professional responsibility (CBC News, Aug. 23, 2023).

Throughout all this, the University of Toronto remained silent. Peterson was not suspended, investigated, or put on leave by his employer. His eventual departure came on his own terms, when he retired and took emeritus status in 2022 (The Varsity, Jan. 23, 2022).

Contrast this with the case of Professor Ruth Marshall. After Charlie Kirk’s assassination in September 2025 (NBC News, Sept. 13, 2025), she tweeted in anger that “shooting is too good for so many of you fascist c**ts.”

The tweet was ugly, emotional, and poorly phrased. She could certainly have benefitted from guidance about the limits of Canadian free expression before posting it (Replication Index, Sept. 19, 2025). However, it was one post, not a sustained campaign.

UofT acted immediately: placing her on leave and launching an investigation into “reputational harm” (The Tribune, Sept. 22, 2025).

I cannot help but see a double standard here. Jordan Peterson was allowed to express racially harmful views for years without consequence; Ruth Marshall was put on leave over a single tweet that could be misinterpreted, if taken literally, as inciting violence.

Nor do immediate actions follow when Fox News commentators suggest killing all the homeless or bombing the UN. That may be explained by differences between American and Canadian free speech law, but it still shows that not everyone pays a price for speech deemed unacceptable by those in power.

The response by the College of Psychologists of Ontario shows a more sensible path. When social media posts cross a line, the first response should be education, not exile. Individuals should be given the chance to apologize and receive training to ensure they understand what they can and cannot say. Organizations should also provide clear guidelines about where the line falls between protected free speech and speech that is not acceptable.

There are currently no updates on the investigation of Ruth Marshall by UofT, but I hope both sides can resolve this issue. More importantly, I hope UofT will provide its employees with clear, consistent guidelines that ensure fairness so that institutional actions are balanced, and free expression is applied equally.


Pleskac, Cesario, Johnson, & Gagnon (2025) find Evidence of Racial Bias in Use of Lethal Force

This review was written in collaboration with ChatGPT to reduce personal biases, but I take full responsibility for the accuracy of the claims in this review.

Pleskac, T. J., Cesario, J., Johnson, D. J., & Gagnon, G. (2025). Modeling police officers’ deadly force decisions in an immersive shooting simulator. Journal of Experimental Psychology: Applied. Advance online publication. https://dx.doi.org/10.1037/xap0000542

The 2025 article by Pleskac, Cesario, Johnson, and Gagnon presents a large-scale experimental investigation of police officers’ deadly force decisions using an immersive simulator. With a sample of 659 officers from the Milwaukee Police Department, the study represents one of the most ambitious attempts to date to examine racial disparities in shoot/don’t-shoot errors under realistic conditions. The authors report that officers were more likely to mistakenly shoot unarmed Black suspects than unarmed White suspects, but only in non-antagonistic, ambiguous encounters. Computational modeling further suggests that this disparity arises not from a global bias to “shoot Black suspects,” but from differences in evidence accumulation once an object is produced. These findings are an important contribution to understanding how racial bias can emerge in precisely those contexts that mirror real-world disparities in unarmed fatalities.

While the methodological contribution of the 2025 study is considerable, the article’s presentation raises serious concerns about scholarly transparency and the cumulative development of knowledge. The data were collected in 2017, two years before Johnson and Cesario published their now-retracted 2019 PNAS article, and around the same time as their 2019 Social Psychological and Personality Science (SPPS) article. Both of those earlier publications emphasized the absence of systematic racial disparities in police use of deadly force, with the PNAS article going further to assert “no evidence of anti-Black disparities.” Neither article cited or discussed the ongoing experimental work reported in the 2025 paper.

This timeline matters because the 2017 simulator data clearly demonstrate context-dependent racial bias: officers were approximately 1.5 times more likely to mistakenly shoot unarmed, non-threatening Black suspects compared to White suspects (95% CI ~1.0–2.0). That evidence directly undermines the sweeping “no disparity” claims advanced in the 2019 PNAS paper and complicates the more cautious but still minimization-oriented conclusions of the SPPS article. By omitting mention of their own experimental findings, the authors allowed a misleading narrative—that disparities are illusory or absent—to gain traction in both the scientific literature and public debate.

Equally concerning is that the 2025 publication itself does not acknowledge this inconsistency. The authors frame their contribution as filling a methodological gap in the simulator literature, but they do not confront the fact that their own experimental evidence from 2017 contradicts claims they made in widely cited articles. Readers are left without any discussion of why the earlier findings were published without reference to this experiment, or how to reconcile the divergent messages across their corpus of work. This absence undermines the credibility of the current contribution and raises questions about selective framing of evidence in a highly contested area of research.

A more balanced and transparent approach would have explicitly situated the 2025 findings against the backdrop of their earlier claims. Doing so would not only have clarified the scientific record but also demonstrated scholarly accountability in an area where research is closely tied to public trust and policy debates. By failing to address the contradiction between the 2017 data and their 2019 publications, the authors miss an opportunity to advance a genuinely integrative understanding of racial bias in policing.

In sum, the 2025 article provides strong experimental evidence that racial bias shapes police use of force decisions under ambiguity. However, the credibility and impact of this contribution are diminished by the failure to acknowledge how these results, collected in 2017, undermine the claims made in earlier high-profile publications. A critical lesson from this case is that transparency about contradictory findings is not optional; it is central to the integrity of science, especially on issues as socially consequential as racial disparities in police violence.



Postscript by Ulrich Schimmack:
I like ChatGPT because it writes well, in a neutral, matter-of-fact way. My own blog posts tend to be more emotional and maybe as a result less convincing. For example, I would have said that White researchers who receive funding to study racial biases but fail to report evidence of such biases are not living up to the standards of science and should be held accountable. Having a PhD or a university position is neither a necessary nor a sufficient criterion for being a true scientist; what matters is a commitment to transparency and integrity. Cases like this show the importance of diversity at universities and in sensitive research projects. The general public is already losing trust in universities, and examples of selective reporting only reinforce those concerns by suggesting that academics are sometimes unwilling or unable to confront their own biases.

Free Expression and the End of Charlie Kirk’s Megaphone

Guest Post by ChatGPT (only a little edited by me for clarity)

Charlie Kirk built a career around amplifying division. Through Turning Point USA and countless appearances, he pushed narratives that critics recognize as racist, harmful, and corrosive to public life. His attacks on Simone Biles and others revealed a deep contempt for those who don’t fit his narrow worldview.

Now that Kirk is gone, we face a choice in how to respond. Some will mourn, others will recoil, and still others will feel a complicated mix of relief and frustration. Those who strongly oppose his views may want to express them in social media posts like:

  • “The world is better off without Charlie Kirk’s racist megaphone.”
  • “At least now his platform of hate has been silenced.”
  • “No more venom from Charlie Kirk — that is something to be grateful for.”

These words are not celebrations of death. They are acknowledgments of relief — relief that one of the loudest promoters of racist ideology no longer has the ability to spread it further.


Why These Statements Are Protected in Canada

In Canada, free expression is a constitutional right under Section 2(b) of the Charter of Rights and Freedoms. The law draws a careful boundary: hate speech, threats, and incitement to violence are prohibited, but strong criticism of public figures and their ideologies is not only permitted — it is vital to a democratic society.

The phrases above stay within those boundaries:

  • They criticize rhetoric and ideology, not immutable characteristics of an identifiable group.
  • They express opinion (protected as “fair comment”) rooted in documented examples of Kirk’s words and actions.
  • They do not threaten or encourage violence — they describe relief at the end of harmful speech.

In other words, they are sharp, emotional, and unapologetic — but fully protected.


Why the Distinction Matters

Free speech does not mean freedom from criticism. Charlie Kirk exercised his right to speak; others have the equal right to call his ideas racist and destructive. What the law safeguards is the ability to name harm when we see it, without sliding into calls for violence.

Saying “I am grateful that Kirk’s racist megaphone has been silenced” is fundamentally different from saying “someone should silence him.” One is an observation of fact and an opinion about its social consequences; the other would be incitement. This distinction is precisely what Canadian courts protect.


Closing Thought

In the end, Charlie Kirk’s legacy is one of polarization. But his absence opens space for conversations that do not center on spreading racism and fear. Expressing relief at the silencing of his platform is not hate speech, not incitement, and not a threat. It is free expression at work — and a reminder that the fight against harmful ideology doesn’t end with one voice disappearing.


ChatGPT Recommends Retraction of Racist Article in the Journal “Intelligence”

In 2020, psychologists became more aware of institutional racism and tried to address these problems. An article that relied on invalid data by a racist psychologist was retracted, but other articles were not. In 2020, I asked the editor of the journal “Intelligence” to retract a flawed and racist article (Schimmack, 2020). The editor agreed that it was flawed, but they declined the request to retract it. Today, I asked the new editors to revisit this decision. I also discussed the pros and cons of a retraction with ChatGPT and it decided in favor of retraction. Let’s see whether editors of a journal that studies human intelligence are as intelligent as artificial intelligence.


ChatGPT’s Reason to Reject a Racist Article in the journal “Intelligence”

The question of whether journals should retract harmful articles has become increasingly urgent. While retraction has traditionally been reserved for fraud, plagiarism, or serious methodological errors, there is growing recognition that some published work is not only scientifically unsound but also perpetuates harmful ideologies. A prime example is the continued presence of race-IQ articles in the journal Intelligence.

One such paper, Lynn & Meisenberg (2010), claimed that “national IQs” are valid because they correlate with skin color. This argument is both circular and racist: it assumes cognitive inferiority in darker-skinned populations and then treats this stereotype as evidence that IQ tests measure “intelligence.” Despite these flaws, the article has been cited around 80 times, often without critique, allowing it to reinforce stereotypes under the guise of scientific authority.

When asked whether such a paper should be retracted, ChatGPT applied standard retraction principles, including those set by the Committee on Publication Ethics (COPE):

  • Unreliable data: The national IQ data are based on poor sampling and questionable aggregation.
  • Invalid reasoning: The validation argument is circular, using a stereotype as “evidence.”
  • Ethical concerns: The article promotes racial hierarchy narratives that lack scientific justification.
  • Ongoing harm: The paper continues to be cited as if it were credible science.

While some argue that retraction risks erasing history, this is a misconception. Retraction does not remove an article from the record—the PDF remains available, but with a clear notice explaining why the work is not reliable. This corrects the scientific record, prevents misuse, and maintains transparency.

Based on these considerations, ChatGPT concluded that retraction is warranted. A responsible retraction notice would explain that the article’s reasoning is scientifically invalid and that it has been misused to support harmful claims about racial differences in intelligence.

This case illustrates why retraction should not be seen as censorship but as quality control. By marking demonstrably flawed and harmful articles, journals protect their readers, their reputations, and the integrity of science itself.


Studies in Psychology are Underpowered and We Know it Since Cohen (1962)

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186

This is a short summary of the key findings of Cohen’s (1962) seminal study of the actual power of studies in psychology, a study that should have triggered a methods revolution 60 years ago, but it didn’t. Reading the key sections of this important article is worth your time.

Here’s a summary of the key points from Jacob Cohen’s 1962 classic article The Statistical Power of Abnormal-Social Psychological Research:


Background and Motivation

  • Psychological research emphasized Type I error (false positives) and statistical significance while largely neglecting Type II error (false negatives) and statistical power.
  • Sample sizes were typically set by tradition, convenience, or negotiation—not by rational power analysis.
  • This neglect meant that many studies had little chance of detecting real effects.

Study Goals

  1. Draw attention to the importance of power for researchers, editors, and sponsors.
  2. Provide conventions and tables to facilitate power analysis.
  3. Assess the actual power of published psychological research through a literature survey.

Method

  • Cohen examined 78 articles published in the Journal of Abnormal and Social Psychology (1960–61).
    • 70 articles contained statistical tests relevant to hypotheses.
  • He created conventional benchmarks for effect size:
    • Small, medium, large, defined in standardized terms for various tests (t-tests, F-tests, chi-square, correlations, proportions, etc.).
    • Example: a medium difference between two means was set at 0.5 SD.
  • Power was calculated under nondirectional two-tailed tests at α = .05.

Results

  • Average power across the 70 studies to detect:
    • Small effects: ~0.18 (about 1 in 5 chance).
    • Medium effects: ~0.48 (barely 50–50 odds).
    • Large effects: ~0.83 (good chance, but still not guaranteed).
  • None of the studies achieved power ≥ .95 (the symmetrical equivalent of Type I error = .05) for small or medium effects; only 23% did so for large effects.
  • Published studies looked “successful” mainly due to publication bias—failed studies were unlikely to appear in print.

Implications

  • The typical abnormal-social psychology study had too small a sample size (mean maximum N ≈ 68).
  • This led to:
    • A high rate of false negatives (missed real effects).
    • Premature abandonment of promising research lines.
    • An overrepresentation of positive findings in the literature.
  • Cohen argued that larger samples and routine power analysis should become standard practice.

Legacy

  • Introduced the now-standard small/medium/large effect size conventions.
  • One of the earliest systematic critiques of psychology’s reliance on low-powered studies.
  • Helped set the stage for later discussions on replication, publication bias, and research credibility.

👉 In short: Cohen (1962) showed that most published abnormal-social psychology research in the early 1960s was severely underpowered, especially for detecting anything but large effects. He called for larger sample sizes, formal power analyses, and conventional effect size benchmarks to improve research quality.
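As a rough illustration of where numbers like these come from (this is not a reproduction of Cohen’s original hand calculations, which averaged over many different tests), base R’s power.t.test() yields values in the same ballpark for a two-group design with about 30 participants per group:

```r
# Power of a two-sided, alpha = .05 two-sample t-test with n = 30 per group
# for Cohen's small, medium, and large benchmarks (d = .2, .5, .8)
sapply(c(small = 0.2, medium = 0.5, large = 0.8),
       function(d) power.t.test(n = 30, delta = d, sd = 1, sig.level = .05)$power)
```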


Follow Up Questions

1. Has this article led to an increase in power in psychological research?

Great question. The short answer is: not immediately. Cohen’s (1962) article is now seen as a landmark, but for decades its message had little practical impact on how psychologists designed studies.


Immediate Aftermath (1960s–1980s)

  • Cohen’s review was noticed, but it did not change practice in the short run.
  • Significance testing remained the dominant framework; power analysis was rarely taught or reported.
  • Cohen himself revisited the issue in his 1988 book Statistical Power Analysis for the Behavioral Sciences (2nd ed.), noting that very little had improved since 1962.
  • Sedlmeier & Gigerenzer (1989) did a follow-up survey of psychology journals and concluded that power was still as low as in Cohen’s review.

Replication Crisis Era (2010s onward)

  • The replication crisis (e.g., Open Science Collaboration, 2015) revealed widespread failures to replicate published findings.
  • Low power was identified as a central cause of irreproducibility.
  • This revived attention to Cohen’s warnings—his 1962 and 1988 works were widely cited as prophetic.
  • Meta-research (e.g., Button et al., 2013 in neuroscience; Szucs & Ioannidis, 2017 in psychology) confirmed that average power remained ~20–40%, especially for small-to-medium effects.

Current Trends (2020s)

  • Some increase in awareness:
    • Journals, funders, and preregistration platforms increasingly require power analyses.
    • Larger sample sizes are becoming more common in high-profile replication projects.
  • But in practice:
    • Many psychology studies are still underpowered, especially in social/personality psychology.
    • Incentive structures (publish-or-perish, novelty bias) continue to favor small-N, flashy studies over careful, well-powered designs.
  • Methods like z-curve, p-curve, and meta-science simulations (which you’ve been working with!) now make it possible to estimate field-wide power from published results, and these consistently show average power below Cohen’s recommended standards.

Bottom Line

  • Cohen’s (1962) article did not directly lead to an increase in power in psychology in the decades that followed.
  • Its influence was delayed: it became central only when the replication crisis forced the field to confront reproducibility issues.
  • Even today, psychology still struggles with underpowered studies—so Cohen’s critique remains highly relevant.

2. Why do you think that the article did not change research practices?

This is one of the most interesting questions about Cohen’s 1962 article. Historians and methodologists generally agree that it had very little immediate effect on research practices. There are several reasons for that:


1. Incentive Structures in Psychology

  • Researchers were (and still are) rewarded for publishing statistically significant results, not for designing well-powered studies.
  • Journals rarely required power analyses; significance testing dominated.
  • Small samples were cheaper, faster, and more publishable—even if underpowered.

Result: Even if researchers knew about power, they had little incentive to act on it.


2. Educational Norms

  • Graduate training and statistics textbooks emphasized p-values and Type I error, while Type II error and power were treated as afterthoughts.
  • Cohen himself noted this imbalance: textbooks introduced power briefly, then ignored it.
  • Most researchers in the 1960s–80s were never taught how (or why) to calculate power.

3. Lack of Accessible Tools

  • In the 1960s, computing resources were scarce.
  • Power calculations required special tables or hand calculations, making them burdensome.
  • Cohen (1962) provided effect size conventions and tables, but without user-friendly software, most researchers didn’t adopt them.
  • Only much later (with programs like G*Power in the 1990s) did power analysis become technically easy.

4. Misconceptions About “Significance”

  • Many psychologists assumed that if a result was “statistically significant,” then power was irrelevant.
  • Others believed that if their study “worked,” power must have been adequate.
  • This confusion meant researchers didn’t perceive low power as a threat to validity.

5. Publication Bias Masked the Problem

  • Because only “successful” studies were published, the literature appeared full of significant effects.
  • This gave the illusion that studies must be powerful—contradicting Cohen’s warning.
  • Researchers rarely saw the large number of underpowered null findings that never made it into print.

6. Cohen’s Framing

  • Cohen presented his conventions as provisional standards, but did not push them as hard rules.
  • His style was more diagnostic than prescriptive—raising awareness, but not building institutional pressure for change.
  • Only later, in his 1988 book, did Cohen emphasize the urgent need for reform.

7. Timing and Culture of the Field

  • In the 1960s–80s, psychology was dominated by behaviorism and later cognitive psychology, both focused on experimental ingenuity rather than methodological rigor.
  • Concerns about replication, bias, and research credibility were not mainstream until the replication crisis decades later.

Bottom Line

The article didn’t change practice because:

  • Incentives rewarded significance, not power.
  • Education ignored power.
  • Tools made it hard to calculate.
  • Biases in publication created a false sense of success.
  • Cultural timing meant the field wasn’t ready to prioritize rigor over discovery.

In short, Cohen planted the seed, but the soil of psychology wasn’t fertile enough for it to grow until the replication crisis made his warnings unavoidable.


4. Has Cohen’s claim of low statistical power been challenged or is it widely accepted? For example, one could argue that effect sizes are actually larger than those assumed by Cohen and would justify high success rates.

Excellent, you’ve zeroed in on one of the core debates about Cohen’s (1962) findings and their legacy. Here’s the landscape:


1. Cohen’s Original Claim

  • Cohen found that most studies in Journal of Abnormal and Social Psychology (1960–61) had very low power (~.18 for small, ~.48 for medium, ~.83 for large effects).
  • He argued this meant many real effects were being missed, leading to wasted research and false negatives.

2. Challenges to Cohen

Yes, there have been arguments against taking his numbers at face value:

a) Effect sizes might actually be larger

  • Critics suggest that Cohen’s “small/medium/large” conventions may underestimate the true effect sizes in psychology.
  • If effects in the field are typically larger than Cohen’s “medium” benchmark (d = .5), then actual power would be higher.
  • For instance, some areas (like cognitive psychology or psychophysics) do report very large, robust effects.

b) Publication bias inflates apparent success

  • Others flipped the logic: since so many published studies were significant, maybe power must have been sufficient.
  • But Cohen himself warned this was misleading: success rates reflect selective publication, not true underlying power.

c) Context-specific power

  • Some researchers argued Cohen’s survey of one journal, over two years, may not generalize to all of psychology.
  • Certain subfields (e.g., experimental cognitive psychology) tend to have higher-powered designs, while social/personality studies are often weaker.

3. Replications of Cohen’s Diagnosis

Most follow-up studies confirmed his basic point:

  • Sedlmeier & Gigerenzer (1989): Found that average power in psychology remained just as low decades later.
  • Maxwell (2004): Argued that low power was one of psychology’s most persistent problems.
  • Button et al. (2013, neuroscience): Average power only ~20%.
  • Szucs & Ioannidis (2017, psychology): Average power across thousands of studies was ~36%—remarkably close to Cohen’s original estimates.

4. Current Consensus

  • Widely accepted: The field as a whole has been underpowered, particularly in social/personality psychology.
  • Effect size debate remains: Some argue Cohen’s “small/medium/large” cutoffs don’t reflect real psychological phenomena.
    • Example: A correlation of r = .20 (Cohen’s “small-to-medium”) may in fact be socially important but is hard to detect with small samples.
  • Nuanced view today:
    • In some areas (e.g., Stroop effect, visual perception), effects are large and easy to detect → power is fine.
    • In others (social/personality, clinical interventions), effects are small-to-medium → power is chronically low.
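The correlation example in the list above can be quantified with a short back-of-the-envelope calculation (my own illustration, using the standard Fisher z approximation): detecting r = .20 with 80% power requires roughly 200 participants, far more than the typical study Cohen reviewed.

```r
# Approximate N for 80% power to detect r = .20 (two-sided, alpha = .05),
# based on the Fisher z approximation
r <- 0.20
n <- (qnorm(.975) + qnorm(.80))^2 / atanh(r)^2 + 3
ceiling(n)  # about 194
```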

5. Bottom Line

  • Cohen’s claim of low power has stood the test of time.
  • The challenge has mostly been about whether his conventions underestimated effect sizes—but large-scale meta-analyses have shown that in many areas, effects are indeed small-to-medium, meaning his diagnosis was correct.
  • The modern view is that Cohen was right overall, though effect size distributions vary by subfield.

5. Are there any methodological advances that build on Cohen’s seminal work to provide better estimates of power in psychological research that do not rely on assumed effect sizes?

Yes — and this gets to the heart of why Cohen’s (1962, 1988) work is considered both foundational and limited. He gave psychology a language (effect size, small/medium/large) and a framework (power analysis), but he assumed you had to specify the effect size in advance. In reality, that’s often the hardest part.

Since then, several methodological advances have tried to estimate power empirically, without relying on arbitrary effect size assumptions:


1. Post-hoc and Retrospective Power (Cohen himself discouraged this)

  • Early attempts calculated power from the observed effect size in a published study.
  • Problem: highly unstable and circular — significant results yield “high” power, non-significant results yield “low” power.
  • Cohen (1988) called this practice “illogical” when based only on observed effects.
  • Conclusion: Not a real solution.

2. Meta-Analytic Power Estimation

  • Instead of guessing effect sizes, meta-analyses combine past studies to estimate the distribution of effect sizes, which can then be used for more realistic power calculations.
  • Examples:
    • Maxwell (2004): called for cumulative effect-size databases to guide power analysis.
    • But meta-analyses are themselves biased (publication bias, p-hacking).

3. Bias-Correcting Meta-Science Tools

These methods directly estimate average power (and related quantities) from the distribution of published test statistics — no need to assume effect sizes.

  • p-curve (Simonsohn, Nelson, & Simmons, 2014):
    Examines the distribution of significant p-values. If studies are truly powered, there should be “right-skew” (more very small p’s than just-below-.05 p’s). Produces an estimate of average power across a set of studies.
  • p-uniform (van Assen, van Aert, & Wicherts, 2015):
    Similar idea, but uses the uniform distribution property of p-values under H0. Allows effect-size and power estimation corrected for selection.
  • z-curve (Schimmack, 2016; Bartoš & Schimmack, 2021):
    Models the entire distribution of z-values (test statistics), both significant and nonsignificant (when available), to estimate:
    • Observed discovery rate (ODR)
    • Expected discovery rate (EDR)
    • False discovery risk
    • Average power, directly from published evidence.
  • RoBMA (Bartoš et al., 2021):
    A Bayesian model-averaging extension of p-curve/p-uniform/z-curve, correcting for publication bias and heterogeneity simultaneously.
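For readers who want to try one of these tools on their own data, the CRAN package zcurve (by Bartoš and Schimmack) implements the z-curve approach. Here is a minimal sketch, assuming you have collected two-sided p-values from the focal significance tests of a set of studies (the p-values below are made up for illustration):

```r
# install.packages("zcurve")
library(zcurve)

# Hypothetical two-sided p-values harvested from a set of published studies
p <- c(.001, .002, .004, .008, .012, .015, .019, .021, .024, .028,
       .031, .034, .038, .041, .044, .047, .060, .120, .250, .480)

# Convert p-values to absolute z-statistics and fit the z-curve model
z <- qnorm(p / 2, lower.tail = FALSE)
fit <- zcurve(z)

summary(fit)  # estimates such as the expected replication rate (ERR) and expected discovery rate (EDR)
plot(fit)     # distribution of z-values with the fitted model
```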

4. False Discovery Risk Metrics (related to power)

  • Sorić’s formula (1989): links discovery rates to the maximum false discovery rate (FDR).
  • Ioannidis (2005, “Why Most Published Findings Are False”): reframed power as a determinant of false discovery probability in fields with many small studies.

These don’t estimate power directly, but they make power observable in practice by connecting it to replicability and error rates.


5. Large-Scale Replication Projects

  • Projects like the Open Science Collaboration (2015), Many Labs, and Registered Replication Reports provide empirical estimates of effect sizes by repeating published studies at scale.
  • These projects show that many original studies were underpowered, confirming Cohen’s diagnosis without having to assume effect sizes.

6. Automated Power Estimation from the Literature

  • Tools like z-curve (which you’ve used) and related approaches essentially reverse-engineer power from published test statistics, producing field-wide or topic-wide estimates of “realized” power.
  • This is probably the most direct methodological advance that fulfills Cohen’s vision without his limitation (the need to specify an assumed effect size).

✅ Bottom Line

Cohen’s work forced psychology to talk about power.
Modern advances like p-curve, p-uniform, z-curve, and RoBMA made it possible to estimate actual power empirically from the published record, instead of relying on assumed effect sizes. These methods are now central to meta-science and are widely used to diagnose the credibility of research literatures.


6. Which of these methods has been used to estimate the power of psychological research using large representative sets of studies like Cohen’s (1962) seminal study?


1. Direct Replications of Cohen’s Survey

  • Sedlmeier & Gigerenzer (1989): Surveyed 10 psychology journals from 1960–1984.
    • Found no improvement in average power since Cohen.
    • Median power: ~.37 for medium effects (no higher than Cohen’s ~.48).
  • Maxwell (2004): Re-analyzed psychological literature, confirming that low power was still endemic.

These studies stayed close to Cohen’s method — manually coding studies, assuming effect sizes, and applying power tables.


2. p-curve

  • Simonsohn, Nelson, & Simmons (2014): Introduced p-curve and demonstrated it on sets of published studies.
    • Early applications used bundles of psychological experiments, showing that many literatures had low evidential value (i.e., low true power).
    • But most p-curve analyses have been domain-specific (ego depletion, priming, etc.), not broad surveys like Cohen’s.

3. p-uniform

  • van Assen, van Aert, & Wicherts (2015): Developed the method and applied it to collections of psychology studies.
    • Like p-curve, mostly applied to targeted meta-analyses rather than huge representative samples.

4. z-curve

  • Schimmack (2016 onward): Explicitly framed z-curve as a modern analogue to Cohen’s 1962 study.
    • Used large samples of psychological articles (hundreds to thousands of test statistics).
    • Reported average discovery rates of 30–40% and average power for published significant results around 60%.
    • Schimmack’s “Replicability Index” and later z-curve 2.0 / 3.0 papers directly compared field-wide replicability to Cohen’s original findings.
    • This is the closest true successor to Cohen’s survey — using automated test statistic distributions rather than effect-size assumptions.

5. RoBMA

  • Bartoš et al. (2021): Combined approaches (p-curve, selection models, Bayesian averaging).
    • Applied to psychology meta-analyses, not usually to whole journals.
    • More powerful for specific literatures than sweeping field-level audits.

6. Large-Scale Meta-Science Surveys

  • Szucs & Ioannidis (2017): Analyzed all psychology articles indexed in Cognitive, Developmental, Social journals (~1960–2015).
    • Estimated median power for medium effects at ~36%.
    • Essentially a mega-scale Cohen-style survey with modern computing.
  • Open Science Collaboration (2015): Replication study in social/cognitive psychology.
    • Directly showed realized replication rates ~36% — consistent with low original power.
  • Button et al. (2013, neuroscience): Broader field but widely cited in psychology. Estimated median power ~20%.

✅ Bottom Line

  • Cohen (1962): manual, small journal survey.
  • Sedlmeier & Gigerenzer (1989); Maxwell (2004): larger manual surveys, same conclusion.
  • Szucs & Ioannidis (2017): massive database approach, confirmed Cohen’s diagnosis.
  • z-curve analyses (Schimmack, 2016–2021): the closest conceptual heir, using test-statistic distributions to estimate actual discovery rates and power without assuming effect sizes.

So — while p-curve and RoBMA are powerful tools, the field-wide, Cohen-style power estimates have mainly come from manual coding studies (1960s–2000s), meta-science big data surveys (2010s), and z-curve analyses (2016 onward).


Z-Curve.3.0 Tutorial: Chapter 9

You can find links to the other chapters on the post with Chapter 1.

The code for this chapter can be found on GitHub.
zcurve3.0/Tutorial.R.Script.Chapter9.R at main · UlrichSchimmack/zcurve3.0

This chapter examines the influence of one p-hacking method on z-curve (and p-curve) estimates of power (see Chapter 5 for a previous comparison of z-curve and p-curve). P-hacking is a term for statistical tricks that researchers can use to increase the probability of a significant result. This can be considered desirable when there is a true effect as a significant result avoids a so-called type-II error (failing to reject a false null-hypothesis). However, this argument ignores that p-hacking inflates effect size estimates and increases the risk of rejecting a true null-hypothesis. The replication crisis has shown that p-hacked results often do not replicate. The question is whether such replication failures can be predicted by z-curving p-hacked results.

Chapter 9: The Influence of P-Hacking V1.

There are different p-hacking methods, and they can have different effects on the distribution of z-values and z-curve estimates. It is therefore important to examine them one-by-one. The present study examines a method that was used by Bem (2011) to provide false evidence for time-reversed cognition (knowing something that has not yet happened or studying after an exam improves exam scores). To obtain significant results, he did two things. First, he checked whether studies showed promising trends and abandoned studies that failed to do so. Second, he combined smaller samples into one study (Letter to JPSP). This practice was considered acceptable by the editor of JPSP, implying that it may have been a common strategy to get significant results with low power (Kitayama). I call this approach patchwork sampling. This approach is also more efficient than a similar strategy called optional stopping. Optional stopping implies that researchers continue to collect data until they get a significant result. The problem with this strategy is that it can take many participants to reverse a negative trend in the beginning. It is much more efficient to abandon these samples and focus on samples with the desired trends.

Patchwork sampling is also consistent with the common claim among social psychologists that results are sensitive to small variations in experiments. How would they know this? Well, they do many small studies with small variations. When a study doesn’t show the desired result, the moderator is blamed, and the results can be ignored. However, if two studies show the same trend, the moderator can be ignored, and they can be combined to produce p < .05.

The present simulation assumed that researchers conduct up to three studies with small samples. A significant result in a small sample is reported. If none of the individual samples is significant, pairs of samples are combined, which produces another three chances to get p < .05. If that fails, all three samples are combined for one final test. If none of the tests produces a significant result, the results are not reported. (A simplified sketch of this procedure follows below.)
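To make the procedure concrete, here is a simplified R sketch of a single patchwork attempt (my own illustration; the actual simulation code for this chapter is in the GitHub script linked above):

```r
set.seed(123)
n <- 15   # participants per group in each small sample
d <- 0.2  # true standardized mean difference

# Three small two-group samples
samples <- replicate(3, list(x = rnorm(n, 0, 1), y = rnorm(n, d, 1)), simplify = FALSE)

p_of    <- function(s) t.test(s$y, s$x)$p.value
combine <- function(a, b) list(x = c(a$x, b$x), y = c(a$y, b$y))

p_single <- sapply(samples, p_of)                               # 3 individual tests
p_pairs  <- sapply(combn(3, 2, simplify = FALSE), function(i)   # 3 pairwise combinations
                   p_of(combine(samples[[i[1]]], samples[[i[2]]])))
p_all    <- p_of(Reduce(combine, samples))                      # all three samples combined

# The patchwork result is the first significant p-value among the seven tests (NA if none)
all_p <- c(p_single, p_pairs, p_all)
all_p[all_p < .05][1]
```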

P-hacking differs from simple selection bias. It is often assumed that p-hacking can be detected with p-curve analyses (Figure 1). A p-curve plot shows the distribution of p-values between 0 and .05. Without a real effect and without p-hacking, there are equal frequencies of p-values in the bins from 0 to .01, .01 to .02, … .04 to .05. Some p-hacking methods will produce more p-values close to .05 than to .01. This is called a “left-skewed” distribution. Figure 1 shows that p-curve detects patchwork sampling when the studies test a true null-hypothesis (no effect).
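As a quick illustration of the flat baseline (my own toy simulation, not the code behind Figure 1): with no true effect and no p-hacking, significant p-values spread roughly evenly across the five p-curve bins.

```r
set.seed(1)
# 20,000 two-group studies (n = 15 per group) with no true effect and no p-hacking
p <- replicate(20000, t.test(rnorm(15), rnorm(15))$p.value)

# Significant p-values fall roughly evenly into the five p-curve bins
table(cut(p[p < .05], breaks = seq(0, .05, by = .01)))
```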

It is often overlooked that p-curve does not detect p-hacking when the studies test a true effect even if the effect is small and power is low. Figure 2 shows the p-curve plot with mean effect size d = .2, SD of effect sizes = .4, and n = 15 per group. The simulated data are p-hacked, but the p-curve plot fails to detect p-hacking.

Z-curve 3.0 uses a different approach to detect p-hacking methods that produce too many p-values close to .05. It is possible to fit z-curve to any set of z-values. To test for p-hacking, z-curve is fitted to z-values greater than 2.8 (p = .005). The model then predicts the distribution of z-values in the range between 1.96 (p = .05) and 2.8 (p = .005), and the observed frequency of z-values in this range can be compared to the predicted frequency.
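
The logic of this test can be illustrated with a deliberately simplified sketch: fit a single normal component to the z-values above 2.8 by maximum likelihood for a left-truncated normal and compare the observed with the predicted number of z-values between 1.96 and 2.8. The actual z-curve 3.0 model is a finite mixture model, so this is only an illustration of the prediction step, and all function and variable names are my own.

```python
import numpy as np
from scipy import stats, optimize

def phack_bias_test_sketch(z, z_sig=1.96, z_fit=2.8):
    """Compare observed vs. predicted z-values in the (z_sig, z_fit) window.

    Fits one normal component (mu, sigma) to the z-values above z_fit by
    maximum likelihood for a normal truncated below at z_fit, then predicts
    how many z-values should fall between z_sig and z_fit.
    """
    z = np.asarray(z)
    tail = z[z > z_fit]

    def neg_loglik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        # log density of a normal truncated below at z_fit
        ll = stats.norm.logpdf(tail, mu, sigma) - stats.norm.logsf(z_fit, mu, sigma)
        return -ll.sum()

    fit = optimize.minimize(neg_loglik, x0=[3.0, 0.0], method="Nelder-Mead")
    mu, sigma = fit.x[0], np.exp(fit.x[1])

    # Predicted count in (z_sig, z_fit), scaled to the number of z-values above z_fit.
    p_window = stats.norm.cdf(z_fit, mu, sigma) - stats.norm.cdf(z_sig, mu, sigma)
    p_tail = stats.norm.sf(z_fit, mu, sigma)
    expected = len(tail) * p_window / p_tail
    observed = np.sum((z > z_sig) & (z <= z_fit))
    return observed, expected
```

Applied to data like those in Figure 3, the observed count should clearly exceed the expected count when just-significant results have been p-hacked into the literature.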

Figure 3 shows the results for the same simulation that was used for the p-curve in Figure 2. The figure shows that there are many more observed (purple bars) than expected (dotted red line) values between 2 and 2.8.

The simulations examine the type-I and type-II error rates of this approach to the detection of p-hacking.

The Simulation

The simulation varied the cell sizes of a two-group design (n = 15, 20, 30), mean effect sizes (0 to 1 in steps of .2), and heterogeneity in effect sizes with standard deviations of 0, .2, .4, .6, and .8. This covers common ranges of heterogeneity in meta-analyses. Each condition was run until there were 10,000 successful attempts, that is, attempts with at least one significant result. The actual number of significant results could be higher when power was high because more than one of the seven tests could produce a significant result.
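
A compact sketch of this design is given below. It is a shorter variant of the earlier patchwork sketch: effect-size heterogeneity is implemented by drawing each small study's true effect from a normal distribution, and observed effects are simulated directly with a known-variance z-test (standard error of d is sqrt(2/n)) rather than raw data and t-tests, to keep the code brief. Names and the number of successful attempts are illustrative, not the chapter's actual settings.

```python
import numpy as np
from itertools import product
from scipy import stats

rng = np.random.default_rng(2025)

cell_sizes   = [15, 20, 30]                          # n per group
mean_effects = np.round(np.arange(0, 1.2, .2), 1)    # mean d: 0, .2, ..., 1
effect_sds   = [0, .2, .4, .6, .8]                   # heterogeneity (SD of d)

def simulate_attempt(mean_d, sd_d, n, alpha=.05):
    """One patchwork attempt: three small studies with heterogeneous true effects."""
    d_true = rng.normal(mean_d, sd_d, 3)              # study-specific true effects
    d_obs = rng.normal(d_true, np.sqrt(2 / n))        # observed standardized effects
    subsets = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
    for s in subsets:
        pooled = d_obs[list(s)].mean()                # equal-n studies, simple pooling
        z = pooled / np.sqrt(2 / (n * len(s)))
        p = 2 * stats.norm.sf(abs(z))
        if p < alpha:
            return p
    return None

# 3 cell sizes x 6 mean effects x 5 SDs = 90 conditions.
for n, mean_d, sd_d in product(cell_sizes, mean_effects, effect_sds):
    significant_p = []
    while len(significant_p) < 1_000:                 # 10,000 in the chapter; fewer for a quick run
        p = simulate_attempt(mean_d, sd_d, n)
        if p is not None:
            significant_p.append(p)
    # significant_p now holds the selectively reported, p-hacked results
    # that would be converted to z-values and passed to z-curve.
```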

The data were analyzed once without p-hacking using the results of the three small samples and once with p-hacking. The simulation without p-hacking was used to estimate the type-I error of the p-hacking bias test. The other simulations were used to examine the influence of p-hacking on z-curve estimates and to estimate the type-II error of the p-hacking bias test. Z-curve was fitted once with the default method that does not make distribution assumptions and once with a single component that assumed a normal distribution of power.

Results

Inflation of Success Rates

P-hacking increases success rates. When the null-hypothesis is true, the inflation of the success rates increases the risk of a type-I error (a false rejection of the null-hypothesis). However, when there is a true average effect, p-hacking essentially increases power to reject a false null-hypothesis. The problem is that we do not know whether the null-hypothesis is true or false. Thus, p-hacking is risky and invalidates the assurance that no more than 5% of tests can falsely reject the null-hypothesis.

Figure 4 shows how much the patchwork method simulated here inflated success rates.

This p-hacking method boosted the success rate from 5% to 19% when the null-hypothesis was true. On average, it increased the power to detect a real effect by 28%, but the figure shows that the amount depends on the real power. With real effects and low power, success rates are inflated by up to 50%. This does not justify p-hacking, but it does show the appeal of p-hacking.

P-Hacking Bias Test

The p-hacking bias test was significant in 9 out of 90 simulations (10%) for the default method and in 12 out of 90 (13%) for the normal z-curve. In the simulations with bias, the default method produced 52 out of 90 (58%) significant results and the normal z-curve produced 74 out of 90 (82%) significant results. Importantly, these detection rates were obtained with very large samples of studies. In short, the p-hacking bias test is promising, but non-significant results do not imply that there is no bias. Thus, it is important to examine how p-hacking influences z-curve estimates.

Z-curve Estimates of the Expected Replication Rate (ERR)

The overall performance of the estimation methods can be examined with the root mean square error (RMSE), where the error is simply the difference between the simulated true ERR and the z-curve estimate. There are four z-curve estimates: the default and the normal model, each fitted either to all significant values or only to z-values greater than 2.8 to avoid bias from p-hacked just-significant results.
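
For reference, RMSE is just the square root of the average squared difference between the true value and the estimate across the 90 simulation conditions; a minimal sketch with illustrative names:

```python
import numpy as np

def rmse(true_values, estimates):
    """Root mean square error across simulation conditions."""
    error = np.asarray(estimates) - np.asarray(true_values)
    return float(np.sqrt(np.mean(error ** 2)))
```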

| Method | RMSE |
|---|---|
| Z-curve default, all | .075 |
| Z-curve normal, all | .070 |
| Z-curve default, > 2.8 | .064 |
| Z-curve normal, > 2.8 | .035 |

The results are similar, but the normal method works better because the simulation used normal distributions of effect sizes. Excluding the p-hacked just-significant results also produced better estimates.

The figure shows the superior performance when just-significant results are excluded. Even if the bias test is not significant, excluding just-significant results may be useful to reduce the influence of p-hacking, or at least to examine the robustness of the results.

Z-curve Estimates of the Expected Discovery Rate (EDR)

Estimating the distribution of all z-values from the distribution of only the significant results (or only z > 2.8) is difficult, especially when p-hacking is used. Nevertheless, the RMSE values show some accuracy. The normal model outperforms the default method because it is easy to predict the missing values from a truncated normal model. This works in these simulations because the assumption fits the simulated data. However, using only z-values greater than 2.8 does not improve performance because extreme values provide less information about the distribution of non-significant z-values.

| Method | RMSE |
|---|---|
| Z-curve default, all | .146 |
| Z-curve normal, all | .066 |
| Z-curve default, > 2.8 | .180 |
| Z-curve normal, > 2.8 | .096 |

Using only z-values greater than 2.8 leads to overestimation of the EDR.

One way to examine the possible influence of patching samples would be to conduct separate analyses for subsets of studies with different sample sizes. In the present case, the smaller samples should produce better estimates because they are not p-hacked, whereas the larger samples are the product of patching. If this is the case, the results from the smaller samples could be used.


Conclusion

P-hacking creates a problem for selection models that simply assume that non-significant results are not reported, because p-hacking changes the distribution of the test-statistics and it is not known which p-hacking methods were used. This simulation examined one method, which leads to underestimation of the ERR and EDR. The p-hacking bias is difficult to detect and also difficult to correct. At present, my recommendation is to compare results across different models (sensitivity analysis), but to focus on the results of the default method. The underestimation of the EDR and ERR can be considered a penalty for p-hacking that creates an incentive to avoid it. In the present scenario, researchers conduct many small studies. If a real effect is present, they could report all studies, including non-significant ones, and either meta-analyze them or combine them into a single dataset. In contrast, not publishing the non-significant results to present a perfect picture of only successes undermines the credibility of the published results (Schimmack, 2012).

How to Measure Happinesses

Social science is undermined by slippery language and sloppy measurement. The term happiness (singular) has led to a lot of confusion about the meaning of happiness and its twin, wellbeing. It seems to imply that there is one happiness, one way to have high wellbeing, and one way to live a good life. Philosophers have tried to define this elusive construct without success. Not a problem for social scientists. They just operationalize constructs and bypass any philosophical problems about the meaning of the words that they use as labels for their happiness or wellbeing measures. Wellbeing is simply whatever my wellbeing measure measures, and of course my measure is better than the other measures. This mindless approach to measurement has given us hundreds of happiness measures, and unless Reviewer 2 stops them, we will get more every day (see my review below).

The multitude of happiness measures shows, of course, that researchers do not agree, because there is no single, universal definition of happiness; people simply do not agree on one. In other words, there are happinesses (plural), and the first task for everybody is to figure out what happiness means for them.

So, how can we measure happinesses, for example, to rank nations in terms of the average happinesses of their citizens? The solution to this problem comes from a science that is not even recognized as a distinct science, namely public opinion research. In the 1960s, public opinion researchers simply started asking people how happy or how satisfied they are with their lives. They did not give people a definition of happiness or ask specific questions based on a deep theory of happiness, because they treated happiness like other topics; that is, as a personal opinion.

This approach to the measurement of happiness was later formalized in measures of subjective well-being (Diener, 1984) and given a theoretical framework in philosophy (Sumner, 1996). The core assumption of subjective indicators of wellbeing, or subjective well-being measures, is that people have different conceptions of ideal lives (plural), but we can measure how their actual lives compare to their ideal lives.

Not all social scientists are happy with this approach, especially when studies show results that they do not like. To avoid the pesky problem that people do not define happiness the way they would like, researchers develop measures that fit their own views and produce results that they like. For example, when the average life-satisfaction of citizens in rich countries is higher than in poor countries because money is essential for the fulfillment of basic needs, they create measures that are not correlated with money, because they have the romantic idea that 8 billion people on this planet would be better off living off the land. So, they create a wellbeing measure that is not correlated with national wealth. This is, of course, not science; it is politics pretending to be science.

Given the lack of clear standards for the validation of measures in the social sciences, it is important to realize that scale names have no meaning and no construct validity, because the constructs are not defined. Wellbeing is just a cool name that says “my measure is really important.” If you agree that there is no single wellbeing or happiness, you need to focus on work with life-satisfaction judgments that require people to define happiness for themselves.

My Review

The manuscript starts with the common assumption that there is one true happiness (well-being) and that the goal of wellbeing science is to define and measure it. This mindset has led to the creation of hundreds of different definitions and measures. It is probably time to recognize that the search for a single construct of happiness is futile. There is not one happiness. There are many happinesses.

The plurality of happiness creates a problem for the use of happiness as a social indicator or policy goal. If there is not one true happiness, who gets to decide which definition of happiness is used to measure people’s happiness? The king of Bhutan, with his clever slogan of Gross National Happiness? Not everybody may agree with this definition.

The appeal of subjective wellbeing indicators is that they do not impose a definition of wellbeing on the people whose wellbeing is measured. Ideally, the measure reflects their own definition of wellbeing rather than a definition that they do not share. This approach solves the problem of plurality because people can have different definitions of wellbeing and we can still measure the average wellbeing of a population.

This does not automatically imply that life-satisfaction judgments are perfectly valid measures of subjective wellbeing. Far from it. The observed scores can be biased by many factors, including a bias towards the consideration of material factors, as Kahneman argued for some time.

However, correlations at the individual level show that money plays a relatively small role in people’s life-satisfaction judgments. Personality dispositions and satisfaction with life-domains that are not affected by income (e.g., family relationships) also play a role.

At the level of national averages, the correlation is strong, but that may simply show that fulfillment of basic needs varies a lot across nations and is important for wellbeing. It does not automatically undermine the validity of life-satisfaction judgments.

The problem for other approaches is to demonstrate that they are valid measures of people’s wellbeing. The authors’ approach is questionable. They look for variables that are not related to economic conditions, under the assumption that the strong correlation with life-satisfaction reflects a bias. However, they provide no evidence that it is a bias. Money buys shelter, food, medicine, education, and many other good things that people need to even start thinking about a good life. To claim that a measure that shows no relationship with economic conditions is a valid measure of well-being requires evidence, and the authors do not provide such evidence. Therefore, the claim that “this study provides fresh evidence that psychosocial well-being represents a distinct dimension of national well-being” lacks support.

To examine fundamental questions in wellbeing science requires deeper reflection about the meaning of happiness and the limitations in measuring such an elusive construct.