Category Archives: Uncategorized

Is the Implicit Association Test Too Big To Fail?

The Implicit Association Test (IAT) was introduced in the late 1990s as a measure of implicit associations, cognitions, and attitudes that might not be fully captured by self-report measures. The prospect of assessing socially sensitive constructs such as prejudice using a brief reaction-time task made the method widely appealing. Early publications emphasized that implicit and explicit measures could diverge and that the IAT might detect evaluative processes that are difficult to access through introspection.

However, the IAT spread rapidly within social psychology before receiving the kind of psychometric validation typically applied to psychological tests. As a result, modest correlations between IAT scores and self-ratings were often interpreted as evidence that the IAT measures a distinct, implicit construct, rather than as possible indicators of measurement error. Theoretical discussions also suggested that implicit attitudes could reflect introspectively inaccessible processes.

An alternative interpretation of the low implicit–explicit correlations is that IAT scores contain substantial method-specific and error variance. This view is supported by three patterns in the literature.

First, different implicit measures—such as the evaluative priming task and the affect misattribution paradigm—tend to show low convergent validity with the IAT.

Second, meta-analyses consistently find that IAT scores predict behavior only weakly, often less strongly than explicit measures, and provide limited incremental validity over self-reports.

Third, latent variable analyses show that once measurement error and method variance are modeled explicitly, a single-factor model often fits as well as, or better than, two-factor models that assume distinct implicit and explicit constructs.

Given this background, an adversarial collaboration on the validity of implicit measures provided an important opportunity to evaluate competing optimistic and pessimistic interpretations of the evidence in a joint project (Axt et al., 2025). However, this project did not include a psychometrically trained critic of IAT research and did not respond to the challenges raised in Schimmack (2020). This omission may help explain why the published model rests on strong identification assumptions that were challenged by Schimmack (2020) and that strongly influence the results.

Figure 1 of the article specifies a model in which implicit and explicit attitudes are represented by separate latent variables with a correlation of r = .41. Implicit attitudes are treated as a factor identified by four tasks: the standard IAT, the single-category IAT, the evaluative priming task (EPT), and the affect misattribution paradigm (AMP). However, the factor loadings reveal that two of the measures contribute little to the factor. The loading of the EPT is approximately .08, implying that less than 1% of its variance reflects the common factor. The AMP loading of .25 similarly implies that most of its variance is unique. This leaves the two IAT variants as the primary indicators of the latent implicit factor, which makes it impossible to distinguish construct variance that may reflect implicit racial biases from shared method variance between two nearly identical tasks (one IAT with Black and White pictures, one with only Black pictures).

The published model constrains the residuals of the two IATs to be uncorrelated, thereby assuming no shared method variance. This assumption is questionable because both tasks use nearly identical procedures, differ only in stimuli, and are known to correlate strongly for method-related reasons. Previous psychometric work has shown that IATs commonly exhibit substantial method variance, and omitting such variance can lead to inflated estimates of discriminant validity (Schimmack, 2020).

To evaluate this empirically, alternative models can be fitted to the data. A one-factor model in which all 12 measures load on a single latent attitude factor does not achieve standard fit criteria (χ²(54) = 331.43, CFI = .914, RMSEA = .065). Inspection of modification indices identifies three large residual correlations: (a) between the two explicit self-report measures (SRS and ANES), (b) between the two IATs, and (c) a smaller correlation between two behavioral tasks. Adding these three theoretically plausible residual correlations yields excellent model fit (χ²(51) = 69.51, CFI = .994, RMSEA = .017). Under this model, the data no longer support a clear distinction between implicit and explicit latent factors.
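For readers who want to verify this comparison, note that the two models are nested (the second simply frees three residual correlations), so the reported fit statistics imply a chi-square difference test. The sketch below uses only the values quoted above and does not require access to the raw data.

```python
from scipy.stats import chi2

# Fit statistics quoted above for the two nested models
chisq_1factor, df_1factor = 331.43, 54     # one-factor model, no residual correlations
chisq_modified, df_modified = 69.51, 51    # same model plus three residual correlations

# Chi-square difference test for nested models
delta_chisq = chisq_1factor - chisq_modified   # 261.92
delta_df = df_1factor - df_modified            # 3
p_value = chi2.sf(delta_chisq, delta_df)

print(f"Δχ²({delta_df}) = {delta_chisq:.2f}, p = {p_value:.2e}")
# The three residual correlations improve fit dramatically, consistent with strong
# method variance shared by the two self-report measures and by the two IATs.
```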

In contrast, the published model includes the residual correlation between the two self-report measures but not between the two IATs. Modification indices still suggest a substantial residual correlation between the two IATs (MI ≈ 43), but adding this parameter leads to identification problems under the authors’ original specification. This indicates that conclusions about discriminant validity depend heavily on the assumption that the two IATs do not share method variance—a strong assumption that may not be justified.

Alternative specifications are possible. For example, assuming equal validity for the IAT and AMP (given their similar correlations with the behavioral tasks) yields an estimated implicit–explicit latent correlation of approximately .74 (95% CI: .56–.92). Under this model, the evidence for distinct implicit and explicit constructs becomes substantially weaker, although a perfect correlation (r = 1) can still be rejected at the conventional 5% type-I error criterion because the confidence interval does not include 1.

Finally, it is important to distinguish between the predictive validity of latent variables and that of observed scores. The latent implicit factor in the authors’ model predicts the latent behavioral disposition with a standardized effect of about .16, which is small and consistent with prior meta-analytic estimates. However, because IAT scores include considerable error and method variance, the predictive validity of raw IAT scores is smaller still.
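The gap between latent and observed predictive validity follows from the classic attenuation formula: the observed correlation equals the latent correlation multiplied by the square roots of the proportions of valid variance in the two measures. The sketch below assumes, for illustration only, that about half of the variance in observed IAT scores is construct variance; the exact proportions are assumptions, not values reported in the article.

```python
import math

latent_validity = 0.16          # standardized effect of the latent implicit factor (from the model)
iat_valid_variance = 0.50       # assumed proportion of construct-valid variance in observed IAT scores
behavior_valid_variance = 0.80  # assumed proportion of valid variance in the behavioral criterion

# Attenuation: r_observed = r_latent * sqrt(rel_X) * sqrt(rel_Y)
observed_validity = latent_validity * math.sqrt(iat_valid_variance) * math.sqrt(behavior_valid_variance)
print(f"Observed-score validity ≈ {observed_validity:.3f}")  # ≈ .10 under these assumptions
```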

In summary, the conclusions of the adversarial collaboration depend largely on an identification constraint that prevents modeling residual covariance between two highly similar IAT tasks. When this assumption is relaxed, alternative models fit the data well and yield substantially higher correlations between implicit and explicit attitudes. This suggests that the data do not provide strong evidence for discriminant validity between implicit and explicit constructs once method variance is taken into account.


Optimism Is Not a Scientific Interpretation

A central limitation of the article is that it frames its findings partly in terms of “optimistic” versus “pessimistic” interpretations. Scientific evaluation should not depend on an emotional framing; it should rest on empirical evidence and rigorous measurement.

For example, the authors write that their results “could be viewed as encouraging for the predictive validity and utility of indirect measures,” noting that the study demonstrates that implicit attitudes “can reliably correlate with socially important behavioral outcomes” and explain variance beyond self-reports (Kang, 2024). However, the effect sizes reported in the study are consistent with 20 years of prior research in which implicit measures—whether IAT, evaluative priming, or AMP—showed small predictive validity and minimal incremental prediction beyond explicit measures. The present findings do not change this conclusion, especially when the shared method variance between IATs is taken into account.

In addition, the behavioral outcomes examined in the article are laboratory-style tasks that are only loosely connected to the real-world consequences of prejudice. To evaluate the societal relevance of implicit attitudes, research would need to examine behaviors that matter directly for marginalized groups—such as treatment in healthcare settings, hiring decisions, or the use of force in policing. Even then, decades of social psychology show that self-report attitudes (and attitude-related latent variables) have limited ability to predict specific behaviors, especially behaviors that are rare, highly constrained, or context-dependent. Thus, asking whether the IAT predicts such outcomes may be the wrong question.

A more informative scientific approach may be to study meaningful social behaviors directly, and to investigate the situational and structural conditions that shape them, rather than relying on reaction-time–based measures of attitudes. Put differently, the field may gain more by studying social behavior without attitudes than by continuing to study attitudes without social behavior.


Brian Nosek’s Blind Spot in Open Science Reforms: Validity Concerns about the IAT

Brian Nosek has been one of the most influential figures in the open science movement. As co-founder of the Center for Open Science and a leading voice in efforts to reform psychological research, he has played a central role in changing how scientists think about transparency, reproducibility, and methodological rigor.

1. Nosek’s leadership in the open science movement

Nosek was a key organizer of the Reproducibility Project: Psychology (Science, 2015), a landmark collaboration that revealed how difficult it was to replicate many published findings in psychology. The largest replication failures were concentrated in his own field of social psychology. This project helped catalyze sweeping reforms, including preregistration, open data, open materials, and improved statistical practices.

2. Reforms have addressed the replication crisis—but not the validation crisis

While these reforms have improved replicability, psychology also suffers from a validation crisis (Schimmack, 2021): many widely used measures have never been rigorously validated. Some lack evidence for reliability or construct validity. Others are used in contexts where accuracy is crucial—such as providing personal feedback or informing high-stakes decisions—despite insufficient psychometric support.

Replication ensures that results can be reproduced across studies. Validation ensures that a measure actually captures the construct it claims to measure. The second problem is deeper, and in many ways more fundamental. What is the point of replicating a result with an invalid measure?

3. Project Implicit and the IAT as an example of the validation crisis

Nosek co-founded Project Implicit, a public-facing website that provides users with interpretations of their supposed “implicit attitudes” using the Implicit Association Test (IAT). These interpretations include highly self-relevant domains such as racial prejudice, depression, or suicidality.

4. What the IAT can measure — and what it has never been shown to measure

To be clear, this critique is not about whether the IAT has some validity. It does. The IAT correlates modestly with self-report measures of the same constructs, and for statistical reasons it can show small increments in predictive validity when both measures are combined. This is consistent with treating the IAT as an alternative, less reliable way of measuring the same underlying attitudes that people can report about themselves.

The real issue is different:

The IAT has been widely promoted and interpreted as measuring a distinct “implicit” construct that is not accessible to introspection. That claim has never been scientifically validated.

After more than 20 years of research, there is no evidence that the IAT reveals hidden attitudes or unconscious biases that individuals cannot detect through honest self-reflection or awareness of their own thoughts, feelings, and behavior.

Many uses of the IAT—including individualized feedback on Project Implicit—depend on the assumption that the test measures something qualitatively different from self-report. Without evidence for such a construct, these uses have no scientific justification.

5. A blind spot: inconsistent standards between open-science ideals and IAT feedback

In his open-science work, Nosek strongly advocates for:

  • transparency
  • rigorous validation
  • honest communication of uncertainty
  • avoidance of overstated claims
  • reliable measurement
  • accurate interpretation of data

Yet Project Implicit continues to provide categorical, diagnostic-like feedback (“slight,” “moderate,” “strong” bias) without reporting confidence intervals, measurement error, or test–retest variability. Users often interpret these labels as precise and meaningful, even though the underlying measurement error is large enough to make the point estimate uninformative (Schimmack, 2025).

When concerns about validity have been raised—including the absence of evidence that the IAT measures anything inaccessible to introspection—Nosek has not addressed them directly. Instead, he notes that the IAT should not be used for high-stakes decisions and that Project Implicit includes disclaimers. However, these disclaimers appear on secondary pages, are not shown alongside the feedback, and use technical language that lay users are unlikely to understand (Schimmack, 2025).

There is also an important ethical dimension. In psychological research, any study that provides participants with false or misleading feedback requires explicit IRB approval for deception and a thorough debriefing that explains the deception clearly. The feedback practices used by Project Implicit would not meet these ethical standards: users are led to believe that their scores meaningfully reflect “implicit attitudes,” yet they are never fully debriefed that individual IAT results are too unreliable and invalid to support such interpretations. Instead of a clear debriefing, users receive vague disclaimers that fall short of what would ordinarily be required in research involving misleading information.

This is another example of the blind spot: practices that would be flagged as ethically problematic in research settings are treated as acceptable in public outreach, even though open-science reforms call for communicating the limitations of psychological science openly, transparently, and with humility.

6. Neglect of scientific criticism conflicts with open-science reforms

Open science demands:

  • transparent reporting
  • clear communication of uncertainty
  • avoidance of claims unsupported by evidence
  • consistent standards across all methods

However, scientific criticisms of the IAT—especially concerns about its lack of construct validity as an individual diagnostic tool—have not been addressed with the same rigor and candor that Nosek applies to other areas of psychological science.

This inconsistency creates a tension between open-science ideals and the continued public presentation of the IAT.

7. The psychological irony: a blind spot about implicit bias

This post highlights a tension that is obvious to outside observers: there is a conflict between championing transparency and honesty in psychological science and overlooking substantial, long-standing criticism of the IAT as a valid measure of implicit attitudes.

The replication crisis has shown that such contradictions are often difficult to recognize from within a research tradition, and only a few scientists have openly acknowledged being wrong (Kahneman, 2017).

In this sense, Brian Nosek’s response is a classic example of what psychologists call a “blind spot.” This does not imply anything about intent; blind spots operate outside of conscious awareness. A direct response from Brian Nosek would help clarify whether he sees this contradiction himself. Interestingly, empirical research suggests that participants are often aware of the biases that the IAT claims to reveal. It would therefore be informative to learn whether Nosek is aware of the discrepancy between advocating for open-science reforms and defending the continued use of the IAT feedback on Project Implicit.

8. The way forward: applying open science to measurement validity

For psychology to continue improving as a science, the standards of open science—transparency, empirical humility, rigorous validation—must apply equally to all psychological measures, including influential legacy instruments like the IAT.

For Project Implicit, this would require:

  • avoiding categorical labels without uncertainty metrics
  • clear communication of measurement error
  • confidence intervals around individual scores
  • transparent statements about what the IAT can and cannot measure
  • refraining from implying that the test reveals hidden mental content

Open science has pushed the field forward by demanding uncomfortable honesty about methods. Addressing the blind spot around IAT validity would extend those principles where they are most needed.


Why You Should Not Trust Your Project Implicit Feedback


Project Implicit’s IAT feedback provides users with categorical interpretations of their supposed implicit (outside of awareness) attitudes or mental-health-relevant traits (e.g., prejudice, depression), using labels such as “slight,” “moderate,” or “strong.” These labels create the impression of diagnostic, individualized insight. However, after more than 20 years of research, there is no evidence that the IAT can reveal attitudes, biases, or mental states that individuals cannot already access through introspection or self-reflection. In the post below, I explain why this feedback is misleading, why the accompanying disclaimers do not solve the problem, and why scientifically honest reporting of measurement error—such as providing confidence intervals—would make clear that the point estimates are too imprecise to support meaningful individual-level interpretation. Despite this, such transparency is not provided to the trusting users of the test.


I have repeatedly voiced concerns about providing invalid feedback on the Project Implicit website, to no avail. You can see a response from Project Implicit and a neutral evaluation of the response by ChatGPT below my evaluation of Project Implicit’s practices.

The Implicit Association Test Does Not Know You Better Than You Know Yourself

1. Information users may receive from common AI searches
Individuals who search questions such as “Can I trust my IAT results?” or “Is my Project Implicit feedback accurate?” typically encounter reassuring and optimistic answers. These answers often state that the IAT provides “insight into automatic associations,” is “widely used in research,” and offers “useful information for self-reflection,” while acknowledging general limitations. The overall tone tends to frame the results as meaningfully informative, even if not perfect.

2. Why this information is misleadingly optimistic
Although such descriptions are factually correct in a narrow sense (the IAT is widely used and has inspired research), they give users an inflated impression of the test’s ability to reveal something meaningful about their personal implicit attitudes. The scientific literature shows that individual IAT scores have low reliability, substantial variability, and weak predictive validity for real behavior. Users are rarely made aware that their specific score may have little practical or psychological meaning.

3. The existence of disclaimers on the Project Implicit website
Project Implicit does include disclaimers noting that the IAT is not designed to diagnose individuals, that results should not be used in hiring or decision making, that scores fluctuate, and that interpretation is uncertain. These disclaimers, in principle, acknowledge the limitations of the feedback.

4. Why these disclaimers are ineffective for lay audiences
However, these warnings are not presented at the point where users receive their results. Instead, they appear in supplementary FAQ pages or linked sections that many users never see. Even when read, the language is technical and abstract, making it difficult for non-experts to understand the practical implications. Most users will focus on the categorical labels in the feedback (“slight preference,” “moderate preference”), which convey a much stronger sense of precision than the disclaimers imply.

5. A more transparent approach: report estimates with margins of error
A scientifically standard way to present uncertain measurements is to provide a point estimate together with an appropriately calculated margin of error, such as a confidence interval. This is routine in other domains, such as opinion polling. Presenting IAT scores with confidence intervals would give users a realistic sense of the precision—or imprecision—of their result and discourage overinterpretation.
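To make this concrete, here is a minimal sketch of what such feedback could look like. The test–retest reliability and the score standard deviation are assumptions chosen to be roughly in line with published estimates for race IATs, not official Project Implicit values; the category cutoffs (.15, .35, .65) are the conventional D-score thresholds.

```python
from scipy.stats import norm

d_score = 0.30             # hypothetical individual IAT D score ("slight to moderate preference")
sd_scores = 0.40           # assumed standard deviation of D scores in the population
retest_reliability = 0.50  # assumed test-retest reliability of a single IAT session

# Standard error of measurement and a 95% confidence interval for the individual score
sem = sd_scores * (1 - retest_reliability) ** 0.5
z = norm.ppf(0.975)
lower, upper = d_score - z * sem, d_score + z * sem

print(f"D = {d_score:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
# With these assumptions the interval runs from about -0.25 to 0.85, spanning every
# feedback category from a reversed preference to a "strong" preference (cutoffs .15/.35/.65).
```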

6. Why Project Implicit likely avoids presenting uncertainty intervals
If Project Implicit were to display confidence intervals that incorporate both random measurement error and systematic sources of uncertainty, the resulting intervals would be large enough to span multiple interpretation categories and, in many cases, even include the opposite direction of effect. This would make it clear that the point estimate is too imprecise to support meaningful individual-level interpretation. Such transparency would be scientifically appropriate but would also undermine the intuitive, categorical feedback that many users currently believe provides insight.


Response from Project Implicit

Hello Professor Schimmack,

Thank you for your message. We appreciate you taking the time to share your concerns about the demonstration site. The folks at UVA asked me to follow up with you regarding your message.

Since July 2022, when you last contacted Project Implicit about your concerns, a committee of researchers from Project Implicit’s Scientific Advisory Board (SAB), the entity that manages the demonstration site, has updated the Frequently Asked Questions, the Ethical Considerations, as well as the debriefing shared with the participants upon completion of the IAT with more explanation about the measure and how feedback can be interpreted. Those changes were reviewed and approved by the other six members of the SAB.

Regarding the snippet of the disclaimer you included in your original message, if you read the full disclaimer, you’ll see that we directly address the validity of the suggested interpretations. Quoting from the screenshot below: “However, these Universities, as well as the individual researchers who have contributed to this site, make no claim for the validity of these suggested interpretations.” The sentence that follows then suggests that if a participant is unprepared to encounter interpretations they might find objectionable, they should not proceed.

While we recognize that there’s always room for improvement, we are confident that Project Implicit is appropriately communicating the complexity of this particular measure to the public. We are not hiding from scientific disagreement. Rather, we appreciate but disagree with your assessment of insufficient or misleading feedback.

Thank you for taking the time to contact us again about your concerns. 

I hope you have a productive fall semester!

Amy Jin Johnson (she/her)
Executive Director
Project Implicit, Inc.


ChatGPT Evaluation of the Response

1. Tone and structure of the response

The tone is polite, institutional, and designed to convey:

  • appreciation of your concerns
  • confidence in their process
  • reassurance that updates have been made
  • the suggestion that your perspective has been “heard” but ultimately rejected

This is standard for organizations seeking to maintain public-facing legitimacy while minimizing controversy.

It is, however, a non-response to the core scientific issues you raised.


2. What they emphasize — and why it is strategically chosen

They emphasize:

  • edits to FAQ, ethical considerations, and debriefing
  • review by their own Scientific Advisory Board
  • existence of a disclaimer stating they “make no claim for validity”
  • that participants are told not to proceed if they may find interpretations objectionable
  • that they “aren’t hiding from scientific disagreement”
  • that they “disagree with your assessment”

Why these points are strategic:

These are process-based defenses, not substantive scientific defenses.
They redirect the issue away from:

  • reliability
  • validity
  • measurement error
  • misleading categorical feedback
  • failure to display uncertainty
  • known instability of individual IAT scores
  • whether point estimates mean anything at all

and instead highlight:

  • they made changes
  • their committee approved the changes
  • a disclaimer exists

This is a classic institutional move: frame the problem as communication style, not measurement validity.


3. What they omit — and why that matters

The response does not address the core scientific issue that:

Providing a single point estimate with categorical labels implies validity that the test does not have.

They do not respond to your actual concerns:

Your concern 1:

“Feedback is misleadingly precise and suggests diagnostic value.”
→ They do not address this.

Your concern 2:

“Disclaimers are hidden, ambiguous, and not placed at the point of interpretation.”
→ They do not dispute this.

Your concern 3:

“The IAT lacks sufficient reliability and validity for individual-level interpretation.”
→ They avoid the psychometric argument entirely.

Your concern 4:

“A scientifically honest presentation would include confidence intervals.”
→ They do not address this at all.

Your concern 5:

“Confidence intervals would reveal that the feedback is not meaningful.”
→ Silence.

Their “we disagree” is not accompanied by counterevidence, counterarguments, or any engagement with data.


4. Their central claim is rhetorical, not scientific

Their strongest claim in the email is:

“we are confident that Project Implicit is appropriately communicating the complexity…
We appreciate but disagree with your assessment of insufficient or misleading feedback.”

This is not a scientific argument.
It is an assertion of institutional confidence.

It avoids:

  • citing reliability coefficients
  • citing predictive validity
  • addressing confidence intervals
  • addressing test–retest instability
  • addressing the meaningless nature of a single IAT score
  • addressing the mismatch between disclaimers and feedback
  • explaining why they choose not to display uncertainty

Thus:
They did not rebut your argument — they simply rejected it.


5. Their key defense is the phrase: “we make no claim for the validity of these suggested interpretations.”

This is a very weak defense.

Why?

Because:

  1. It is hidden in an FAQ / secondary page, not shown at the moment of feedback.
  2. The public-facing feedback contradicts this disclaimer by presenting categorical labels as if valid.
  3. The disclaimer is written in vague, technical language, and does not convey practical meaning to lay users.
  4. In practice, users overwhelmingly interpret the categories as indicative of something real.

This is equivalent to:

“We don’t claim the interpretations are valid —
but we will continue presenting them as if they are.”

From a research ethics perspective, this is deeply problematic.


6. Their solution (editing FAQs) does not solve the actual problem

Updating the FAQ, ethical guidelines, or debriefing:

  • does not make the test more reliable
  • does not make individual scores interpretable
  • does not make the feedback scientifically meaningful
  • does not solve the mismatch between claims and evidence

Improving the fine print does not change the false sense of precision created by the categorical labels.

This is what the organization is avoiding.


7. Why they cannot address your real concern

If they acknowledged that:

  • real confidence intervals would be huge
  • individual scores are not meaningful
  • reliability is too low to classify individuals

then:

Project Implicit could no longer offer personalized feedback — which is the entire draw of the website.

Personalized feedback is:

  • psychologically compelling
  • media friendly
  • teaches “something”
  • drives engagement
  • supports training industries

Removing it would undermine the project.

Thus, the avoidance in their response is structural, not accidental.


8. Neutral conclusion

Your email raises scientific concerns.
Their response gives institutional reassurances.

They neither contest your psychometric arguments nor provide evidence against them.

Instead, they:

  • point to procedural updates
  • invoke disclaimers
  • assert confidence
  • decline to engage substantively with measurement error, reliability, validity, or uncertainty reporting

Personality Development is in the Eye of the Beholder


The German Socio-Economic Panel (SOEP) is one of the world’s largest and longest-running household surveys, and its personality data have now been analyzed in dozens of papers. Recently, a 2024 article in the Journal of Personality and Social Psychology used SOEP panel data to claim that adult personality continues to change meaningfully with age. The article, Using Within-Person Change in Three Large Panel Studies to Estimate Personality Age Trajectories by Ingo S. Seifert, Julia M. Rohrer, and Stefan C. Schmukle (2024), presents smooth developmental curves and interprets within-person change across three waves as evidence that adults become less neurotic and more mature as they grow older.

This conclusion sounds reassuring. It fits a popular narrative in psychology that personality develops continuously throughout life. It also fits the discipline’s long-standing trust in longitudinal designs as the supposed gold standard for developmental inference. But the conclusion is false. And surprisingly, the evidence for that comes from an earlier, more rigorous analysis of the same SOEP data (Schimmack, 2021).

The real problem is not the data. It is the assumption that changes in panel data can be automatically interpreted as “aging.” This assumption is never stated, never defended, and empirically untenable. When one analyzes the SOEP data with full awareness of the age–period–cohort structure, the exciting story of lifelong personality development evaporates. What remains are large cohort differences, tiny within-person changes, and clear signs that short-interval shifts reflect period effects rather than age.

The key fact is simple: within-person change over time is always a blend of aging and historical period effects. In a short panel study, such as the three-wave SOEP personality assessments in 2005, 2009, and 2013, these two forces move in perfect lockstep. Everyone ages by the same amount between waves, but everyone also moves through identical historical events. And in most societies, year-to-year cultural and psychological shifts are monotonic: economic cycles, mental health literacy, political climate, and cultural norms change in the same direction for everyone, regardless of age. In that situation, the risk of confusing period effects with aging is not just present; it is greater than the risk of confusing age with cohort in cross-sectional studies. Period shifts across eight years are more uniform, stronger, and more directional than cohort differences spread across sixty.
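The identification problem can be demonstrated in a few lines: because age equals period minus cohort (birth year), the three variables are perfectly collinear, and no regression model can estimate all three effects without an external constraint. The panel years below match the SOEP waves; the birth years are arbitrary illustrations.

```python
import numpy as np

# Hypothetical respondents observed in the three SOEP personality waves
birth_years = np.repeat([1950, 1965, 1980], 3)   # cohort
survey_years = np.tile([2005, 2009, 2013], 3)    # period
ages = survey_years - birth_years                # age = period - cohort

design = np.column_stack([ages, survey_years, birth_years])
print(np.linalg.matrix_rank(design))  # 2, not 3: age, period, and cohort are linearly dependent
```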

The SOEP data show exactly this pattern. All cohorts move in parallel across the three waves. Young adults, middle-aged adults, and older adults shift by nearly the same amount in the same direction. This is the signature of a period effect. If personality truly changed with age, different cohorts and different ages would show different slopes. They do not. Instead, the 2005–2013 interval appears to reflect historical drift in how Germans respond to personality items, not psychological maturation.

The within-person effect sizes tell the same story. Across the SOEP waves, the typical within-person change is small, often around d = 0.1 or 0.2. These effects are close to noise, especially compared to the much larger cohort differences seen in the same data. For example, the difference in openness between the youngest and oldest cohorts is around d = 0.7. When cross-sectional differences are large but longitudinal changes are tiny, the most plausible explanation is generational or historical change, not adult development. The JPSP article ignores this discrepancy completely.

The deeper issue is one of scientific reasoning. The SOEP reanalysis openly confronts the identification problem: age, period, and cohort cannot be separated statistically without strong external assumptions. The JPSP article makes those assumptions silently. It does not test for period effects. It does not examine cohort-specific slopes. It does not report parallel movement across cohorts. It simply treats the observed change as aging because it comes from a longitudinal design. This is not methodological sophistication; it is a conceptual error disguised by journal prestige.

When scientific rigor and statistical identification are taken seriously, the SOEP data provide no evidence of meaningful age effects in adult personality. The large generational differences observed across decades, and the tiny changes observed within individuals over short intervals, tell us that personality is remarkably stable in adulthood and that apparent “developmental” trends are better explained by history than by age. This evidence is also consistent with results in other longitudinal panels and cross-sectional panels (see Schimmack, 2025, for a review).

The notion of lifelong personality development has intuitive appeal and a long tradition in psychology. But it has always rested on shaky methodological ground. The SOEP data, when analyzed correctly, confirm what many demographers and methodologists have long argued: cross-sectional trends confound age with cohort, and longitudinal trends confound age with period. Without confronting this structure directly, no amount of smoothing, modeling, or journal prestige can turn period effects into genuine aging.

Scientific rigor wins here, not the allure of a developmental story. And the rigorous conclusion is clear: the SOEP data do not show aging effects in adult personality. They show cultural change across generations and historical drift over time. The rest is wishful thinking built on untested assumptions.


Personality Science – Table of Content

To access the chapters of this book, you need to register first with your email.
Registration is free. Register now.

About the Author

Chapter 01 – Definition and History of Personality Science

Chapter 02 – Basic Scientific Concepts

Chapter 03 – The Structure of Personality Traits

Chapter 04 – The Measurement of Personality Traits

Chapter 05 – Personality and Situation

Chapter 06 – Stability and Change of Personality

Chapter 07 – Evolutionary Origins of Personality

Chapter 08 – Genes and Environment

Chapter 09 – Biological Causes of Personality

Chapter 10 – Cultural Variation in Personality

Chapter 11 – Cultural Variation in Values and Attitudes

Chapter 12 – Unconscious Processes and Personality

Chapter 13 – Personality and Motivation and Emotion

Chapter 14 – Personality and Life-Outcomes

Making Mountains Out of Molehills: Illusory “Huge” Sex Differences



When you hear claims that “men and women differ a lot,” see effect-size numbers such as D > 2, or read that a set of variables correctly distinguishes men and women with over 90% accuracy, it is worth asking what is being measured and how.

In many cases, researchers have taken many small sex differences (e.g., on spatial ability, aggression, interests) and combined them into a single composite that distinguishes men vs. women. That composite may show a large mean difference — but this does not mean there is a single giant biological cause that explains all of it.

Why this matters:

  • Each individual trait difference may have a modest effect size (say d ≈ 0.2–0.4).
  • Aggregating correlated traits boosts the composite’s reliability and amplifies the mean difference (see the sketch after this list).
  • A large composite difference is useful for classification (distinguishing male vs. female) but does not support the claim of a unified biological process underlying all those traits.
  • Because biological sex can already be determined with perfect accuracy, building a model to predict sex from traits is largely redundant from a causal perspective. What matters is how much sex explains trait variation, not how well traits predict sex.
  • When students or media see a big number, they often infer a big innate difference; but this reverses the direction of causality. We are using differences in masturbation and use of pornography to predict whether somebody is a man or a woman, rather than examining how much sex differences cause variation in behaviors.
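The arithmetic behind this amplification is simple. For a unit-weighted composite of k standardized traits that each show a difference of d and intercorrelate at r, the composite difference is d·sqrt(k / (1 + (k−1)·r)); with uncorrelated traits it grows as d·sqrt(k). The sketch below uses illustrative values for d, k, and r, not estimates from any particular study.

```python
import math

def composite_d(d: float, k: int, r: float) -> float:
    """Standardized difference of a unit-weighted composite of k traits,
    each with difference d and pairwise intercorrelation r."""
    return d * math.sqrt(k / (1 + (k - 1) * r))

# Illustrative values: many modest differences, aggregated across moderately correlated traits
print(composite_d(d=0.3, k=10, r=0.2))   # ≈ 0.57
print(composite_d(d=0.3, k=40, r=0.1))   # ≈ 0.86
print(composite_d(d=0.3, k=40, r=0.0))   # ≈ 1.90: approaching the "huge" D values in such claims
```

In this simplified equal-difference case the unit-weighted value coincides with the Mahalanobis D that is often reported; the point is that the size of the composite difference reflects the number of aggregated traits as much as any single underlying cause.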

Critics have long argued that psychological sex/gender differences are, on average, small and that many claims of large or unified differences collapse under closer scrutiny (Hyde, 2005, 2014). Meta-research shows that for most psychological and cognitive domains the differences are small (Zell et al., 2023; Szymanski & Henning, 2022). The largest and most consistent sex differences are observed for height (d ≈ 1.5), pornography use (d ≈ 1.5), and sex drive (d ≈ 1.0).

Reversing the direction of analysis creates another misunderstanding. When many dimensional traits are used to predict whether someone has XX or XY chromosomes, there is no remaining variability once we correctly classify more than 90 percent of people. But in the opposite direction—even d = 1.5 implies considerable unexplained variation within each group. Some men rarely watch pornography and some women do; some women have higher sex drives than many men. This within-group variation is psychologically meaningful but ignored by analyses that treat variability in predictors as error variance when predicting a dichotomous outcome (male = XY / penis; female = XX / vagina).
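The sketch below quantifies that within-group variation for the largest differences mentioned above, assuming normal distributions and d = 1.5: a single trait of that size classifies individuals well above chance but still leaves substantial overlap and many “exceptions” in both groups.

```python
from scipy.stats import norm

d = 1.5  # roughly the size reported above for height and pornography use

# Proportion of the lower-scoring group above the higher-scoring group's mean
print(norm.cdf(-d))          # ≈ 0.067: about 7% of one group exceed the other group's average

# Accuracy of classifying sex from this single trait with a midpoint cutoff
print(norm.cdf(d / 2))       # ≈ 0.77: far from the near-perfect accuracy implied by composites

# Overlap coefficient: share of the two distributions that overlaps
print(2 * norm.cdf(-d / 2))  # ≈ 0.45
```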

Conclusion

Combining many small effects into one large number is not the same as discovering a deep, singular sex-difference mechanism. There is no scientific purpose in creating a statistical predictor of sex when sex is directly observable. The only reason to compute such values is rhetorical; to make biological effects on variation in personality and other traits appear larger and more coherent than they really are.


References

Archer, J. (2019). The reality and evolutionary significance of human psychological sex differences. Biological Reviews, 94(4), 1381–1415. https://doi.org/10.1111/brv.12507

Eliot, L., Ahmed, A., Khan, H., & Patel, J. (2021). Dump the “dimorphism”: Comprehensive synthesis of human brain studies reveals few male–female differences beyond size. Neuroscience & Biobehavioral Reviews, 125, 667–697. https://doi.org/10.1016/j.neubiorev.2021.03.013

Hyde, J. S. (2005). The gender similarities hypothesis. American Psychologist, 60(6), 581–592. https://doi.org/10.1037/0003-066X.60.6.581

Hyde, J. S. (2014). Gender similarities and differences. Annual Review of Psychology, 65, 373–398. https://doi.org/10.1146/annurev-psych-010213-115057

Joel, D., & Fausto-Sterling, A. (2016). Beyond sex differences: New approaches for thinking about variation in human behavior. Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1688), 20150451. https://doi.org/10.1098/rstb.2015.0451

Lippa, R. A. (2010). Gender differences in personality and interests: When, where, and why? Social and Personality Psychology Compass, 4(11), 1098–1110. https://doi.org/10.1111/j.1751-9004.2010.00320.x

Su, R., Rounds, J., & Armstrong, P. I. (2009). Men and things, women and people: A meta-analysis of sex differences in interests. Psychological Bulletin, 135(6), 859–884. https://doi.org/10.1037/a0017364

Szymanski, D. M., & Henning, S. L. (2022). Are many sex/gender differences really power differences? PNAS Nexus, 3(2), pgae025. https://doi.org/10.1093/pnasnexus/pgae025

Zell, E., Strickhouser, J. E., Sedikides, C., & Alicke, M. D. (2023). The gender similarities hypothesis 2.0: Meta-analytic synthesis of psychological gender differences across the life span. Psychological Bulletin, 149(2), 109–137. https://doi.org/10.1037/bul0000380

The Ideology versus the Science of Evolved Sex Differences


1. Introduction: Competing Stories About Gender

Debates about sex differences often swing between extremes. One narrative, familiar from strands of radical feminism, portrays masculinity as dangerous—a legacy of male violence and domination. The opposite story, popularized by Roy F. Baumeister’s Is There Anything Good About Men? (2010), recasts men as civilization’s heroic builders, unfairly maligned by modern culture. Both stories appeal to emotion and morality more than data.

This essay contrasts Baumeister’s narrative with the actual empirical evidence about evolution and sex (Evolution and Sex Differences in 2025). Unlike dramatic claims that men and women are fundamentally different (“Men Are from Mars, Women Are from Venus”), scientific evidence shows that men and women evolved together with shared goals to maximize adaptive fitness. There are likely biological differences related to genetic variation in the sex chromosomes (XX vs. XY), but even for traits that are strongly influenced by these genes, men and women are not fundamentally different.

2. Baumeister’s Core Thesis

Baumeister’s book advances a provocative claim: cultures “flourish by exploiting men.”
He argues that throughout history men have been socially conditioned—and biologically predisposed—to take greater risks, work harder, and sacrifice themselves for collective benefit.

In his telling, male dominance in politics, science, and business reflects expendability and service, not privilege.

He describes men as driven by status and competition, while women, protected and valued for reproduction, focus on relationships and security.

The argument is moral as much as evolutionary. Baumeister insists he speaks “as a scientist,” yet the book only mentions data that support his ideology. The story drives the data; the data do not shape the theory. Data are only used when they verify a claim, never to falsify one—a hallmark of pseudoscience, as Karl Popper argued that genuine science advances by subjecting its theories to potential falsification.

He rarely quantifies differences or cites effect sizes, and he dismisses feminism and patriarchy as conspiracy theories. Instead, he offers anecdotes about male teachers, childbirth, and marital infidelity as evidence of “how the world works.”


3. What Empirical Science Shows

The cumulative evidence from behavioral genetics, developmental endocrinology, and cross-cultural psychology paints a more complex picture (Schimmack, 2025).

1. Magnitude of differences. An undisputed evolved sex difference is the height difference between men and women. The standardized effect size is about 1.5 standard deviations. While this number is abstract, it can serve as a benchmark for potentially evolved sex differences. Most psychological sex differences are small to moderate in size (average d ≈ 0.3–0.5). Distributions overlap substantially—typically more than 70%.

2. Outdated evolutionary theories also ignore that most traits are influenced by genes on the 22 pairs of autosomes, which are recombined during reproduction and do not allow for biological sex differences. Any biological differences like those in height are rooted in the fact that men have a Y-chromosome and only one X-chromosome. For example, red-green color blindness is caused by a recessive gene on the X-chromosome and is more common in men because a single copy of the gene is expressed when only one X-chromosome is present, whereas women need two copies.

3. Claims about achievement are especially fragile. Sex differences in achievement-related traits (e.g., Conscientiousness) are very small and tend to favor women, and once women are given the opportunity to compete, they do as well as men. As a psychologist, Baumeister should know this, because the sex ratio in psychology departments has shifted dramatically since the 1950s, when gender bias made it difficult for women to pursue academic careers.

In short, scientific evidence shows that men and women are probabilistically different yet fundamentally similar; two overlapping variations of one cooperative species. Baumeister may not realize this because we all suffer from the false consensus bias; that is, we overestimate how many people are like us. Baumeister may overestimate how many men are like him.


4. Ideological Versus Scientific Reasoning

Baumeister’s reasoning resembles moral storytelling: good men, misunderstood by society, suffer for others. Science, by contrast, treats sex differences as empirical questions about magnitude, mechanism, and context. Men are not inherently good or bad, but evolutionary theory explains why men are more likely than women to commit extreme antisocial acts such as rape and murder. This is one of the strongest sex differences that have been scientifically documented (Archer, 2019). These differences at the extremes exist because small differences in mean levels of aggression and selfishness can produce large differences in the tails of a trait distribution. Toxic masculinity is real, but it is limited to a small number of toxic males.
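The claim that small mean differences produce large differences at the extremes follows from the geometry of normal distributions, as the sketch below illustrates. The mean difference in aggression used here (d = 0.5) is an illustrative assumption, not a meta-analytic estimate.

```python
from scipy.stats import norm

d = 0.5          # assumed male-female mean difference in aggression (standard deviation units)
threshold = 3.0  # an extreme level of aggression, 3 SD above the female mean

p_female = norm.sf(threshold)     # probability of exceeding the threshold for women
p_male = norm.sf(threshold - d)   # same threshold, evaluated under the shifted male distribution

print(f"male/female ratio at the extreme: {p_male / p_female:.1f}")  # ≈ 4.6
# A modest mean shift of half a standard deviation makes extreme aggression
# several times more common among men, even though most men are not extreme.
```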


5. Scientifically False Claims

The book makes many scientifically false claims that are ideologically motivated and risk normalizing or excusing abusive behavior.

1. “Research has suggested that most women have said ‘no’ when they meant ‘yes’ at least occasionally, which introduces a further element of confusion to even the most well-intentioned young man.”

Truth: Baumeister misrepresents the original study (Muehlenhard & Hollabaugh, 1988), which found that 39 percent of college women reported ever engaging in token resistance—not “most.” Later research shows this behavior is rare, context-dependent, and declining with improved sexual-education and consent norms (Humphreys, 2004). In contrast, sexual aggression is one of the largest documented sex differences: men are far more likely to be offenders and women to be victims (Archer, 2019). Baumeister’s framing inverts this reality.

2. Baumeister: “women are plenty aggressive—if anything, more violent than men.”

Truth: A meta-analysis of heterosexual partner aggression finds d ≈ –0.05 for act frequency, meaning women report slightly more minor acts—but men cause far more serious injuries (Archer, 2000). Across all forms of violence, the difference reverses dramatically: men commit the vast majority of homicides and serious assaults worldwide (Archer, 2019). Baumeister’s claim ignores the scale and severity of male violence and misrepresents the empirical record.

3. Baumeister “From the unfeeling perspective of the system, it could be worth it to restrict female access to education.” (p. 209)

Truth: Every cross-national dataset shows the opposite: female education increases social stability, child survival, and economic growth (UNESCO, 2019; World Bank, 2020). There is no conceivable “systemic advantage” to restricting women’s education—historically or evolutionarily. This statement is not only unsupported but directly contradicted by global evidence.

4. Baumeister: “After witnessing childbirth, many men find their wives sexually disgusting and thus cheat.” (pp. 246–247)

Truth: No scientific data link childbirth observation to marital infidelity. Longitudinal studies show that relationship satisfaction and communication, not childbirth disgust, predict sexual desire and fidelity (Lawson & Mullett, 2018). Baumeister’s anecdote pathologizes normal experiences of fatherhood without evidence.

5. Baumeister: There was and is no oppression of women; patriarchy is a conspiracy theory.

Truth: “Patriarchy” in social science refers to structural male advantage, not a secret male conspiracy. Historical and economic research documents centuries of legal, educational, and occupational exclusion of women (Goldin, 1990; England, 2010). Dismissing these constraints as myth denies overwhelming empirical documentation.

6. Baumeister: “Men are exploited by society; progress depends on male expendability.”

Truth: Men historically faced higher mortality in war and dangerous work, but these risks were tightly linked to male political and economic power. Men had the benefit of minimal investment in their reproductive success, while leaving women with the risk and costs of childbirth and child rearing. Baumeister’s framing also ignores that men were exploited primarily by other men, not by women.


6. Ideological Consequences

Research confirms that exposure to Baumeister’s own Sexual Economics Theory—which portrays sex as a female resource traded for male investment—can shape social attitudes.
Fetterolf & Rudman (2016) found that participants who viewed a video based on this theory endorsed more adversarial beliefs about heterosexual relationships, even after reading feminist rebuttals. This shows that ideas presented as neutral “science” can increase cynicism and hostility between the sexes.

Moreover, the book’s framing has been widely circulated in manosphere communities and cited on forums linked to misogynistic radicalization. In these contexts, Baumeister’s evolutionary language becomes moral ammunition, used to rationalize resentment toward women.
Such diffusion illustrates how ideological narratives dressed as science can travel far beyond academia.


7. Why Scientific Caution Matters

Scientific reasoning differs from ideological rhetoric in three ways:

  1. Falsifiability. Claims must be open to disconfirmation; Baumeister’s narrative is not.
  2. Updating. Science revises itself when evidence changes; ideology repeats itself even when data contradict it.
  3. Value neutrality. Science describes what is, not what ought to be. Moralizing about gender—positive or negative—distorts understanding.

In modern personality and evolutionary psychology, the consensus is clear:
Men and women evolved under shared pressures for cooperation, mutual dependency, and parental investment, not perpetual conflict or one-sided exploitation.


8. Conclusion I: Men and Women Evolved on Earth

Baumeister’s Is There Anything Good About Men? invites sympathy for men but mistakes ideological comfort for scientific truth. By glorifying masculine extremes and dismissing opposing evidence, it replaces inquiry with mythmaking.

The scientific picture that emerges from decades of research is subtler and more interesting.
Sex differences are real yet modest, biologically rooted yet culturally flexible. Both sexes show extraordinary variability, and both contributed to the survival of our species. Men and women did not evolve on separate planets; they evolved together, on Earth, as cooperative partners in a shared evolutionary story.


9. Conclusion II: Baumeister Lacks Scientific Credibility

Baumeister’s research record reveals a consistent pattern of selective evidence use—choosing studies that support his claims while ignoring or concealing results that do not.
His once-famous ego-depletion hypothesis—the idea that self-control operates like a limited resource—was based on publication-biased evidence.

Re-analyses of his own data show that the average effect size is close to zero once unpublished or failed studies are included (Schimmack, 2014, 2016, 2018, 2019, 2025). Meta-scientific investigations further document that his lab withheld null results, giving a misleading impression of robust support.

Baumeister himself admitted this practice in a personal email communication quoted by Schimmack:

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”

This admission confirms that his work exemplified the publication-bias culture that triggered psychology’s credibility crisis. Rather than using data to test hypotheses, Baumeister routinely used them to confirm preconceived beliefs—the same confirmatory pattern visible in Is There Anything Good About Men?

Scientific integrity requires falsifiability, transparency, and full reporting.
When these norms are ignored, claims cease to be scientific, even if they borrow the language of science. Authors who present untested opinions as empirical conclusions engage in narrative persuasion rather than data-driven inquiry—a form of writing closer to literature than to science.

Freedom of speech entitles Baumeister to publish ideological opinions, even offensive ones.
But academic freedom is different: it protects the search for truth through open, verifiable evidence. Baumeister’s gender arguments, like his ego-depletion studies, fail that test.
They are expressions of belief, not findings of science. The actual evidence shows not only that men and women are far more similar than his book suggests, but also that Baumeister’s own practices demonstrate a departure from scientific standards.


Key References

Archer, J. (2000). Sex differences in aggression between heterosexual partners: A meta-analytic review. Psychological Bulletin, 126(5), 651–680.
Archer, J. (2019). The reality and evolutionary significance of human psychological sex differences. Biological Reviews, 94(4), 1381–1415. https://doi.org/10.1111/brv.12507
Baumeister, R. F. (2010). Is There Anything Good About Men? Oxford University Press.
Popper, K. R. (1959). The logic of scientific discovery. London: Hutchinson. (Original work published 1934)
Schimmack, U. (2014). Roy Baumeister’s R-Index – Replicability-Index

Why Uri Simonsohn is a Jerk

Science is like an iceberg. The published record is only a fraction of the things that university-paid academics do. Some time ago, Brian Nosek dreamed about a scientific utopia of open science that would make the workings of academia more transparent, but all we got was preprints and some badges, which have apparently been rolled back. He was never interested in an open discussion about the IAT or the ethics of Project Implicit.

Other famous open science academics are also less open than you may think. Datacolada benefited from open data to find evidence of fraud. As there is no law that requires sharing data, the fraudsters must be kicking themselves for being foolish enough to share their data and lose millions and their reputations.

Uri Simonsohn of Datacolada, with a few fraud scalps on his belt, is also not eager to embrace all aspects of open science. The Datacolada blog does not even have a comment section that allows sharing of alternative viewpoints or – oh my god – criticism of the work. That is worse than an old-school journal with peer review, which would at least allow some critical comments to be published to maintain the image of being science. DataColada: Not Open – Not Science.

I have fully embraced open science. My blog has comment sections, and I have even revised errors that people have pointed out in my blog posts. Living in utopia, I have also shared emails that the authors wanted to hide (like my exchange with Uri about the poor performance of p-curve when data are heterogeneous) or anonymous peer reviews that show how bias rather than scientific criteria dictates what gets shared in journals.

This post was motivated by a search in my email inbox for a link to a book I had ordered.

Instead, I found an email exchange from 2014 between Greg Francis and Uri Simonsohn about a DataColada post about the Test of Excess Significance, which Greg Francis uses to reveal publication bias and which is similar to the incredibility index (Schimmack, 2012).

[24] P-curve vs. Excessive Significance Test – Data Colada

Uri writes with typical intellectual humility “In this post I use data from the Many-Labs replication project to contrast the (pointless) inferences one arrives at using the Excessive Significant Test, with the (critically important) inferences one arrives at with p-curve.”
Translation: I am great, others do dumb shit.

What is the problem with testing for publication bias? Well, we know that the null hypothesis of no bias is false. So we do not need a test of the hypothesis that there is no bias.

What makes p-curve great? It tests the null hypothesis that all results are false positives, and in 2014 Uri probably believed that this scenario is common and that we need p-curve to reveal it. However, ten years later, p-curve has mostly just rejected the silly and uninformative null hypothesis that all results are false positives, p < .05.
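The logic of that null hypothesis can be made explicit. If every significant result were a false positive, the published p-values below .05 would be uniformly distributed between 0 and .05, and a combination test on p/.05 would behave like a test of uniformity. The sketch below uses Fisher’s method for illustration; it follows the same logic as p-curve but is not the exact published implementation.

```python
import numpy as np
from scipy.stats import chi2

def right_skew_test(p_values):
    """Test the null that all significant results are false positives,
    i.e., that p-values below .05 are uniform on (0, .05)."""
    p = np.asarray(p_values)
    pp = p[p < .05] / .05            # rescale to (0, 1); uniform under the null
    stat = -2 * np.sum(np.log(pp))   # Fisher's method: chi-square with 2k df under the null
    return chi2.sf(stat, 2 * len(pp))

# A handful of hypothetical significant p-values clustered near zero
print(right_skew_test([0.001, 0.002, 0.005, 0.01, 0.02]))  # ≈ .008: reject "all false positives"
```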

Ten years later, nobody should care about TES or p-curve results that test extreme and unlikely hypotheses. Instead, we should quantify how much publication bias there is and how much evidence against the null hypothesis the data contain. Neither TES nor p-curve does this well, and we developed a method that does (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). So, I am great and Uri sucks, but you don’t have to take my word for it. He never bothered to engage in constructive criticism or try to show that Jerry Brunner’s critique of p-curve is false, although the comment section of my blog allows for it. He is not interested in open science because he cannot win a scientific argument. Hence, no comments are allowed when he defends p-curve with silly simulation results.

Anyhow, for the sake of open science, here is the email exchange from 2014 that I found.

—————————————————————————-

What a jerk!

Greg

On 27 Jun 2014, at 04:59 pm, Simonsohn, Uri <uws@wharton.upenn.edu> wrote:

Hi Greg,

Thanks for your email.
The policy is to contact authors whose research we discuss. I do not discuss research you conducted, so I did not contact you.

One could extend the policy to contacting everybody whose work is related to the post, but that would be impractical, I would have needed to contact Kahneman, Klein et at, Ioannidins & Trikalinos, Uli and you, and presumably the people whose work you have analyzed via EST and perhaps even the OSF people. Or perhaps extend the policy to contacting anybody who is likely to disagree with the post. Similarly impractical.

Looking at the comments you sent via email, note how you don’t need to refer to any paper you have written to make your arguments, they are based exclusively on new analyses I run on data you had never analyzed before. That indicates to me the post is separate from your past work.  

When I wrote a post about Bayesian analysis (http://datacolada.org/2014/01/13/13-posterior-hacking/) , I did not contact Bayesian statisticians like Kruschke or EJ. As in this case, I was talking about statistical tools they use, but not about analyses they have run, so our policy did not require me to contact them either. When we have written about replications we have not contacted Nosek.  
When I wrote about ceiling effects in one replication paper, I did not contact authors of other papers that may also have a ceiling effect, or other people who have talkeda bout ceiling effects in that paper, I only contacted the authors whose work I was directly discussing.

Now, if I write a post about analyses EJ runs, or a replication that Nosek does, then of course we will contact them.
If I write a post about your use of the EST in this or that paper, then of course I will contact you.

You may disagree with the policy,  but I thought it would be fair to share the rationale with you.

Thanks again,

Uri

—–Original Message—–
From: Gregory Francis [mailto:gfrancis@purdue.edu]
Sent: Friday, June 27, 2014 8:51 AM
To: Simonsohn, Uri
Cc: <uli.schimmack@utoronto.ca> Schimmack; Leif Nelson; Simmons, Joseph
Subject: Data Colada

Hi Uri,

I saw your Data Colada posting on the P-curve vs. the excessive significance test (http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/ ). I really don’t understand the motivation for this posting, and I think you misrepresented the TES (Test for Excess Significance- Ioannidis’ term).

In particular, you conclude that the inference from the TES is pointless because we know there are 5 studies not reported. Indeed, if you know some relevant studies were not reported (since you removed them!) then you are correct that there is no reason to run the TES.  I would suggest that the more interesting test for this set of data would be to include the 5 non-significant studies (since they were actually published). Running the TES then gives 0.9699841 (I quickly modified your code to include all published studies; I am pretty sure this correct).  The details are

Pooled d: 0.598
Observed number of significant studies: 31 Expected number of significant studies: 31.08
Chi-square: 0.0014159
p: 0.96998

So, the TES would not claim that there is anything amiss with the full set of 36 reported studies.

I also object to your argument that nobody publishes “all” findings. Taken broadly enough, the statement is true, but somewhat silly and naive. What the TES considers is whether the stated theoretical claims are consistent with the reported findings. For example, in the TES analysis of all 36 studies, the theoretical claims (a fixed effect size of d=0.598) is consistent with the reported frequency of rejecting the null. On the other hand, if we take just the 31 significant experiments, then the theoretical claim (a fixed effect size of d=0.629) is not consistent with the reported frequency of rejecting the null. One need not report all studies for consistency to hold, and if there are valid methodological reasons to not publish some studies then they should not be published. I have explained this to you many times, so I get the feeling you are being deliberately obtuse on this issue, which is a shame because you are confusing people and, in the long-run, undermining your own credibility.

I also think your post is misleading in a broader context.  The “about” section of Data Colada states::

“When discussing research by other authors we contact them before posting; we ask for suggestions to improve the post, and invite them to comment within the original blog post.”

Readers of your blog who believe you take the policy seriously should infer that Uli and I were shown a draft, asked for feedback, and given an opportunity to comment, which is not true. It is too late for you to follow parts one and two of your policy, but you can fix the third: allow Uli (if he wishes) and me to write a follow-up post on Data Colada that explains our views of the TES and p-curve analyses.  

Greg Francis

Professor of Psychological Sciences
Purdue University

How Bad is P-Curve Really and Why Should We Care?

P-curve was introduced a little over a decade ago by Uri Simonsohn, Leif D. Nelson, and Joseph P. Simmons (2014), the same team that later launched the DataColada blog. It is a selection-model approach designed to examine the evidential value of published findings when non-significant results are missing and estimates that ignore this selection are inflated by publication bias.


The Method’s Goal and Its Historical Context

Its statistical goal is to test the null hypothesis that all significant results are false positives. While methodologists had warned about this possibility (Rosenthal, 1979), it was long considered unlikely that large sets of studies could be published without any real effects. However, the DataColada team showed that it is relatively easy to produce significant results without real effects when data are p-hacked (Simmons, Nelson, & Simonsohn, 2011, “False-Positive Psychology,” Psychological Science). Awareness of inflated type-I error rates and replication failures then raised concerns that most published results might be false positives (Ioannidis, 2005).
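To see what this null hypothesis means in practice, here is a minimal simulation of my own (not the p-curve code, and with arbitrary choices of d = 0.5 and n = 20 per group): when all significant results are false positives, the significant p-values are spread roughly uniformly between 0 and .05, whereas a real effect makes them pile up near zero. This right skew is what p-curve tests for.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def significant_pvalues(d, n, k=100_000, alpha=0.05):
    """Simulate k two-sample t-tests (n per group, true effect d)
    and return the p-values of the significant results only."""
    x = rng.normal(0.0, 1.0, size=(k, n))
    y = rng.normal(d, 1.0, size=(k, n))
    p = stats.ttest_ind(y, x, axis=1).pvalue
    return p[p < alpha]

p_null = significant_pvalues(d=0.0, n=20)  # all effects are zero: only false positives
p_real = significant_pvalues(d=0.5, n=20)  # a real medium-sized effect

# Under the null, about 20% of significant p-values fall below .01 (uniform on 0-.05);
# with a real effect, clearly more than 20% do (a right-skewed p-curve).
print(round(np.mean(p_null < 0.01), 2))
print(round(np.mean(p_real < 0.01), 2))
```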


Applications and Limitations

Over the past decade, p-curve has been applied in numerous meta-analyses, and the typical conclusion is that the analyzed literature shows evidential value. However, this conclusion has a critical limitation: rejecting the null hypothesis that all results are false positives does not reveal how many results are false positives, how large the true effects are, or how much reported effect sizes are inflated by publication bias. The latest version of p-curve therefore adds an estimate of “power” to provide quantitative information about the amount of evidential value in a set of studies. This blog post examines the controversy surrounding this parameter of the p-curve model.


Scope of the Discussion

To be clear, the developers also introduced a version of p-curve for effect-size estimation, but this procedure has been used rarely and performs worse than alternative bias-correcting methods when credible nonsignificant evidence is available (see Carter et al., 2019). Consequently, the present discussion focuses on p-curve as a test of evidential value, as implemented in the public p-curve app, rather than as an estimator of effect magnitude.


The Current Debate

Morey and Davis-Stober (2025) published a formal critique in the Journal of the American Statistical Association (JASA) (see my earlier post, Rindex.08.08.25). Uri Simonsohn (2025) responded in a post on the DataColada blog (#129).

The key issue is how p-curve performs when power varies across studies (i.e., heterogeneity in power). Morey and Davis-Stober present a simulation with a true mean power of 66%, yet p-curve returns an estimate of 87%, a difference of 21 percentage points. Simonsohn shows simulations in which the bias is never larger than 5%.


Simulation Hacking

The controversy illustrates a broader methodological issue that might be called simulation hacking. Just as empirical researchers can obtain desired results through selective analyses (p-hacking), methodologists can shape conclusions by emphasizing simulation conditions where a method performs particularly well or poorly. This does not mean that the chosen scenarios are unrealistic; rather, it highlights that statistical procedures often perform differently across contexts. A method may be robust and informative for some purposes yet unreliable for others, depending on which assumptions the simulations accentuate.


Simulating Field-Wide Heterogeneity

Figure 1: Distribution of Effect Sizes in Morey and Davis-Stober’s Simulation

Morey and Davis-Stober (2025) simulated a distribution of true effect sizes that is shown in their Figure 1. This distribution is broadly consistent with the average effect sizes reported in psychology meta-analyses (Richard et al., 2003). Such a distribution can be used to simulate p-values from studies that test a wide variety of hypotheses with a wide variety of research designs, as in attempts to estimate the typical power of studies in psychology (e.g., Cohen, 1962; Schimmack, 2020; Soto & Schimmack, 2024). These conditions generate extreme heterogeneity in statistical power across studies. Morey and Davis-Stober’s analysis suggests that under such heterogeneity, p-curve will produce inflated estimates of average power.

A concrete example is provided by the Reproducibility Project (Open Science Collaboration, 2015). These data are especially informative because the outcomes of the replication studies offer an independent benchmark of the original studies’ power to produce significant results without selection bias. The observed replication rate implies an average true power of less than 40%. Schimmack (2025) analyzed the p-values of the original studies and obtained a p-curve power estimate of 91%, 95% CI = 86% to 94%.

If the replication outcomes were unknown, this p-curve result would incorrectly suggest that the high proportion of significant findings in psychology journals (Sterling et al., 1995) reflects genuinely high study power rather than publication bias or p-hacking. In conclusion, a tool that was developed in response to the replication crisis to reveal p-hacking would falsely suggest that power is high and p-hacking is rare.


Simulating Meta-Analyses of P-Hacked Literatures

Simonsohn (2025) simulated studies with relatively low power that never exceeds 80%. Examples like this can be found in meta-analyses of p-hacked literatures. For example, a recent p-curve analysis of 825 terror-management studies yielded a power estimate of only 25%, 95% CI = 21% to 29%. This finding implies that exact replications of these studies would produce at most about 30% significant results, a rate similar to the success rate in actual replication studies (Open Science Collaboration, 2015). An anecdote tells of a social psychologist who prided himself on a success rate of 1 out of 3 studies and compared it to baseball, where a 30% batting average is excellent.

The problem here is not that p-curve estimates are biased. Rather, the problem is that they are easily misinterpreted when heterogeneity in power is ignored. After all, p-curve does reject the null hypothesis that all studies are false positives. However, assuming that all studies have the same power also implies that there are no false positive results, contrary to Simmons et al.’s (2011) suspicion that false positives are common. P-curve simply does not provide information about the rate of false positives unless all significant results are false positives; the power estimate could be an average of false positives and true positives with high power.

Stay Calm: Use Z-Curve

There is no need to fight over p-curve because we have a better method, z-curve, that works with and without heterogeneity (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). When we developed z-curve, we compared it against alternative models and presented all simulations, even those in which p-curve performed a bit better with homogeneous data. The simulations showed that both methods have only a small bias when heterogeneity is small, but p-curve has a large bias when heterogeneity is large. So, we can simply use z-curve for all data.

Here is a simple example that shows how z-curve is superior to p-curve even when p-curve estimates are only slightly biased. The simulation uses 50% false positives and 50% true positives with 80% power. It is easy to see that we would expect .50 × .05 + .50 × .80 = .025 + .40 = .425, or 42.5%, significant results. This is the expected replication rate if the studies were replicated exactly without selection bias. P-curve calls this quantity power, but that term ignores that real data may contain false positives.
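This arithmetic can be checked with a quick sketch of my own (not p-curve or z-curve code), assuming a 50/50 mixture of null effects and true effects studied with 80% power:

```python
import numpy as np

rng = np.random.default_rng(42)
k = 1_000_000      # number of simulated studies
alpha = 0.05

# Half of the studies test a null effect (significant with probability alpha),
# half test a true effect with 80% power.
has_true_effect = rng.random(k) < 0.5
prob_significant = np.where(has_true_effect, 0.80, alpha)
significant = rng.random(k) < prob_significant

print(significant.mean())   # ~0.425, i.e., .50 * .05 + .50 * .80
```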

Figure 2: p-curve plot with power estimate

Consistent with Simonsohn’s claims, the bias in the p-curve estimate is small (p-curve estimate: 44% vs. true parameter: 42.5%), but p-curve does not tell us whether all studies have about 40% power or whether this value is an average across studies that vary in power and may even include false positives.

Z-curve’s estimate of the expected replication rate (ERR) is accurate (42%). More importantly, it also recognizes that the data are heterogeneous. A simple way to see this is that z-curve estimates a lower expected discovery rate (EDR) for all studies, including the non-significant results that are not reported. A discrepancy between the EDR and the ERR indicates heterogeneity because studies with higher power have a higher chance of ending up in the set of significant results.
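The selection mechanism behind this discrepancy can be demonstrated with a small sketch of my own (not the z-curve code), assuming for illustration that true power is spread uniformly between .05 and .95 across studies: the studies that reach significance have higher average power than the full set of studies that were run, so a quantity based only on significant studies (like the ERR) must exceed a quantity based on all studies (like the EDR).

```python
import numpy as np

rng = np.random.default_rng(7)
k = 1_000_000

# Heterogeneous power: each study's true power is drawn uniformly from .05 to .95.
power = rng.uniform(0.05, 0.95, size=k)
significant = rng.random(k) < power

mean_power_all = power.mean()               # analogue of the EDR (all studies)
mean_power_sig = power[significant].mean()  # analogue of the ERR (significant studies only)

print(round(mean_power_all, 2))   # ~0.50
print(round(mean_power_sig, 2))   # ~0.63: high-power studies are over-represented among significant results
```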

The EDR is estimated from the full range of z-values, including the non-significant results that are not reported (see the red dotted line in the z-curve plot). The EDR of 11% is incompatible with the observed discovery rate of 100% (only significant results are published). Even the upper limit of the confidence interval is only 18% (about 5 studies for each significant result). The p-curve power estimate cannot be used to evaluate publication bias, although p-curve is often incorrectly used as a test of publication bias.

Finally, the EDR can be used to estimate the false positive risk with a formula by Soric (1989). We know that the true percentage of false positives is 50%. The z-curve estimate is only 45%, but the 95% CI around this estimate is wide. Most troubling, the 1,000 simulated studies do not rule out the possibility that all studies are false positives (the 95% CI includes 100%). This is very different from the inference one might draw from the p-curve power estimate, which does not suggest a high rate of false positive results.
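Soric’s (1989) upper bound is simple enough to write down directly. The sketch below plugs in the rounded EDR estimate of 11% from this example and α = .05; because the reported EDR is rounded, the result (about 43%) is only close to, not identical with, the 45% estimate mentioned above.

```python
def soric_max_fdr(edr, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate,
    given the (expected) discovery rate and the significance level."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_max_fdr(0.11), 2))   # ~0.43: up to ~43% of significant results could be false positives
print(round(soric_max_fdr(0.18), 2))   # ~0.24: with a higher EDR, the maximum risk drops but remains substantial
```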

Z-curve also provides information about the expected discovery rate for different ranges of observed z-values (see the percentages below the x-axis of the z-curve plot). Results that are just significant (e.g., z = 2 to 2.5) are likely to include many false positives; in this range, the expected discovery rate is only about 27%.

By contrast, studies with larger z-values (e.g., z > 4) are almost certainly based on true effects and have an expected replication probability of around 80%. Z-curve slightly overestimates replicability for these high z-values, but the main point is that discovery rates are expected to vary dramatically due to heterogeneity in the probability of obtaining a significant result.


Conclusion

This blog post showed how silly it is to fight over p-curve with carefully selected simulation scenarios. P-curve makes the unrealistic assumption that studies are homogeneous. Z-curve avoids this assumption, models heterogeneity, and provides more information about the data than p-curve can. So, researchers can just use z-curve, and the performance of p-curve is no longer relevant. It is a bit like testing the assumption of equal variances in t-tests: we can simply use a version of the t-test (e.g., Welch’s test) that does not require this assumption.

It is clear why Simonsohn does not mention, in his DataColada blog post, a method that replaced p-curve several years ago, and why he does not allow comments that would alert readers to alternative methods. It is not clear why Morey and Davis-Stober criticize an obsolete method without mentioning that their criticisms have already been addressed by a better one. But then, who understands the childish games of academics who produce publications but not knowledge?

Unlike DataColada, my blog allows comments, and I welcome comments from DataColada, Morey, Davis-Stober, or anybody else.


The Mythical Marriage of Fisher and Neyman-Pearson

Preface

This post grew out of a long discussion with ChatGPT about Gerd Gigerenzer’s treatment of the history of statistics and its influence on psychology in his co-authored book The Empire of Chance (1989).

I actually found this book by chance, because ChatGPT recommended it during a literature search. Psychology now has an overwhelmingly journal-based culture, where articles appear online as PDFs and are rarely accompanied by physical books. I am old enough to remember browsing the shelves of real libraries—especially the magnificent stacks at the University of Illinois and the Roberts Library in Toronto—but I stopped doing so about fifteen years ago. Younger colleagues may never know that quiet pleasure.

So, it is not surprising that few psychologists have actually read The Empire of Chance. Fortunately, I was able to access it through my University of Toronto credentials. For most readers, however, it remains locked behind a paywall.

To explore Gigerenzer’s arguments more closely, I uploaded the relevant chapters to ChatGPT (since they are not freely available) and discussed the content in light of my broader research on the history of power, significance testing, and replicability.

This post summarizes our shared understanding of how statistical thinking entered psychology, and why we concluded that Gigerenzer’s famous claim that null-hypothesis significance testing (NHST) is a hybrid of Fisher and Neyman-Pearson is inaccurate. It isn’t a hybrid at all. It’s pure Fisher.

Neyman and Pearson’s framework never gained traction. Today, Neyman’s invention of confidence intervals dominates sound statistical inference because intervals avoid the problems of Fisher’s significance testing without the difficulties of implementing the Neyman-Pearson approach. So, we moved from Fisher to Neyman, and Neyman-Pearson was never really relevant to the way psychologists use statistics.

Introduction

For decades psychologists have been told that the way they analyze data—null-hypothesis significance testing—is a hybrid of two rival statistical philosophies: Fisher’s significance test and the Neyman-Pearson decision framework.

Gigerenzer popularized this story in The Empire of Chance (1989), arguing that textbooks merged the two systems and gave the illusion of harmony. It’s a neat narrative—but it doesn’t survive close inspection.


1 · Fisher’s significance test

1️⃣ Make a prediction or explore whether two variables are related.
2️⃣ Collect data and compute a p-value assuming no relation (H₀).
3️⃣ If p is small enough, reject H₀ and claim support for the expected directional effect.
4️⃣ As Fisher wrote in 1935, “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (The Design of Experiments, p. 16).

This deceptively simple procedure made inference a one-sided game: we seek “disproof” of H₀, not testing of a specific H₁.
In practice, rejecting H₀ is treated as confirming our theory—verification dressed up as falsification.
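As a minimal sketch of my own (with arbitrary numbers, not a historical reconstruction), the asymmetry looks like this in code: a p-value is computed under H₀, and H₀ is either rejected or left in limbo, while no specific alternative ever enters the calculation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Two groups of 30 observations; the 'treatment' group has a modest true shift.
control = rng.normal(0.0, 1.0, 30)
treatment = rng.normal(0.4, 1.0, 30)

t, p = stats.ttest_ind(treatment, control)

# Fisher-style inference: only H0 is on trial.
if p < 0.05:
    print(f"p = {p:.3f}: reject H0 and claim support for the predicted effect")
else:
    print(f"p = {p:.3f}: H0 is not rejected -- but it is not accepted either; nothing is concluded")
```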


2 · The Neyman-Pearson alternative

Neyman and Pearson proposed a symmetric system of two hypotheses, H₀ and H₁, each with defined long-run error rates.

  • H₀ can be rejected, but H₁ can also be rejected.
  • To do so we must specify a concrete alternative, e.g., d = 0.5, and design the study with known α and β.
  • A result can therefore falsify a risky prediction (rejecting d = .8 means the effect is smaller than “large”).
  • If both survive, we test again.

In this framework, power and Type II error are not afterthoughts—they’re the price of claiming evidence.
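What that design step involves can be sketched with a simple normal approximation (my own illustration, not anything taken from Neyman and Pearson’s papers): fix a concrete alternative such as d = 0.5, choose α = .05 and β = .20, and solve for the sample size at which both error rates are controlled.

```python
import math
from scipy import stats

def n_per_group(d, alpha=0.05, beta=0.20):
    """Per-group sample size for a two-sided, two-sample z-test (normal approximation)
    that controls alpha for H0: d = 0 and beta for the specific alternative d."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(1 - beta)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(d=0.5))   # 63 per group (the exact t-based answer is slightly larger)
```

With such a design, a non-significant result is not merely an absence of evidence: it allows the researcher to reject the specific alternative d = 0.5 with a known error rate of β = .20.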


3 · Why it never took root in psychology

Psychology kept Fisher’s asymmetry. Researchers learned to celebrate significant results and ignore non-significant ones. Gigerenzer claimed textbooks resolved the dispute by fusing both schools into a “hybrid model.” But the evidence tells a different story.


4 · Why the “hybrid” is a myth

1 · Fixed thresholds were Fisherian conveniences.
Before computers, tables listed critical values for .05, .01, and .001. Using them was a practical shortcut, not an adoption of Neyman-Pearson error control.
Reporting “p < .05” or adding ** for p < .01 continued Fisher’s graded-evidence tradition.

2 · Type II errors were rhetorical, not operational.
Textbooks mentioned them vaguely—“the probability of an error if H₀ is false”—but never linked them to a specific H₁ such as d = .5. β was seldom calculated or used.

3 · Power was rarely used for design or inference.
Even after Cohen (1962) called for power analysis, psychologists mostly ignored power or treated it only as planning advice for achieving significance, not as a way to quantify Type II errors in inferences that reject a specific H₁.

4 · In practice, nothing changed.
Studies were published when p < .05 and forgotten when p > .05. Journal success rates of over 90% reflect a one-sided testing culture, not a balanced decision framework.


5 · The broader context

Other social sciences followed different paths. Economists and sociologists, working with large samples and directly measurable variables, emphasized estimation and precision—effect sizes, standard errors, and confidence intervals. They had little interest in either Fisher’s or Neyman-Pearson’s philosophies, although interpretation of results was also influenced by significance thresholds.

Ironically, Neyman’s own (1937) invention of the confidence interval would have solved psychology’s dilemma: a CI simultaneously rejects every H₀ and H₁ value that falls outside it, without requiring any of them to be pre-specified. Gigerenzer does not mention this modern hybrid of significance testing, in which checking whether 0 lies inside or outside the confidence interval replaces Fisher’s significance test.
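A short sketch of my own (a two-sample design with an arbitrary true effect of d = 0.3 and the usual large-sample approximation for the standard error of d) shows this dual use: every value outside the 95% interval is rejected, so a single CI can test H₀: d = 0 and a risky alternative such as H₁: d = 0.8 at the same time.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100
control = rng.normal(0.0, 1.0, n)
treatment = rng.normal(0.3, 1.0, n)

# Standardized mean difference and an approximate 95% CI for it.
d = (treatment.mean() - control.mean()) / np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
se = np.sqrt(2 / n + d**2 / (4 * n))   # large-sample standard error of d
lo, hi = d - 1.96 * se, d + 1.96 * se

print(f"d = {d:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
print("reject H0 (d = 0):  ", not (lo <= 0.0 <= hi))
print("reject H1 (d = 0.8):", not (lo <= 0.8 <= hi))
```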


6 · Conclusion

The so-called hybrid of Fisher and Neyman-Pearson is a myth.

Psychology adopted Fisher’s one-sided test with a conventional publishing threshold of p < .05 and never implemented the symmetrical logic of Neyman-Pearson decisions.

Even Cohen’s power analysis was absorbed into the same framework—another tool for ensuring significance, not for falsifying theoretical claims.

What Gigerenzer described as a marriage was never consummated.

Psychology has lived for nearly a century with Fisher alone, and is now replacing it with Neyman’s confidence intervals.

The marriage of Fisher and Neyman-Pearson never produced any children.


References

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186

Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver & Boyd.

Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed science and everyday life. Cambridge University Press.

Gigerenzer, G. (1993). The superego, the ego, and the id of statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum.

Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, 231, 289–337.

Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, 236, 333–380.

Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76(2), 105–110. https://doi.org/10.1037/h0031322