A Decade After False Positive Psychology: What Have We Learned About P-Hacking and the False Positive Risk in Psychology?

Keywords: False Positive Psychology, p-hacking, false discovery rate, preregistration, Registered Reports, p-curve, z-curve.


False Positive Psychology: Then and Now

In 2011, Simmons, Nelson, and Simonsohn shook the field of psychology with their False Positive Psychology article in Psychological Science.
Their simulations demonstrated that undisclosed researcher flexibility — now widely known as p-hacking — could inflate the nominal 5% Type I error rate to as high as 60% in worst-case scenarios.

The message was simple: a large share of “statistically significant” findings could be false positives, especially if researchers exploit analytical flexibility.
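A minimal simulation makes the mechanism concrete. This sketch covers just one of the flexibilities Simmons and colleagues studied — measuring two correlated outcomes and reporting whichever reaches significance — so the inflation here is modest compared with their worst-case 60%; the parameter values are illustrative assumptions, not figures from the original paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def phacked_trial(n=20, r=0.5):
    """One null study: two correlated DVs, report the smaller p-value."""
    cov = [[1, r], [r, 1]]
    # Both groups are drawn from the same distribution: the null is true.
    g1 = rng.multivariate_normal([0, 0], cov, size=n)
    g2 = rng.multivariate_normal([0, 0], cov, size=n)
    p1 = stats.ttest_ind(g1[:, 0], g2[:, 0]).pvalue
    p2 = stats.ttest_ind(g1[:, 1], g2[:, 1]).pvalue
    return min(p1, p2)

sims = 5000
fp_rate = np.mean([phacked_trial() < 0.05 for _ in range(sims)])
print(f"False positive rate with two DVs: {fp_rate:.3f}")
```

Even this single degree of freedom pushes the error rate noticeably above the nominal 5%; stacking several such choices (optional stopping, covariates, subgroup analyses) is what produces the dramatic inflation the original article reported.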

But over a decade later, important questions remain:

  • What is the actual false discovery rate (FDR) in psychology — the share of significant findings that are false positives?
  • Have reforms like preregistration badges, Registered Reports, or bias-detection tools such as p-curve and z-curve reduced that risk?

1. The false positive risk: still uncertain

Since 2011, several statistical methods have tried to estimate psychology’s false discovery rate (FDR):

  • P-curve analysis (Simonsohn et al.) examines the distribution of significant p-values to detect evidential value.
  • Z-curve (Schimmack & Brunner) models the distribution of test statistics to estimate average power, selection bias, and the maximum FDR.
  • Jager & Leek (2014) applied a related model of the p-value distribution to the medical literature, estimating an FDR of roughly 14%.
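The quantity all of these methods target can be illustrated with a back-of-the-envelope calculation. Under the standard model, the expected FDR depends on the base rate of true hypotheses and the average power to detect them (the input values below are assumptions chosen for illustration, not estimates from any of the papers above):

```python
def expected_fdr(prior_true, power, alpha=0.05):
    """Expected share of significant results that are false positives,
    given the proportion of tested hypotheses that are actually true
    and the average power to detect true effects."""
    false_pos = alpha * (1 - prior_true)   # true nulls that reach p < alpha
    true_pos = power * prior_true          # real effects that are detected
    return false_pos / (false_pos + true_pos)

# Illustrative scenarios: the FDR is highly sensitive to both inputs.
print(expected_fdr(prior_true=0.5, power=0.8))  # well-powered tests of plausible ideas
print(expected_fdr(prior_true=0.1, power=0.3))  # underpowered tests of long shots
```

The first scenario yields an FDR near 6%, the second near 60% — which is precisely why, absent solid knowledge of base rates and power, the field's actual FDR remains so hard to pin down.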

Despite these advances:

  • No consensus “official” FDR exists for psychology.
  • Estimates vary widely across subfields, methods, and datasets.
  • The potential for high FDR is clear, but the actual FDR remains debated.

2. Preregistration badges: more transparency, unclear effect on p-hacking

Badges for preregistration and open data — popularized in journals like Psychological Science — were intended to reduce p-hacking and improve reproducibility.

  • van den Akker et al. (2024) found that preregistered studies more often included power analyses and had slightly larger sample sizes.
  • However, preregistration quality was inconsistent. Many left flexibility in variables, exclusion rules, or analyses — meaning p-hacking could still occur.
  • Importantly, the rate of significant results did not drop substantially, suggesting no clear reduction in FDR.

Badges may change norms and signal transparency, but without enforcement, they are no guarantee against p-hacking.


3. Registered Reports: stronger evidence for reducing bias

By contrast, Registered Reports provide strong bias control:

  • The study plan is peer-reviewed and accepted before data collection.
  • Publication is guaranteed regardless of results.
  • Scheel et al. (2021) showed that Registered Reports in psychology reported positive results at less than half the rate of comparable standard articles (44% vs. 96%), consistent with a large reduction in publication bias and, by implication, in the inflated FDR of the standard literature.

4. Why p-hacking still matters even if the null is rarely strictly true

Some argue that because the null hypothesis is never exactly true, false positives — in the strict sense of rejecting a true null — are rare or even impossible.
However:

  • Psychology often tests effects so small they have no practical significance.
  • In these cases, p-hacking can still fill the literature with results that reject the null but have trivial effect sizes — misleading theory and application.
  • The real cost of p-hacking is inflated effect size estimates and unreliable directional claims.

5. Moving forward: z-curve, p-curve, and better publication models

  • P-curve can still be useful for detecting evidential value, but it performs poorly when there’s heterogeneity in effect sizes or sample sizes.
  • Z-curve extends p-curve logic, handling heterogeneity and providing estimates of average power, selection bias, and maximum FDR.
  • Expanding Registered Reports could provide structural protection against p-hacking, making badges more than a symbolic reform.
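The core p-curve intuition can be sketched in a few lines. If significant results reflect true effects, p-values should pile up near zero (right skew), so more than half of them should fall below .025; under a true null they are uniform on (0, .05). This is a simplified illustration of that logic, not the official p-curve app's full test, and the p-values below are made up:

```python
from scipy import stats

def right_skew_test(p_values, alpha=0.05):
    """Binomial sketch of p-curve's right-skew logic: how many
    significant p-values fall in the lower half of (0, alpha)?"""
    sig = [p for p in p_values if p < alpha]
    low = sum(p < alpha / 2 for p in sig)
    # One-sided binomial test against the uniform-null expectation of 50%.
    return stats.binomtest(low, len(sig), 0.5, alternative="greater").pvalue

# A right-skewed set (evidential value) vs. a flat one (consistent with null).
print(right_skew_test([0.001, 0.004, 0.01, 0.02, 0.03, 0.008, 0.002]))
print(right_skew_test([0.021, 0.033, 0.041, 0.026, 0.048, 0.038, 0.044]))
```

The limitation noted above also shows up here: when studies vary widely in power, the shape of the p-curve blends very different distributions, which is the heterogeneity problem z-curve was designed to handle.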

Bottom line

Fourteen years after False Positive Psychology, we know:

  1. P-hacking can dramatically inflate false positive risk.
  2. The actual FDR in psychology remains uncertain.
  3. Badges improve transparency but have not clearly reduced FDR.
  4. Registered Reports are the most effective reform so far.
  5. Tools like z-curve can quantify bias and FDR more accurately than earlier methods like p-curve.

The conversation must now shift from detecting p-hacking to changing the incentive structure so researchers have no reason to engage in it.

