Keywords: False Positive Psychology, p-hacking, false discovery rate, preregistration, Registered Reports, p-curve, z-curve.
False Positive Psychology: Then and Now
In 2011, Simmons, Nelson, and Simonsohn shook the field of psychology with their False Positive Psychology article in Psychological Science.
Their simulations demonstrated that undisclosed researcher flexibility — now widely known as p-hacking — could inflate the nominal 5% Type I error rate to as high as 60% in worst-case scenarios.
The message was simple: a large share of “statistically significant” findings could be false positives, especially if researchers exploit analytical flexibility.
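The mechanics are easy to reproduce. The sketch below is not the authors' original simulation; the specific flexibilities, sample sizes, and correlation are illustrative. It simulates a true null effect, then exploits two common degrees of freedom: choosing among correlated outcome measures and collecting more data after an interim peek.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n=20, rho=0.5):
    """One two-group study under a true null, with p-hacking:
    both groups come from the same population, yet the researcher
    has several chances to find p < .05."""
    cov = [[1, rho], [rho, 1]]  # two correlated dependent variables
    a = rng.multivariate_normal([0, 0], cov, size=n)
    b = rng.multivariate_normal([0, 0], cov, size=n)

    def any_significant(a, b):
        # Flexibility 1: test DV1, DV2, and their average; keep the best.
        ps = [stats.ttest_ind(a[:, 0], b[:, 0]).pvalue,
              stats.ttest_ind(a[:, 1], b[:, 1]).pvalue,
              stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue]
        return min(ps) < 0.05

    if any_significant(a, b):
        return True
    # Flexibility 2: peek, then add 10 participants per group and re-test.
    a2 = rng.multivariate_normal([0, 0], cov, size=10)
    b2 = rng.multivariate_normal([0, 0], cov, size=10)
    return any_significant(np.vstack([a, a2]), np.vstack([b, b2]))

hits = sum(one_study() for _ in range(2000)) / 2000
print(f"False positive rate under the null: {hits:.1%}")  # well above 5%
```

Even these two modest flexibilities push the error rate well past the nominal 5%; stacking more of them (covariates, subgroup analyses, outlier rules) is what drives it toward the worst-case figures Simmons and colleagues reported.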
But over a decade later, important questions remain:
- What is the actual false discovery rate (FDR) in psychology, i.e., the share of published significant findings that are false positives?
- Have reforms like preregistration badges, Registered Reports, or bias-detection tools such as p-curve and z-curve reduced that risk?
1. The false positive risk: still uncertain
Since 2011, several statistical methods have tried to estimate psychology’s false discovery rate (FDR):
- P-curve analysis (Simonsohn et al.) examines the distribution of significant p-values to detect evidential value.
- Z-curve (Schimmack & Brunner) models the distribution of test statistics to estimate average power, selection bias, and the maximum FDR.
- Jager & Leek (2014) used p-curve-like ideas in medicine, estimating an FDR around 14%.
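The core logic behind p-curve fits in a few lines: under a true null, significant p-values are uniformly distributed between 0 and .05, whereas real effects pile p-values up near zero (right skew). A minimal simulation, with illustrative parameters rather than any published analysis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sig_pvalues(effect, n=30, reps=20000):
    """p-values from two-group t-tests, keeping only p < .05
    (the published record under perfect selection for significance)."""
    a = rng.normal(effect, 1, (reps, n))
    b = rng.normal(0.0, 1, (reps, n))
    p = stats.ttest_ind(a, b, axis=1).pvalue
    return p[p < 0.05]

for d, label in [(0.0, "null true"), (0.5, "d = 0.5")]:
    p = sig_pvalues(d)
    # p-curve's skew diagnostic: share of significant p's below .025.
    # Roughly 50% indicates a flat (null-like) curve; clearly more
    # indicates right skew, i.e., evidential value.
    print(f"{label}: {np.mean(p < 0.025):.2f} of significant p's are < .025")
```

Z-curve starts from the same selected-for-significance record but converts p-values to z-scores and fits a mixture model, which is what lets it handle heterogeneous effect sizes.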
Despite these advances:
- No consensus “official” FDR exists for psychology.
- Estimates vary widely across subfields, methods, and datasets.
- The potential for high FDR is clear, but the actual FDR remains debated.
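One reason the actual FDR stays contested is that it depends on two quantities nobody observes directly: the proportion of tested hypotheses that are truly null and the average power of the tests. The standard textbook relation makes this explicit; the input values below are assumptions for illustration, not field estimates.

```python
def fdr(pi0, power, alpha=0.05):
    """Expected share of significant results that are false positives,
    given the proportion of true nulls (pi0) and average power."""
    false_pos = pi0 * alpha          # true nulls that reach p < alpha
    true_pos = (1 - pi0) * power     # real effects that reach p < alpha
    return false_pos / (false_pos + true_pos)

# Hypothetical research cultures, not estimates for psychology:
print(f"{fdr(0.5, 0.8):.1%}")  # 5.9%  -- many well-powered true effects
print(f"{fdr(0.9, 0.2):.1%}")  # 69.2% -- mostly nulls, low power
```

The same significance threshold yields an FDR anywhere from a few percent to a large majority depending on these unknowns, which is exactly why different methods and datasets produce such divergent estimates.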
2. Preregistration badges: more transparency, unclear effect on p-hacking
Badges for preregistration and open data — popularized in journals like Psychological Science — were intended to reduce p-hacking and improve reproducibility.
- van den Akker et al. (2024) found that preregistered studies more often included power analyses and had slightly larger sample sizes.
- However, preregistration quality was inconsistent: many protocols left flexibility in variable selection, exclusion criteria, or analysis choices, meaning p-hacking could still occur.
- Importantly, the rate of significant results did not drop substantially, suggesting no clear reduction in FDR.
Badges may change norms and signal transparency, but without enforcement, they are no guarantee against p-hacking.
3. Registered Reports: stronger evidence for reducing bias
By contrast, Registered Reports provide strong bias control:
- The study plan is peer-reviewed and accepted before data collection.
- Publication is guaranteed regardless of results.
- Scheel et al. (2021) showed that Registered Reports in psychology had roughly half the proportion of significant findings compared to standard articles (44% vs. 96%), consistent with a large drop in publication bias and inflated FDR.
4. Why p-hacking still matters even if the null is rarely strictly true
Some argue that because the null hypothesis is never exactly true, “false positives” are rare.
However:
- Psychology often tests effects so small they have no practical significance.
- In these cases, p-hacking can still fill the literature with results that reject the null but have trivial effect sizes — misleading theory and application.
- The real cost of p-hacking is inflated effect size estimates and unreliable directional claims.
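This point is easy to demonstrate: with a large enough sample, an effect far too small to matter still rejects the null. A quick simulation, where the effect size and sample size are arbitrary illustrations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# A real but negligible difference: true Cohen's d = 0.02.
n = 200_000
a = rng.normal(0.02, 1, n)
b = rng.normal(0.00, 1, n)

t, p = stats.ttest_ind(a, b)
# Observed standardized effect size (pooled-SD Cohen's d).
d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"p = {p:.2g}, Cohen's d = {d:.3f}")
```

With n this large, a d of roughly 0.02 is reliably "significant," so rejecting the null says nothing about whether the effect is big enough to matter. That is why inflated effect size estimates, not literal false positives, are the more defensible framing of the cost.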
5. Moving forward: z-curve, p-curve, and better publication models
- P-curve can still be useful for detecting evidential value, but it performs poorly when there’s heterogeneity in effect sizes or sample sizes.
- Z-curve extends p-curve logic, handling heterogeneity and providing estimates of average power, selection bias, and maximum FDR.
- Expanding Registered Reports could provide structural protection against p-hacking, making badges more than a symbolic reform.
Bottom line
Fourteen years after False Positive Psychology, we know:
- P-hacking can dramatically inflate false positive risk.
- The actual FDR in psychology remains uncertain.
- Badges improve transparency but have not clearly reduced FDR.
- Registered Reports are the most effective reform so far.
- Tools like z-curve can quantify bias and FDR more accurately than earlier methods like p-curve.
The conversation must now shift from detecting p-hacking to changing the incentive structure so researchers have no reason to engage in it.