*This interview is fictional. I asked ChatGPT to play Data Colada based on information in their blog posts and articles. Any resemblance to answers by the real Data Colada team (Simmons, Nelson, & Simonsohn) is purely coincidental. The interview is long and is split into two parts.
Part 1:
- Origins of False-Positive Psychology
- Bem’s ESP and its influence
- How p-hacking might have worked in Bem’s case
- Pilot-dropping vs. pilot-hacking
- Study combination and effect-size patterns
Part 2 will cover:
- Selection bias vs. p-hacking in p-curve
- Flat vs. left-skewed distributions
- Limits of p-curve for diagnosing p-hacking
- Heterogeneity issues and power estimation
- z-curve comparison
- Reflections on the decade from 2011 to 2020s
Click here to read Part 1, which covers p-hacking strategies and how Bem p-hacked his ESP studies.
Part 2 – P-Curve, P-Hacking Detection, and the Rise of Z-Curve
Q: This brings up a new point. Later, you claimed that many researchers’ “file drawers” are pretty empty because they p-hacked each dataset into significance. That assumes a lot of p-hacking in each study and suggests people don’t pilot-drop much. But maybe file drawers are full of abandoned small studies. What do you think?
A: Both are plausible. Some researchers run one big study and p-hack it into significance; others run many small pilots, dropping the bad ones. Without access to actual file drawers, it’s hard to know the base rates.
Q: So both are plausible. Do we know which is more common? Your p-curve article was titled “A key to the file drawer.” So what did you find when you looked into researchers’ file drawers?
A: P-curve doesn’t literally open the file drawer; it infers what’s in there based on the pattern of significant results. It’s agnostic about whether the non-significant ones exist but are hidden, or never happened because the studies were hacked into significance.
Q: Well, you are being a bit modest here. P-curve makes a clear distinction between selection bias and p-hacking. You claim that selection bias does not need to be examined separately: either it is small because all studies were p-hacked into significance, or it can simply be assumed to be present. So p-curve only looks at the significant results and checks whether they were p-hacked or not. Right?
A: Right — p-curve is built to diagnose evidential value within the set of significant results, not to measure selection bias. The assumption is: if there’s no evidential value, those significant results are most likely artifacts of p-hacking (or extreme selection from nulls).
Q: So how does it detect p-hacking, and has it found a lot of it in meta-analyses?
A: P-curve looks at the distribution of statistically significant p-values. True effects produce a right-skewed curve (p-values pile up near zero, with many below .01), while pure p-hacking of null effects produces a left-skewed curve (p-values pile up just below .05).
In practice, left-skewed curves are rare; more often we see flat or modestly right-skewed curves, which suggests at least some evidential value, sometimes mixed with bias.
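To make the two signatures concrete, here is a minimal simulation sketch, not Data Colada’s own code. It assumes one-sample t-tests, an illustrative effect of d = 0.5 with n = 30 for the “true effect” studies, and optional stopping (peeking after every added subject) as the p-hacking strategy for the null studies.

```python
# Minimal simulation sketch (not Data Colada's code): the two p-curve signatures.
# Assumptions: one-sample t-tests, d = 0.5 and n = 30 for the "true effect"
# studies, and optional stopping (peek after every added subject) for the
# p-hacked null studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ALPHA, N_STUDIES = 0.05, 5000

def true_effect_p(d=0.5, n=30):
    """One honest t-test of a real effect of size d."""
    return stats.ttest_1samp(rng.normal(d, 1, n), 0).pvalue

def hacked_null_p(n_min=10, n_max=60):
    """True null with optional stopping: stop at the first p < .05."""
    x = list(rng.normal(0, 1, n_min))
    for _ in range(n_max - n_min):
        p = stats.ttest_1samp(x, 0).pvalue
        if p < ALPHA:
            return p
        x.append(rng.normal(0, 1))
    return stats.ttest_1samp(x, 0).pvalue

def curve(pvals):
    """Share of the significant p-values falling in .01-wide bins up to .05."""
    sig = np.array([p for p in pvals if p < ALPHA])
    hist, _ = np.histogram(sig, bins=np.arange(0, 0.051, 0.01))
    return np.round(hist / hist.sum(), 2)

print("true effects (right-skew):", curve([true_effect_p() for _ in range(N_STUDIES)]))
print("hacked nulls (left-skew): ", curve([hacked_null_p() for _ in range(N_STUDIES)]))
```

The first curve piles up in the lowest bin; the second piles up just below .05.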
Q: Ok, but there’s an important difference between flat and left-skewed distributions. Correct?
A: Correct. Left-skew = p-hacking without true effects. Flat = could be selection bias from true nulls, or a mix of p-hacking and true effects that cancel out in the skew.
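The first of those two flat-curve stories is easy to verify in the same style of sketch (hypothetical numbers again): honest tests of true nulls give uniform p-values, so keeping only the significant ones leaves roughly 20% of them in each .01-wide bin.

```python
# Pure selection bias on true nulls: honest tests, but only p < .05 gets reported.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pvals = [stats.ttest_1samp(rng.normal(0, 1, 30), 0).pvalue for _ in range(20000)]
sig = np.array([p for p in pvals if p < 0.05])
hist, _ = np.histogram(sig, bins=np.arange(0, 0.051, 0.01))
print(np.round(hist / hist.sum(), 2))   # close to [0.2, 0.2, 0.2, 0.2, 0.2]
```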
Q: So how often have you found left-skewed distributions?
A: Rarely. That’s partly because pure-null research programs are rare, and partly because most p-hacking is done on at least some true effects.
Q: As I recall, Bem’s data do not show a left-skewed distribution.
A: Correct — they don’t. And that’s an important limitation: if you p-hack on real effects, the real effects dominate the shape.
Q: Doesn’t that mean p-hacking is less common than the False-Positive Psychology paper suggested?
A: It means p-curve is better at detecting “pure” p-hacking on nulls than subtle bias on real effects. It doesn’t mean p-hacking isn’t common; it means one specific signature of it is uncommon.
Q: Let me see if I get this right. P-curve can detect p-hacking when researchers p-hack true nulls, but not when real effects are present.
A: Exactly.
Q: But when the p-curve is flat, we also conclude that the null is true, and that implies huge selection bias. That seems implausible.
A: It’s possible — but yes, if the selection rate needed is absurd, then something else is going on, like a mix of nulls, small effects, and p-hacking.
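A quick back-of-the-envelope sketch shows why the required selection rate gets absurd (the count of ten significant studies is hypothetical): each significant result from a true null tested at α = .05 represents about 20 attempts on average.

```python
# Hypothetical file-drawer arithmetic: if a flat p-curve is explained entirely by
# selection from true nulls, how many unreported studies does that imply?
alpha = 0.05
k_significant = 10                         # hypothetical published significant studies
expected_attempts = k_significant / alpha  # on average 1/alpha attempts per hit
print(f"~{expected_attempts:.0f} attempts, ~{expected_attempts - k_significant:.0f} unreported null results")
```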
Q: Couldn’t a flat p-curve also come from a lot of p-hacking plus some evidential value, canceling each other out?
A: Yes — that’s exactly the kind of ambiguity that makes interpretation tricky.
Q: So the real benefit of p-curve is to show that a set of studies has evidential value?
A: Yes. Its sweet spot is showing that most of the significant results are not just false positives.
Q: Which of your statistical tests for p-curve do you consider best?
A: They all have strengths, but the power estimate is the most interpretable when assumptions are met.
Q: Doesn’t your estimate of power give us the proportion of false positives indirectly?
A: In a homogeneous set, yes. But if the set is heterogeneous — say, half the studies with high power and half with low — the average can mask that mixture.
Q: Some critics claim p-curve overestimates power when studies are heterogeneous. Do you agree?
A: Yes — in hindsight, we underestimated how much heterogeneity could inflate our estimates.
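The following sketch illustrates that inflation analytically. It is not p-curve’s actual estimator: it stands in a simpler single-effect fit that matches the mean significant z-score, treats studies as z-tests, and assumes an illustrative 50/50 mix of weak (noncentrality 1.0) and strong (noncentrality 3.5) studies.

```python
# Why ignoring heterogeneity can inflate a power estimate: fit one "average"
# effect to a mixture and compare the implied power with the true average
# power of the studies that reached significance. (Illustrative stand-in, not
# the p-curve app's algorithm.)
import numpy as np
from scipy import stats, optimize

CRIT = stats.norm.ppf(0.975)   # 1.96, two-sided alpha = .05

def power(delta):
    """Power of a z-test with noncentrality delta (positive direction only)."""
    return stats.norm.sf(CRIT - delta)

def mean_sig_z(delta):
    """Expected z-score among significant results (truncated-normal mean)."""
    return delta + stats.norm.pdf(CRIT - delta) / stats.norm.sf(CRIT - delta)

# Hypothetical heterogeneous literature: half weak, half strong studies.
deltas, weights = np.array([1.0, 3.5]), np.array([0.5, 0.5])

# Each kind's share of the significant record is proportional to its power.
sel = weights * power(deltas)
share = sel / sel.sum()

true_avg_power = float(share @ power(deltas))
observed_mean_z = float(share @ mean_sig_z(deltas))

# Homogeneous fit: one noncentrality that reproduces the observed mean
# significant z-score, and the power it implies.
delta_hat = optimize.brentq(lambda d: mean_sig_z(d) - observed_mean_z, 0.0, 10.0)

print(f"true average power of the significant studies: {true_avg_power:.2f}")   # ~0.82
print(f"power implied by the homogeneous fit:          {power(delta_hat):.2f}") # ~0.90
```

With these illustrative numbers the true average power of the significant studies is about .82, while the single-effect fit implies about .90, so a homogeneous summary both hides the mixture and overstates the power.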
Q: Is this why you never wrote p-curve 2.0 — because z-curve does it better?
A: In part. Z-curve handles heterogeneity more gracefully, and we see it as a natural evolution in the same family of tools.
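For readers who have not used z-curve, here is a minimal sketch of its starting point as I understand it (the p-values are hypothetical): each two-sided p-value is converted back into an absolute z-score, and the significant z-scores are then modeled as a mixture of effects rather than a single effect. The mixture fitting itself is omitted here.

```python
# Recover absolute z-scores from two-sided p-values, the input z-curve works with.
from scipy import stats

pvalues = [0.049, 0.020, 0.003, 0.0004]              # hypothetical published results
zscores = [stats.norm.isf(p / 2) for p in pvalues]   # z = Phi^-1(1 - p/2)
print([round(z, 2) for z in zscores])                # approx. [1.97, 2.33, 2.97, 3.54]
```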
Q: Finally, looking back from 2011 to now, how do you see the journey?
A: The biggest change is cultural. Ten years ago, questioning methods was edgy; now it’s expected. Tools like p-curve and z-curve are part of that, but the real shift is that “show me the evidence” now applies to the evidence itself. That’s the healthiest sign we could ask for.