Interview with Data Colada*: From False-Positive Psychology to z-curve: Part 2

*This interview is fictional. I asked ChatGPT to play Data Colada based on information in their blog posts and articles. Any resemblance to answers by the real Data Colada team (Simmons, Nelson, & Simonsohn) is purely coincidental. The interview is long and is split into two parts.

Part 1:

  • Origins of False-Positive Psychology
  • Bem’s ESP and its influence
  • How p-hacking might have worked in Bem’s case
  • Pilot-dropping vs. pilot-hacking
  • Study combination and effect-size patterns

Part 2 covers:

  • Selection bias vs. p-hacking in p-curve
  • Flat vs. left-skewed distributions
  • Limits of p-curve for diagnosing p-hacking
  • Heterogeneity issues and power estimation
  • z-curve comparison
  • Reflections on the decade from 2011 to 2020s

Click here to read Part 1, which covers p-hacking strategies and how Bem p-hacked his ESP studies.

Part 2 – P-Curve, P-Hacking Detection, and the Rise of Z-Curve

Q: This brings up a new point. Later on, you claimed that many researchers’ “file drawers” are fairly empty because they p-hacked each dataset into significance. That assumes a lot of p-hacking within each study and suggests people don’t pilot-drop much. But maybe file drawers are full of abandoned small studies. What do you think?

A: Both are plausible. Some researchers run one big study and p-hack it into significance; others run many small pilots, dropping the bad ones. Without access to actual file drawers, it’s hard to know the base rates.


Q: So both are plausible. Do we know which is more common? Your p-curve article was titled “A key to the file drawer.” So what have you found when you looked into researchers’ file drawers?

A: P-curve doesn’t literally open the file drawer; it infers what’s in there based on the pattern of significant results. It’s agnostic about whether the non-significant ones exist but are hidden, or never happened because the studies were hacked into significance.


Q: Well, you are a bit modest here. P-curve makes a clear distinction between selection bias and p-hacking. You claim that selection bias does not need to be examined: either it is small because every study was p-hacked into significance, or it can simply be assumed to be present. So p-curve only looks at the significant results and checks whether they were p-hacked or not. Right?

A: Right — p-curve is built to diagnose evidential value within the set of significant results, not to measure selection bias. The assumption is: if there’s no evidential value, those significant results are most likely artifacts of p-hacking (or extreme selection from nulls).


Q: So how does it detect p-hacking, and has it found a lot of it in meta-analyses?

A: P-curve looks at the distribution of significant p-values. True effects produce a right skew (p-values piled up near zero, with many below .01), while pure p-hacking of null effects produces a left skew (p-values bunched just below .05).

In practice, left skews are rare; more often we see flat or modestly right-skewed curves, which suggest some evidential value, sometimes mixed with selection bias.
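
A small simulation makes the two signatures concrete (this is not Data Colada's code; the sample sizes and the optional-stopping rule are illustrative assumptions): studies of a true effect pile significant p-values up near zero, while optional stopping on a true null piles them up just below .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sig_p_true_effect(n_studies=2000, n=50, d=0.5):
    """Two-sample t-tests on a real effect (d = .5); keep only p < .05."""
    ps = []
    for _ in range(n_studies):
        p = stats.ttest_ind(rng.normal(0, 1, n), rng.normal(d, 1, n)).pvalue
        if p < .05:
            ps.append(p)
    return np.array(ps)

def sig_p_hacked_null(n_studies=2000, n_start=20, n_max=60):
    """Optional stopping on a true null: add 5 subjects per cell
    until p < .05 or the budget runs out; keep only the 'hits'."""
    ps = []
    for _ in range(n_studies):
        a = list(rng.normal(0, 1, n_start))
        b = list(rng.normal(0, 1, n_start))
        while True:
            p = stats.ttest_ind(a, b).pvalue
            if p < .05:
                ps.append(p)
                break
            if len(a) >= n_max:
                break
            a.extend(rng.normal(0, 1, 5))
            b.extend(rng.normal(0, 1, 5))
    return np.array(ps)

bins = [0, .01, .02, .03, .04, .05]
for label, ps in [("true effect", sig_p_true_effect()),
                  ("hacked null", sig_p_hacked_null())]:
    counts, _ = np.histogram(ps, bins)
    print(label, counts)  # right skew: falling counts; left skew: rising
```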


Q: Ok, but there’s an important difference between flat and left-skewed distributions. Correct?

A: Correct. A left skew means p-hacking without true effects. A flat curve could mean selection bias operating on true nulls, or a mix of p-hacking and true effects that cancel out in the skew.
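
The flat case is easy to verify: under a true null, p-values are Uniform(0, 1), so publishing only the significant ones leaves a uniform, that is flat, distribution on (0, .05). A two-line sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.uniform(0, 1, 200_000)       # p-values under a true null
sig = p[p < .05]                     # selection bias: keep only the hits
counts, _ = np.histogram(sig, [0, .01, .02, .03, .04, .05])
print(counts / counts.sum())         # ~[.2 .2 .2 .2 .2]: a flat p-curve
```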


Q: So how often have you found left-skewed distributions?

A: Rarely. That’s partly because pure-null research programs are rare, and partly because most p-hacking is applied to effects that are at least partly real.


Q: As I recall, Bem’s data do not show a left-skewed distribution.

A: Correct — they don’t. And that’s an important limitation: if you p-hack on real effects, the real effects dominate the shape.


Q: Doesn’t that mean p-hacking is less common than the False-Positive Psychology paper suggested?

A: It means p-curve is better at detecting “pure” p-hacking on nulls than subtle bias on real effects. It doesn’t mean p-hacking isn’t common; it means one specific signature of it is uncommon.


Q: Let me see if I get this right. P-curve can detect p-hacking when used on true nulls, but not when real effects are present.

A: Exactly.


Q: But when the p-curve is flat, the inference is again that the null is true, and that implies a huge amount of selection bias. That seems implausible.

A: It’s possible — but yes, if the selection rate needed is absurd, then something else is going on, like a mix of nulls, small effects, and p-hacking.
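
The arithmetic behind “absurd” is simple: if a literature consists of pure nulls tested at alpha = .05, every published hit implies about 19 unpublished misses.

```python
# File-drawer arithmetic for a literature of pure nulls at alpha = .05:
alpha = 0.05
misses_per_hit = (1 - alpha) / alpha
print(misses_per_hit)        # 19.0 unpublished studies per significant one
print(20 * misses_per_hit)   # a 20-study significant record implies ~380 misses
```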


Q: Couldn’t a flat p-curve also come from a lot of p-hacking plus some evidential value, canceling each other out?

A: Yes — that’s exactly the kind of ambiguity that makes interpretation tricky.


Q: So the real benefit of p-curve is to show that a set of studies has evidential value?

A: Yes. Its sweet spot is showing that most of the significant results are not just false positives.
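
The p-curve papers describe the full-curve test as combining “pp-values” with Stouffer’s method; here is a minimal sketch of that idea (the function name and example p-values are mine, not theirs):

```python
import numpy as np
from scipy.stats import norm

def pcurve_right_skew_test(p_sig):
    """Under the null of no evidential value, each significant p is
    Uniform(0, .05), so pp = p/.05 is Uniform(0, 1).  Stouffer-combine
    the pp-values; a very negative z indicates right skew, i.e. the
    significant results are not just chance-level false positives."""
    pp = np.asarray(p_sig) / 0.05
    z = norm.ppf(pp).sum() / np.sqrt(len(pp))
    return z, norm.cdf(z)          # one-sided p-value for right skew

print(pcurve_right_skew_test([.001, .002, .003, .01, .02]))  # z ~ -2.9
```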


Q: Which of your statistical tests for p-curve do you consider best?

A: They all have strengths, but the power estimate is the most interpretable when assumptions are met.


Q: Doesn’t your estimate of power give us the proportion of false positives indirectly?

A: In a homogeneous set, yes. But if the set is heterogeneous — say, half with high power, half with low — the average can mask that.
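
A hypothetical two-point mixture shows the masking: significant results oversample the high-powered studies, so the conditional average looks reassuring even when a non-trivial share of the hits are false positives.

```python
# Toy mixture: half the studies have 80% power, half test true nulls
# (their "power" is alpha = .05).  Among the significant results:
hi, lo = 0.80, 0.05
share_false  = lo / (lo + hi)                # ~ .059: ~6% of hits are false
mean_pow_sig = (hi**2 + lo**2) / (hi + lo)   # ~ .76: selection-weighted power
mean_pow_all = (hi + lo) / 2                 # .425: unconditional mean power
print(share_false, mean_pow_sig, mean_pow_all)
```

The gap between the conditional (~.76) and unconditional (.425) averages is the same selection effect behind the inflation critique raised in the next exchange.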


Q: Some critics claim p-curve overestimates power when studies are heterogeneous. Do you agree?

A: Yes — in hindsight, we underestimated how much heterogeneity could inflate our estimates.


Q: Is this why you never wrote p-curve 2.0 — because z-curve does it better?

A: In part. Z-curve handles heterogeneity more gracefully, and we see it as a natural evolution in the same family of tools.
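
As I understand it, z-curve converts the significant p-values to z-scores and fits a mixture of truncated normal components via EM, reading average power off the fitted weights. A stripped-down sketch under simplifying assumptions (fixed component means 0 to 6, sd = 1, one-sided truncation at 1.96, ignoring the sign-folding of the real method):

```python
import numpy as np
from scipy.stats import norm

def zcurve_sketch(p_sig, means=np.arange(7), n_iter=200):
    """Toy z-curve: fit mixture weights over normal components with
    fixed means to significant z-scores, truncated at the selection
    threshold, then return the implied mean power of those results."""
    z = norm.isf(np.asarray(p_sig) / 2)       # two-sided p -> |z|
    crit = norm.isf(0.025)                    # 1.96 selection threshold
    # density of each component, renormalized to the region z > crit
    dens = norm.pdf(z[:, None] - means) / norm.sf(crit - means)
    w = np.full(len(means), 1 / len(means))
    for _ in range(n_iter):                   # EM with fixed components
        resp = dens * w                       # E-step: responsibilities
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                 # M-step: update weights only
    power = norm.sf(crit - means)             # per-component power
    return float(w @ power)                   # weighted mean power

print(round(zcurve_sketch([.001, .003, .01, .02, .03, .045]), 2))
```

Because the component means span very low to very high power, the fitted weights can represent heterogeneous literatures that a single-power model cannot.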


Q: Finally, looking back from 2011 to now, how do you see the journey?

A: The biggest change is cultural. Ten years ago, questioning methods was edgy; now it’s expected. Tools like p-curve and z-curve are part of that, but the real shift is that “show me the evidence” now applies to the evidence itself. That’s the healthiest sign we could ask for.

