*This interview is fictional. I asked ChatGPT to play Data Colada based on information in their blog posts and articles. Any resemblance to answers by the real Data Colada team (Simmons, Nelson, & Simonsohn) is purely coincidental. The interview is long and is split into two parts.
Part 1:
- Origins of False-Positive Psychology
- Bem’s ESP and its influence
- How p-hacking might have worked in Bem’s case
- Pilot-dropping vs. pilot-hacking
- Study combination and effect-size patterns
Part 2 will cover:
- Selection bias vs. p-hacking in p-curve
- Flat vs. left-skewed distributions
- Limits of p-curve for diagnosing p-hacking
- Heterogeneity issues and power estimation
- z-curve comparison
- Reflections on the decade from 2011 to 2020s
Part 1 – From the Beatles Making You Younger to Bem’s ESP
Q: Hello Data Colada, thank you for agreeing to this interview.
A: Thanks for having us. We should warn you: if you were expecting polished PR answers, you might want to reschedule. If you ask about bad practices, we’ll talk about them. If you ask about replication, we’ll show you the numbers. And if you ask whether something “probably works,” you’re going to get a simulation, a p-value distribution, and a long blog post in response.
Q: How did you get to know each other and decide to write the famous False-Positive Psychology paper?
A: The three of us — Joe Simmons, Leif Nelson, and Uri Simonsohn — were already friends and collaborators, but not in the “let’s start a revolution” sense. Around 2009–2010, psychology was in that awkward adolescence where we thought our methods were fine because, well, the work they produced kept getting published in top journals.
Then we started noticing something odd: you could get “significant” results out of almost anything if you were flexible enough with the data — try a few different DVs, stop collecting when the p-value looked good, drop a condition here, add a covariate there. None of this felt like fraud; it was just how people did research. But statistically, it was a recipe for nonsense.
One turning point was a real study we ran where the “findings” were… let’s say… improbable. That’s when the penny dropped: if we, trying to be careful, could accidentally p-hack our way to absurd conclusions, the field at large had a bigger problem.
So we decided to do two things:
- Demonstrate just how easy it was to get false positives — not in theory, but with real experiments (e.g., listening to the Beatles makes you younger).
- Quantify how researcher degrees of freedom inflated false-positive rates well beyond the magical 5% (a small simulation sketch along those lines follows below).
The False-Positive Psychology paper wasn’t meant to be famous; it was meant to be unignorable. We didn’t want to say “people cheat,” we wanted to say “the rules themselves guarantee bad science unless we change them.”
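To make that second point concrete, here is a minimal Python sketch of that kind of simulation. It is not the paper's actual code; it assumes only two of the many degrees of freedom (a pair of correlated DVs, either of which may be reported, plus one extra look at the data after adding observations), and the sample sizes and DV correlation are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def flexible_null_study(n_start=20, n_extra=10, r=0.5):
    """One two-group study with NO true effect, analyzed with two researcher
    degrees of freedom: (1) two correlated DVs, either of which may be
    reported; (2) one extra look at the data after adding observations."""
    cov = [[1.0, r], [r, 1.0]]
    a = rng.multivariate_normal([0, 0], cov, size=n_start)  # control group
    b = rng.multivariate_normal([0, 0], cov, size=n_start)  # "treatment" group
    for look in range(2):
        p_values = [stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(2)]
        if min(p_values) < .05:          # report whichever DV "worked"
            return True
        if look == 0:                    # not significant yet: collect more, peek again
            a = np.vstack([a, rng.multivariate_normal([0, 0], cov, size=n_extra)])
            b = np.vstack([b, rng.multivariate_normal([0, 0], cov, size=n_extra)])
    return False

sims = 10_000
rate = sum(flexible_null_study() for _ in range(sims)) / sims
print(f"False-positive rate with this flexibility: {rate:.3f} (nominal alpha = .05)")
```

Even this stripped-down version of flexibility pushes the false-positive rate well above the nominal 5%; combining the full menu of flexibilities, as the paper does, pushes it far higher.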
Q: Were you influenced by the publication of Bem’s preprint that showed ESP is real, at least for porn and extraverts? Was that the kind of nonsense you had in mind?
A: Oh, 100%. Bem’s paper was like the meteor that made a lot of dinosaurs in psychology realize the sky could actually fall.
We didn’t think Bem was uniquely bad — quite the opposite. He was a careful, respected researcher. His ESP studies were textbook social psych experiments… except for the part where the conclusion was impossible. That was the “Oh crap” moment.
If you can use the exact same tools we all use — the same sample sizes, the same flexible analyses, the same significance threshold — and “prove” that people can sense future porn images, then the tools are the problem.
Bem’s preprint basically did us a huge, awkward favor. It made the absurdity visible. When we showed that listening to “When I’m Sixty-Four” could make you younger, people could laugh it off as a gimmick. But ESP-for-porn was different — you couldn’t just roll your eyes and move on. It made the broader point painfully clear: the methods in play could produce convincing evidence for things that aren’t real.
Q: And as they say, “the rest is history.” The False-Positive Psychology paper probably would not have been published in Psychological Science without Bem’s article?
A: Yeah… that’s probably true. If Bem’s paper hadn’t hit the field like a thunderclap, our paper might have been one more “methods rant” that editors politely reject for being “too negative” or “not of broad interest.”
Before Bem, a lot of editors and reviewers would’ve shrugged and said, “Sure, p-hacking is possible, but does it really happen in good research?” After Bem, the answer was staring them in the face — and it was saying, “I can feel future erotic pictures.”
His paper shifted the Overton window. Suddenly it was acceptable, even necessary, to question the way we ran studies and analyzed data. That gave False-Positive Psychology a much smoother ride into Psychological Science. Instead of us trying to convince the field there was a crisis, the crisis had already RSVP’d and brought champagne.
So yes, in a weird way, Bem was our unintentional co-pilot. Without him, our Beatles-aging-you study might still be in some file drawer, right next to “things that will never see daylight.”
Q: Have you ever wondered how Bem actually p-hacked his studies? Not all of your simulated strategies would have worked there. In the erotic-pictures study, for example, the DV was button presses, so there were no multiple dependent variables to pick from. And the sample sizes were large and fairly consistent, so no collecting until p < .05 either.
A: Yes — we’ve wondered about that a lot. The “classic” False-Positive Psychology tricks — lots of DVs, optional stopping — don’t seem to be the main engine in some of Bem’s experiments. That’s part of why his data are so interesting: they force you to think beyond the obvious p-hacking menu.
There are still plenty of possible levers: trial and participant exclusions, covariates, condition filtering, moderator analyses (like extraversion), and, especially, research-program-level selection: publish the studies that work and shelve the rest.
And — important point — some of Bem’s p-values were just barely under .05. With enough experiments, even without aggressive hacking, chance alone will produce a few false positives. Combine that with any of the above, and you can end up with a long string of “significant” studies.
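For a rough sense of what chance alone can do, here is a back-of-the-envelope calculation (the study counts are hypothetical, not a claim about how many experiments Bem actually ran):

```python
# Probability of at least one nominally significant result (p < .05)
# across k independent tests of a true null effect.
for k in (5, 10, 20):
    print(f"{k:2d} null studies -> P(at least one 'hit') = {1 - 0.95 ** k:.2f}")
```

With twenty shots at a null effect, the odds of at least one “significant” result are better than even, before any flexibility enters the picture.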
Q: Bem actually mentions lots of “pilot studies.” In an interview he explained that he would carefully monitor study results from the start. This suggests that he quickly abandoned studies whose results were not promising, avoiding outright failures and capitalizing on promising early trends. Is there any evidence that this was his p-hacking strategy?
A: We’ve looked into that. On paper, it sounds like “pilot-dropping” — run a bunch of pilots, keep the ones that look promising, quietly bury the rest. But when you simulate that strategy, it actually tends to backfire. It doesn’t produce the kind of strong, consistent early effects you see in Bem’s published studies.
A better fit is what we call pilot-hacking: work the pilot until it delivers a significant result — tweak the analysis, tweak exclusions, adjust the procedure — then treat that as the starting point for your main study. This inflates the effect in the early sample, making it much easier to get something publishable at the end.
Pilot-hacking feels “honest” because you’re “just refining the study,” but statistically it’s p-hacking in disguise. And given Bem’s own account of closely tracking results from the very beginning, that pattern lines up a lot better than simple pilot-dropping.
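Here is a toy simulation contrasting the two strategies under a true null effect. It is not the actual Data Colada simulation; the pilot size (10 per cell), main sample (40 per cell), and the “promising” cutoff (observed d > 0.5) are arbitrary assumptions, and pilot-hacking is crudely approximated by retaining only pilots that already reach p < .05 in the predicted direction, standing in for tweaking a pilot until it “works.”

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def cohens_d(x, y):
    """Standardized mean difference with pooled SD."""
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    return (x.mean() - y.mean()) / pooled_sd

def one_attempt(n_pilot, n_main, keep_rule):
    """Rerun pilots on a true null effect until one satisfies keep_rule,
    then add the main sample and analyze pilot + main combined."""
    while True:
        a, b = rng.normal(0, 1, n_pilot), rng.normal(0, 1, n_pilot)
        if keep_rule(a, b):
            break
    a = np.concatenate([a, rng.normal(0, 1, n_main)])
    b = np.concatenate([b, rng.normal(0, 1, n_main)])
    hit = (stats.ttest_ind(a, b).pvalue < .05) and (a.mean() > b.mean())
    return hit, cohens_d(a[:n_pilot], b[:n_pilot]), cohens_d(a, b)

# "Pilot-dropping": keep any pilot that merely looks promising.
dropping = lambda a, b: cohens_d(a, b) > 0.5
# "Pilot-hacking" (approximated): keep only pilots already significant
# in the predicted direction.
hacking = lambda a, b: stats.ttest_ind(a, b).pvalue < .05 and a.mean() > b.mean()

for label, rule in [("pilot-dropping", dropping), ("pilot-hacking ", hacking)]:
    runs = [one_attempt(10, 40, rule) for _ in range(2000)]
    print(f"{label}: final hit rate = {np.mean([r[0] for r in runs]):.2f}, "
          f"mean pilot d = {np.mean([r[1] for r in runs]):.2f}, "
          f"mean combined d = {np.mean([r[2] for r in runs]):.2f}")
```

Both versions show the decline from the pilot segment to the combined data; the difference is that the stricter selection in pilot-hacking leaves enough inflation in the early sample to noticeably raise the odds of a publishable final result.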
Q: Wasn’t this strategy suggested by Ulrich Schimmack from R-Index? He showed that, when he analyzed the raw data, effect sizes decrease over the course of Bem’s studies. Doesn’t that suggest capitalizing on early promising results?
A: Yes, Schimmack’s re-analysis pointed in that direction. When you see effect sizes shrink from the early part of the study to the later part, it’s a telltale sign that the early data were unusually strong — maybe by chance, maybe by piloting until things “look good” — and then the subsequent data dilute it.
That’s not a smoking gun for pilot-hacking, but it’s consistent with it. And if you also find that this pattern repeats whenever data from multiple smaller studies are combined, that’s another clue. It means you may be seeing a series of “promising starts” stitched together into something that looks like one coherent study but is actually several cherry-picked segments.
Q: But how did Bem add hacking to the dropping? What hacking method could he have used in the prediction of erotic pictures in a simple experiment that shows them either on the left or right side of the screen?
A: Even with a binary DV like left/right choices, there’s room for flexibility: you can decide which trials to include, set cutoffs for participant inclusion, choose how to handle missing data, or add post-hoc moderators like personality traits. But we agree — it’s narrower than in many social psych studies.
In Bem’s case, our best guess is that pilot-hacking combined with selective reporting across studies was doing most of the work. That, plus normal sampling variability, can get you a surprising number of “hits” if you run enough experiments.
Q: Aren’t you reaching a bit? The responses were left/right, not reaction times. There’s no transformation for binary responses.
A: True — you can’t play the “log-transform to get significance” game here. But p-hacking isn’t just about math transforms. It can be about deciding which subset of trials or participants “count,” or adding moderators after the fact. And if the studies are part of a bigger research program, you can pick the ones that “work” and leave the rest in the drawer.
Q: So, you have no real evidence that Bem used any of your hacking strategies. Your response to Schimmack also ignores that Bem did something even more problematic, something many would not consider appropriate: combining data from different studies, with the pattern of early promising results repeating at the beginning of each subset. Wouldn’t pilot-dropping plus combining datasets be enough to get significance?
A: Yes, combining promising subsets can be enough. The math is simple: if you start each “mini-study” with a lucky streak, then stitch them together, you get an inflated overall effect. And because each segment is big enough to look like a “real” dataset, it can fly under the radar.
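The arithmetic can be sketched in a few lines of Python; the segment size, the “promising” cutoff, and the number of segments kept are all arbitrary assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def stitched_dataset(n_seg=20, k_keep=4, min_d=0.3):
    """Under a true null effect, run mini-studies, keep only those whose
    observed effect looks 'promising' (standardized difference > min_d in
    the predicted direction), and pool k_keep of them into one dataset."""
    kept_a, kept_b = [], []
    while len(kept_a) < k_keep:
        a, b = rng.normal(0, 1, n_seg), rng.normal(0, 1, n_seg)
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        if (a.mean() - b.mean()) / pooled_sd > min_d:  # promising: keep it
            kept_a.append(a)
            kept_b.append(b)
        # otherwise the segment is quietly dropped
    return np.concatenate(kept_a), np.concatenate(kept_b)

sims = 2000
hits = 0
for _ in range(sims):
    a, b = stitched_dataset()
    if stats.ttest_ind(a, b).pvalue < .05 and a.mean() > b.mean():
        hits += 1
print(f"'Significance' rate for the stitched data under the null: {hits / sims:.2f}")
```

Because every retained segment carries the same selection bias, pooling them does not average the bias away; it just supplies enough sample size to make that bias come out “significant.”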
Q: Can you explain in more detail why pilot-dropping alone is not sufficient?
A: If you drop a pilot after just a few trials because the effect looks small, that’s a safe bet statistically — you’re just saving time. The problem is, if you also keep pilots that look big, you bias your published set toward inflated effects.
The tricky part is that if you then keep collecting after that lucky start, the extra data will dilute the effect. That’s why pure “drop the bad pilots” doesn’t give you the consistently large effects you see in Bem’s work — you need something like pilot-hacking or stitching together short promising runs to maintain the inflation.
Q: Ok, but pilot-dropping is much more efficient than the optional-stopping p-hacking in your simulations.
A: Yes, optional stopping with small increments barely moves the false-positive needle compared to starting over with a fresh sample after a disappointing pilot. The latter is more efficient at producing publishable results because you’re essentially doubling your shots on goal.
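One way to see the difference is to give both strategies the same total data budget under a true null effect and count how often each ends up with at least one p < .05. The budget, starting sample, and step size below are arbitrary illustrative assumptions, not numbers from the original simulations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

BUDGET = 200            # total observations per cell the researcher will collect
N_START, N_STEP = 20, 10

def optional_stopping():
    """One long null study: peek after every N_STEP added observations
    per cell, declare victory at the first p < .05."""
    a, b = rng.normal(0, 1, N_START), rng.normal(0, 1, N_START)
    while True:
        if stats.ttest_ind(a, b).pvalue < .05:
            return True
        if len(a) >= BUDGET:
            return False
        a = np.concatenate([a, rng.normal(0, 1, N_STEP)])
        b = np.concatenate([b, rng.normal(0, 1, N_STEP)])

def fresh_restarts():
    """Spend the same budget on independent small null studies; a
    disappointing study is simply dropped and a new one is started."""
    for _ in range(BUDGET // N_START):
        a, b = rng.normal(0, 1, N_START), rng.normal(0, 1, N_START)
        if stats.ttest_ind(a, b).pvalue < .05:
            return True
    return False

sims = 2000
for label, strategy in [("optional stopping", optional_stopping),
                        ("fresh restarts   ", fresh_restarts)]:
    rate = sum(strategy() for _ in range(sims)) / sims
    print(f"{label}: P(at least one 'significant' result) = {rate:.2f}")
```

Under the same budget, independent restarts buy more fresh draws from the p-value lottery than repeatedly re-testing one accumulating sample, which is why the restart strategy is the more efficient route to “publishable” results.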
Click here to read Part 2, which will cover p-curve, z-curve, and the closing reflections.
Comment from Data Colada*
Thanks for this creative Q&A — it’s a strange and flattering experience to read an interview with ourselves that we didn’t actually give.
A few quick clarifications for readers:
1. **On p-hacking detection:** It’s true that p-curve is best at flagging “pure” p-hacking on nulls. But we wouldn’t say it’s *rarely* useful for diagnosing p-hacking. Even when there’s a mix of real and null effects, the shape of the curve can help quantify evidential value and constrain plausible amounts of bias.
2. **On heterogeneity:** We’ve always said p-curve works under heterogeneity — and in simulations, it does better than some critics think — but yes, extreme heterogeneity can bias power estimates upward. That’s one reason we like having multiple tools (p-curve, z-curve, selection models) in the meta-analytic toolbox.
3. **On z-curve:** We admire clever methods. Z-curve has interesting properties, especially in handling heterogeneity, but we don’t see it as “the better p-curve.” Different assumptions, different strengths.
4. **On pilot-dropping:** Guilty as charged — not of doing it, but of not simulating it enough in the original *False-Positive Psychology* paper. It’s a form of flexibility worth more attention.
We’ll take the ESP jokes in stride.
Thanks again for the thoughtful piece — it’s nice to see these issues getting discussed in detail, even if we didn’t write the answers ourselves.
Data Colada*