Pek Mate: Post-Study Evaluations of Power Are Possible and Necessary


Poor Undergraduate Education in Statistics

Psychologists are not famous for their quantitative chops. Many undergraduate students treat statistics courses as an ordeal to survive, not a skill to master. In North America especially, large psychology departments rely on large undergraduate programs for funding, which creates pressure to keep students happy. One way to do that is to offer “statistics-light” courses that avoid scaring off tuition-paying customers. As a result, most psychology majors graduate believing that the key to scientific discovery is to get p-values below .05, and that the results in their textbooks are trustworthy simply because they are “statistically significant.”

Graduate Training Isn’t Much Better

The situation doesn’t improve much in graduate school. Students learn how to point-and-click their way through statistical software, run analyses, and search for p-values below .05—this time with the added incentive that significant results mean publications for them and their advisors.

Over sixty years ago, Sterling (1959) warned that this obsession with p < .05 undermines the already limited value of p-values. Cohen (1994) memorably satirized the mindset with his title “The Earth is Round, p < .05.”

Sterling (1995) also documented a glaring oddity: psychology papers overwhelmingly report significant results—often 90% or more. That kind of hit rate would require researchers to be right almost all the time in their predictions. Given that most psychologists use two-sided tests (“Does X increase or decrease Y?”), they can claim a “hit” in either direction. The only way to be wrong is if the true effect is exactly zero—something many argue is rare.

But there’s another reason to be skeptical: to produce so many p < .05 results without fraud or extreme data-torturing, studies would need high statistical power. And here’s the rub—psychologists’ understanding of power is notoriously shaky. Even some quantitative psychologists, who should know better, end up publishing articles that confuse rather than clarify (see Pek et al., 2024).


What Power Actually Means

I’ve taught undergraduates about statistical power for over a decade, and AI can now spit out perfectly serviceable definitions in seconds. Here’s one example:

Statistical power is the probability that a statistical test will correctly reject the null hypothesis (H₀) when a specific alternative hypothesis (H₁) is true—in other words, the probability of detecting an effect if it exists. Power = 1 − β, where β is the Type II error rate.

Power depends on:
  • Effect size (the true magnitude of the effect)
  • Sample size (bigger samples, higher power)
  • Significance level α (the threshold for rejecting H₀)
  • Variability (less noise, more power)

Why care? High power reduces false negatives, makes non-significant results interpretable, supports replicability, and prevents wasted resources. Most important, power helps researchers obtain the desired p-values below .05. Why would you not plan studies to have a high probability of producing a significant result, as long as the effect is not exactly zero?


Power Is Unknown

The problem is that power depends on the true (population) effect size. We never actually know this value—we only estimate it from samples, and those estimates vary because of sampling error. This uncertainty is not a special flaw of power analysis; it’s the nature of empirical research.

When planning studies, researchers must guess an effect size, plug it into a formula, and get a required sample size—often aiming for the conventional 80% power. In practice, the sample size often comes from convenience (“How many participants can we realistically get?”) rather than from a solid prior estimate of the effect size, and sometimes power analyses are fudged: the sample size is fixed first and the effect size is chosen so that it yields 80% power.


What 80% Power Looks Like

If we truly designed studies with 80% power and guessed the effect size perfectly, the resulting test statistics (converted to z-values) would form a predictable distribution:

Figure 1. Distribution of z-values with 80% power. The dashed line marks the significance threshold (z = 1.96, p = .05), and the solid line marks the expected mean z for 80% power (z ≈ 2.8).

  • 20% fall below the significance cut-off (non-significant)
  • 30% are significant but below the expected mean
  • 50% exceed the expected mean

This breakdown is a theoretical benchmark. If actual published results deviate strongly from it—say, 90% significant results instead of 80%—we know that our hypothesis that all studies have 80% power is wrong. Either researchers consistently underestimated effect sizes (unlikely), or results were selectively reported to favor significance (more likely).
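A minimal R sketch of this benchmark, assuming a two-sided z-test with α = .05 and test statistics that are normally distributed (SD = 1) around the noncentrality value that yields 80% power:

ncp  <- qnorm(.80) + qnorm(.975)                    # expected mean z for 80% power, ~2.80
crit <- qnorm(.975)                                 # significance cut-off, 1.96
pnorm(crit, mean = ncp)                             # ~.20 non-significant
pnorm(ncp, mean = ncp) - pnorm(crit, mean = ncp)    # ~.30 significant but below the expected mean
1 - pnorm(ncp, mean = ncp)                          # .50 above the expected mean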


Why Pek et al. Are Wrong About “Only for Planning”

Pek et al. (2024) argue that power is fine for planning studies but should not be used to evaluate completed ones. This is like saying we can predict the distribution of test statistics under the null hypothesis and use it to interpret p-values, but that once we see our data, this prediction magically disappears. It doesn’t. The predicted distribution is exactly what we need to draw inferences from observed p-values: we compare the actual data to a hypothetical distribution to make inferences about the direction of an effect.

The same is true for hypotheses about power. If we assume 80% power, we can predict the shape of the sampling distribution (Figure 1) and compare it with actual published data. Deviations tell us whether the true power was likely higher, lower, or compromised by selective reporting.


A Simple Example

Consider Wegener & Petty (1996), Study 2, which had three independent samples. The key interaction effect produced z-values of 2.03, 1.99, and 2.53. All are significant but fall between 1.96 and 2.8.

With 80% power, there’s only about a 30% chance of landing in this range in a single study. Doing so three times in a row has a probability of about 0.027—well below the usual 0.05 cut-off. This suggests the results did not come from well-powered studies that happened to produce 100% significant results; they are more consistent with low power plus selective reporting of significant results.
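A quick R check of this arithmetic, under the same assumption as Figure 1 (80% power, so z-values centered at about 2.8 with SD = 1):

p_between <- pnorm(2.8, mean = 2.8) - pnorm(1.96, mean = 2.8)   # ~.30 per study
p_between^3                                                     # ~.027 for three studies in a row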


The Real Value of Power

Ironically, power is not that useful for planning studies—because effect sizes are unknown and easily fudged. Its real strength is as a diagnostic tool for evaluating the credibility of published results. If the distribution of test statistics doesn’t match what we’d expect from a plausible power level, we have reason to doubt the findings.

This doesn’t require knowing the exact “true” power—just as we can test the null hypothesis without already knowing the population effect size. We can make predictions based on hypotheses about power and use them to evaluate observed data. It simply does not follow that we can draw inferences from normal distributions centered at 0 but not from normal distributions centered at 2.8 or some other value; many psychologists confuse null-hypothesis testing with nil-hypothesis testing (Cohen, 1994).


Final Thoughts

The resistance to using power this way is puzzling. Maybe it’s a misunderstanding of the concept, maybe it’s a desire to shield the literature from uncomfortable scrutiny. Either way, learning about power is not just a technical exercise—it’s a tool for scientific self-defense. If you see a literature where most p-values cluster between .05 and .005, be skeptical. If they’re mostly below .005, you can have more confidence.

Understanding power gives you the ability to look beyond “statistically significant” and ask: Was this result ever likely to be true?


Interview with Data Colada*: From False-Positive Psychology to z-curve: Part 2

*This interview is fictional. I asked ChatGPT to play Data Colada based on information in their blog posts and articles. Any resemblance to answers by the real Data Colada team (Simmons, Nelson, & Simonsohn) is purely coincidental. The interview is long and covers two parts.

Part 1:

  • Origins of False-Positive Psychology
  • Bem’s ESP and its influence
  • How p-hacking might have worked in Bem’s case
  • Pilot-dropping vs. pilot-hacking
  • Study combination and effect-size patterns

Part 2 will cover:

  • Selection bias vs. p-hacking in p-curve
  • Flat vs. left-skewed distributions
  • Limits of p-curve for diagnosing p-hacking
  • Heterogeneity issues and power estimation
  • z-curve comparison
  • Reflections on the decade from 2011 to 2020s

Click here to read Part 1, which covers p-hacking strategies and how Bem p-hacked his ESP studies.

Part 2 – P-Curve, P-Hacking Detection, and the Rise of Z-Curve

Q: This brings up a new point. Later on you have claimed that many researchers’ “file drawers” are pretty empty because they p-hacked to get significance in each dataset. That assumes a lot of p-hacking in each study and suggests people don’t pilot-drop much. But maybe file drawers are full of abandoned small studies. What do you think?

A: Both are plausible. Some researchers run one big study and p-hack it into significance; others run many small pilots, dropping the bad ones. Without access to actual file drawers, it’s hard to know the base rates.


Q: So both are plausible. Do we know which is more common? Your p-curve article was titled “A key to the file drawer.” So what have you found when you looked into researchers’ file drawers?

A: P-curve doesn’t literally open the file drawer; it infers what’s in there based on the pattern of significant results. It’s agnostic about whether the non-significant ones exist but are hidden, or never happened because the studies were hacked into significance.


Q: Well, you are a bit modest here. P-curve makes a clear distinction between selection bias and p-hacking. You claim that selection bias is not necessary to examine because either it is small because all studies are p-hacked, or we can just assume it is present. So p-curve only looks at the significant results and checks whether they were p-hacked or not. Right?

A: Right — p-curve is built to diagnose evidential value within the set of significant results, not to measure selection bias. The assumption is: if there’s no evidential value, those significant results are most likely artifacts of p-hacking (or extreme selection from nulls).


Q: So how does it detect p-hacking, and has it found a lot of it in meta-analyses?

A: P-curve looks at the distribution of significant p-values. True effects produce a right-skew (many very small p-values, e.g., below .01), while pure p-hacking produces a left-skew (a pile-up of p-values just below .05).

In practice, left-skews are rare; more often we see flat or modest right-skews, which suggests some evidential value, sometimes mixed with bias.


Q: Ok, but there’s an important difference between flat and left-skewed distributions. Correct?

A: Correct. Left-skew = p-hacking without true effects. Flat = could be selection bias from true nulls, or a mix of p-hacking and true effects that cancel out in the skew.


Q: So how often have you found left-skewed distributions?

A: Rarely. That’s partly because pure-null research programs are rare, and partly because most p-hacking is done on at least some true effects.


Q: As I recall, Bem’s data do not show a left-skewed distribution.

A: Correct — they don’t. And that’s an important limitation: if you p-hack on real effects, the real effects dominate the shape.


Q: Doesn’t that mean p-hacking is less common than the False-Positive Psychology paper suggested?

A: It means p-curve is better at detecting “pure” p-hacking on nulls than subtle bias on real effects. It doesn’t mean p-hacking isn’t common; it means one specific signature of it is uncommon.


Q: Let me see if I get this right. P-curve can detect p-hacking when used on true nulls, but not when real effects are present.

A: Exactly.


Q: But when the p-curve is flat we also think the null is true, but that implies huge selection bias. That seems implausible.

A: It’s possible — but yes, if the selection rate needed is absurd, then something else is going on, like a mix of nulls, small effects, and p-hacking.


Q: Couldn’t a flat p-curve also come from a lot of p-hacking plus some evidential value, canceling each other out?

A: Yes — that’s exactly the kind of ambiguity that makes interpretation tricky.


Q: So the real benefit of p-curve is to show that a set of studies has evidential value?

A: Yes. Its sweet spot is showing that most of the significant results are not just false positives.


Q: Which of your statistical tests for p-curve do you consider best?

A: They all have strengths, but the power estimate is the most interpretable when assumptions are met.


Q: Doesn’t your estimate of power give us the proportion of false positives indirectly?

A: In a homogeneous set, yes. But if the set is heterogeneous — say, half with high power, half with low — the average can mask that.


Q: Some critics claim p-curve overestimates power when studies are heterogeneous. Do you agree?

A: Yes — in hindsight, we underestimated how much heterogeneity could inflate our estimates.


Q: Is this why you never wrote p-curve 2.0 — because z-curve does it better?

A: In part. Z-curve handles heterogeneity more gracefully, and we see it as a natural evolution in the same family of tools.


Q: Finally, looking back from 2011 to now, how do you see the journey?

A: The biggest change is cultural. Ten years ago, questioning methods was edgy; now it’s expected. Tools like p-curve and z-curve are part of that, but the real shift is that “show me the evidence” now applies to the evidence itself. That’s the healthiest sign we could ask for.


Interview with Data Colada*: From False-Positive Psychology to z-curve: Part 1

*This interview is fictional. I asked ChatGPT to play Data Colada based on information in their blog posts and articles. Any resemblance to answers by the real Data Colada team (Simmons, Nelson, & Simonsohn) is purely coincidental. The interview is long and covers two parts.

Part 1:

  • Origins of False-Positive Psychology
  • Bem’s ESP and its influence
  • How p-hacking might have worked in Bem’s case
  • Pilot-dropping vs. pilot-hacking
  • Study combination and effect-size patterns

Part 2 will cover:

  • Selection bias vs. p-hacking in p-curve
  • Flat vs. left-skewed distributions
  • Limits of p-curve for diagnosing p-hacking
  • Heterogeneity issues and power estimation
  • z-curve comparison
  • Reflections on the decade from 2011 to 2020s

Part 1 – From Beatles Making You Younger to Bem’s ESP

Q: Hello Data Colada, thank you for agreeing to this interview.

A: Thanks for having us. We should warn you: if you were expecting polished PR answers, you might want to reschedule. If you ask about bad practices, we’ll talk about them. If you ask about replication, we’ll show you the numbers. And if you ask whether something “probably works,” you’re going to get a simulation, a p-value distribution, and a long blog post in response.


Q: How did you get to know each other and decide to write the famous False-Positive Psychology paper?

A: The three of us — Joe Simmons, Leif Nelson, and Uri Simonsohn — were already friends and collaborators, but not in the “let’s start a revolution” sense. Around 2009–2010, psychology was in that awkward adolescence where we thought our methods were fine because, well, they were published in top journals.

Then we started noticing something odd: you could get “significant” results out of almost anything if you were flexible enough with the data — try a few different DVs, stop collecting when the p-value looked good, drop a condition here, add a covariate there. None of this felt like fraud; it was just how people did research. But statistically, it was a recipe for nonsense.

One turning point was a real study we ran where the “findings” were… let’s say… improbable. That’s when the penny dropped: if we, trying to be careful, could accidentally p-hack our way to absurd conclusions, the field at large had a bigger problem.

So we decided to do two things:

  1. Demonstrate just how easy it was to get false positives — not in theory, but with real experiments (e.g., listening to the Beatles makes you younger).
  2. Quantify how researcher degrees of freedom inflated false positive rates well beyond the magical 5%.

The False-Positive Psychology paper wasn’t meant to be famous; it was meant to be unignorable. We didn’t want to say “people cheat,” we wanted to say “the rules themselves guarantee bad science unless we change them.”


Q: Were you influenced by the publication of Bem’s preprint that showed ESP is real, at least for porn and extraverts? Was that the kind of nonsense you had in mind?

A: Oh, 100%. Bem’s paper was like the meteor that made a lot of dinosaurs in psychology realize the sky could actually fall.

We didn’t think Bem was uniquely bad — quite the opposite. He was a careful, respected researcher. His ESP studies were textbook social psych experiments… except for the part where the conclusion was impossible. That was the “Oh crap” moment.

If you can use the exact same tools we all use — the same sample sizes, the same flexible analyses, the same significance threshold — and “prove” that people can sense future porn images, then the tools are the problem.

Bem’s preprint basically did us a huge, awkward favor. It made the absurdity visible. When we showed that listening to “When I’m 64” could make you younger, people could laugh it off as a gimmick. But ESP-for-porn was different — you couldn’t just roll your eyes and move on. It made the broader point painfully clear: the methods in play could produce convincing evidence for things that aren’t real.


Q: And as they say, “the rest is history.” The False-Positive Psychology paper probably would not have been published in Psychological Science without Bem’s article?

A: Yeah… that’s probably true. If Bem’s paper hadn’t hit the field like a thunderclap, our paper might have been one more “methods rant” that editors politely reject for being “too negative” or “not of broad interest.”

Before Bem, a lot of editors and reviewers would’ve shrugged and said, “Sure, p-hacking is possible, but does it really happen in good research?” After Bem, the answer was staring them in the face — and it was saying, “I can feel future erotic pictures.”

His paper shifted the Overton window. Suddenly it was acceptable, even necessary, to question the way we ran studies and analyzed data. That gave False-Positive Psychology a much smoother ride into Psychological Science. Instead of us trying to convince the field there was a crisis, the crisis had already RSVP’d and brought champagne.

So yes, in a weird way, Bem was our unintentional co-pilot. Without him, our Beatles-aging-you study might still be in some file drawer, right next to “things that will never see daylight.”


Q: Have you ever wondered how Bem actually p-hacked his studies? Not all of your simulated strategies work. Like in the erotic pictures study, the DV was button presses. So, no multiple variables. And sample sizes were large and somewhat consistent. So, no checking until p < .05.

A: Yes — we’ve wondered about that a lot. The “classic” False-Positive Psychology tricks — lots of DVs, optional stopping — don’t seem to be the main engine in some of Bem’s experiments. That’s part of why his data are so interesting: they force you to think beyond the obvious p-hacking menu.

Possible levers: trial/participant exclusions, covariates, condition filtering, moderator analyses (like extraversion), and especially research-program-level selection — publish the ones that work.

And — important point — some of Bem’s p-values were just barely under .05. With enough experiments, even without aggressive hacking, chance alone will produce a few false positives. Combine that with any of the above, and you can end up with nine straight “significant” studies.


Q: Bem actually mentions lots of “pilot studies.” In an interview he explained that he would carefully monitor study results from the start. This suggests that he abandoned studies with results that were not promising quickly, to avoid studies with failures and capitalize on promising early trends. Is there any evidence that this was his p-hacking strategy?

A: We’ve looked into that. On paper, it sounds like “pilot-dropping” — run a bunch of pilots, keep the ones that look promising, quietly bury the rest. But when you simulate that strategy, it actually tends to backfire. It doesn’t produce the kind of strong, consistent early effects you see in Bem’s published studies.

A better fit is what we call pilot-hacking: work the pilot until it delivers a significant result — tweak the analysis, tweak exclusions, adjust the procedure — then treat that as the starting point for your main study. This inflates the effect in the early sample, making it much easier to get something publishable at the end.

Pilot-hacking feels “honest” because you’re “just refining the study,” but statistically it’s p-hacking in disguise. And given Bem’s own account of closely tracking results from the very beginning, that pattern lines up a lot better than simple pilot-dropping.


Q: Wasn’t this strategy suggested by Ulrich Schimmack from R-Index? He showed that the effect size decreases in Bem’s studies when he analyzed the raw data. Doesn’t that suggest capitalizing on early promising results?

A: Yes, Schimmack’s re-analysis pointed in that direction. When you see effect sizes shrink from the early part of the study to the later part, it’s a telltale sign that the early data were unusually strong — maybe by chance, maybe by piloting until things “look good” — and then the subsequent data dilute it.

That’s not a smoking gun for pilot-hacking, but it’s consistent with it. And if you also find that this pattern repeats whenever data from multiple smaller studies are combined, that’s another clue. It means you may be seeing a series of “promising starts” stitched together into something that looks like one coherent study but is actually several cherry-picked segments.


Q: But how did Bem add hacking to the dropping? What hacking method could he have used in the prediction of erotic pictures in a simple experiment that shows them either on the left or right side of the screen?

A: Even with a binary DV like left/right choices, there’s room for flexibility: you can decide which trials to include, set cutoffs for participant inclusion, choose how to handle missing data, or add post-hoc moderators like personality traits. But we agree — it’s narrower than in many social psych studies.

In Bem’s case, our best guess is that pilot-hacking combined with selective reporting across studies was doing most of the work. That, plus normal sampling variability, can get you a surprising number of “hits” if you run enough experiments.


Q: Aren’t you reaching a bit? The responses were left/right, not reaction times. There’s no transformation for binary responses.

A: True — you can’t play the “log-transform to get significance” game here. But p-hacking isn’t just about math transforms. It can be about deciding which subset of trials or participants “count,” or adding moderators after the fact. And if the studies are part of a bigger research program, you can pick the ones that “work” and leave the rest in the drawer.


Q: So, you have no real evidence that Bem used any of your hacking strategies. Your response to Schimmack also ignores that Bem did something even more problematic that many would not consider appropriate: combining data from different studies, where the pattern of early promising results repeats at the beginning of each subset. Wouldn’t this pilot-dropping plus combining datasets be enough to get significance?

A: Yes, combining promising subsets can be enough. The math is simple: if you start each “mini-study” with a lucky streak, then stitch them together, you get an inflated overall effect. And because each segment is big enough to look like a “real” dataset, it can fly under the radar.


Q: Can you explain in more detail why pilot-dropping alone is not sufficient?

A: If you drop a pilot after just a few trials because the effect looks small, that’s a safe bet statistically — you’re just saving time. The problem is, if you also keep pilots that look big, you bias your published set toward inflated effects.

The tricky part is that if you then keep collecting after that lucky start, the extra data will dilute the effect. That’s why pure “drop the bad pilots” doesn’t give you the consistently large effects you see in Bem’s work — you need something like pilot-hacking or stitching together short promising runs to maintain the inflation.


Q: Ok, but pilot-dropping is much more efficient than the optional stopping p-hacking in your simulations.

A: Yes, optional stopping with small increments barely moves the false-positive needle compared to starting over with a fresh sample after a disappointing pilot. The latter is more efficient at producing publishable results because you’re essentially doubling your shots on goal.


Click here to read Part 2, which will cover p-curve, z-curve, and the closing reflections.

The Difference Between Power, Discovery Rates, and Expected Discovery Rates


When we talk about the credibility of research findings, three terms often come up: power, discovery rate, and expected discovery rate. They sound related — and they are — but they’re not interchangeable. In fact, using them loosely can create real confusion, especially for students learning statistics or readers trying to understand meta-science.

Let’s unpack what each term means, how they relate, and why it matters to use them precisely.


1. Power: The Conditional Probability of Significance

In classical statistics, power is defined as: Power = P(significant result | true effect size)

That’s the conditional probability of obtaining a significant result, assuming:

  • A specific true effect size (not zero),
  • A given sample size, and
  • A chosen significance criterion (α).

It’s called “conditional” because it depends on knowing (or assuming) the effect size is real and fixed. Power is typically used before a study to plan how large your sample should be.

Example:
If you believe the true effect size is d = 0.5 and you run 64 participants per group at α = 0.05, your power is about 0.80. That means roughly 80% of studies with that setup would produce a significant result.
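For readers who want to check such numbers, base R’s power.t.test() computes power for a two-sample t-test (the values below are an illustration of the calculation, not part of the original example):

power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = .05)   # power ~ .80
power.t.test(power = .80, delta = 0.5, sd = 1)               # required n ~ 64 per group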


2. Discovery Rate: The Observed Percentage of Significant Results

The discovery rate is simply the proportion of significant results in your observed dataset: Discovery Rate = (number of significant results) / (total number of results)

It’s an empirical frequency, not a probability in the theoretical sense.
If you analyze 100 tests and find 40 with p < .05, the observed discovery rate is 40%.

The problem: calling this a “rate” can mislead people into thinking it’s the true long-run probability of significance. It’s not — it’s just what happened in this sample.


3. Expected Discovery Rate (EDR): The Probability of Significance

The expected discovery rate is the long-run probability of obtaining a significant result, averaged over all studies — whether their hypotheses are true or false.

Mathematically: EDR = α · P(H₀) + Power · P(H₁)

Where:

  • α · P(H₀) is the false positive contribution,
  • Power · P(H₁) is the true positive contribution,
  • and P(H₀) and P(H₁) are the proportions of studies with true nulls and true alternatives.

If P(H₀) is small (true nulls are rare), the EDR is close to the average power. If P(H₀) is large, the EDR can be much lower than nominal power.
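A small R sketch of the formula; the mix of true nulls and true effects below is hypothetical and only illustrates the calculation:

alpha  <- .05
power  <- .80
p_null <- .40                      # hypothetical share of studies testing a true null
EDR <- alpha * p_null + power * (1 - p_null)
EDR                                # .02 + .48 = .50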

In meta-science, methods like z-curve use published p-values to estimate the EDR. This gives an estimate of the true long-run proportion of significant results, correcting for selection bias.


4. Why Terminology Matters

  • Power is conditional on the effect size being real.
  • Discovery rate is observed and may not match the underlying probability.
  • Expected discovery rate is unconditional, combining results from both true and false hypotheses, and represents a probability in the population of studies.

Mixing these up can lead to ontological errors — for example, treating an observed rate as if it were a true probability, or interpreting “power” when the null is actually true.


5. Takeaway

If you’re planning a study → talk about power (conditional probability).
If you’re describing what you found → talk about the discovery rate (observed proportion).
If you’re estimating the underlying probability of significance across studies → talk about the expected discovery rate (model-based probability).

Precise language helps keep the concepts — and the inferences we make from them — clear.


The Problem of Transforming Test Statistics into Z-Values for Z-Curve Analysis


One of the distinctive features of z-curve is that it models the distribution of z-scores, even when the original studies report a wide range of test statistics — t, F, χ², correlations, and more. To make this possible, all reported test statistics are converted into two-sided p-values, and then into equivalent z-scores:

z = Φ⁻¹(1 − p/2)

Don’t be scared by the Greek symbol. The formula just tells us to divide the two-sided p-value by 2, subtract that value from 1, and find the corresponding z-value for this cumulative probability (R code: qnorm(1 - p/2)).

Starting with two-sided p-values implies that the z-values are all positive. The sign is ignored because z-curve is designed for heterogeneous sets of studies where effects may be in different directions. The sign only matters in a meta-analysis, where negative effects can be interpreted as effects in the direction opposite to the prediction.
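In R, the full conversion for a reported test statistic looks like this (illustrated with a hypothetical t-test; the same p-to-z step applies to F or χ² statistics):

t  <- 2.10; df <- 28                            # hypothetical reported values
p  <- 2 * pt(abs(t), df, lower.tail = FALSE)    # two-sided p-value
z  <- qnorm(1 - p / 2)                          # equivalent absolute z-score
c(p = p, z = z)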

The Theoretical Concern

This transformation makes a big approximation: it treats every test statistic as if it came from a two-sided z-test with standard error 1. An anonymous reviewer on Replication Index put it bluntly:

“There is no justification that these z-scores will follow a z-distribution … it’s weird to assume that p-values transformed to z-scores might have the standard error of 1.” Questionable Reviewer Practices: Dishonest Simulations – Replicability-Index

The criticism is that real test statistics can differ meaningfully from the standard normal, especially in small samples where t and F distributions are heavier-tailed. Converting them all to z values may introduce bias into the estimated power distribution.

What the Developers Say

Jerry Brunner, co-author of z-curve, has acknowledged this is indeed an approximation:

“We pretend that all the tests were actually two-sided z-tests with the results in the predicted direction.”

Simulation studies show that the transformation works well in standard conditions. The reason is that the t-distribution approximates the normal distribution more and more as sample sizes increase. The transformation problem is limited to small samples, but what is small?

When It Becomes a Problem

The trouble starts when two things are small at the same time:

  • Per-study sample size (N) is small — especially per-cell N ≲ 20–30.
  • Number of studies (k) is small — ≲ 20–30.

In this low-N, low-k regime, simulation results show that the t-to-z approximation can bias estimates of mean power, typically underestimating it when true power is moderate to high (> .50).

As N increases, the t distribution converges toward the standard normal, and by N ≈ 80 per study the bias is negligible. Likewise, having more studies (k ≥ 100) smooths out the effect of the approximation.
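A quick way to see this convergence is to convert the same t-value to a z-value at different degrees of freedom (illustrative values only):

t <- 3
for (df in c(10, 30, 80, 1000)) {
  p <- 2 * pt(t, df, lower.tail = FALSE)
  cat("df =", df, "  implied z =", round(qnorm(1 - p / 2), 2), "\n")
}
# the implied z is noticeably smaller than t at df = 10 and approaches t as df grows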

The Practical Takeaway

For large, diverse meta-analytic datasets, especially with moderate or large per-study N, the p→z transformation appears to work well enough in practice. But for small-sample, small-k applications, the approximation may not hold, and z-curve estimates could be biased.

If you’re working in that small-N/small-k territory, you might consider:

  • Direct modeling of effect sizes with a selection model (e.g., weightr).
  • Sensitivity checks comparing the z-curve results to alternative bias-correction methods.
  • Using t-values directly as z-values. This introduces the opposite bias because t-values can be much larger than the corresponding z-values in small samples.
  • Fitting the data with a t-curve: instead of fitting a mixture model with a set of normal distributions, the model can be fitted with non-central t-distributions if all studies are small and have similar degrees of freedom.

In conclusion, converting test statistics from different tests into z-values is an approximation that can introduce biases, but this is only a problem when sample sizes are small (N < 30).

Datacolada Has Given Up on p-Curve

When p-curve debuted in 2014, it was billed as a powerful tool for detecting publication bias and estimating evidential value from the distribution of statistically significant results. Its creators — Leif Nelson, Joe Simmons, and Uri Simonsohn of Data Colada — vigorously defended the method in a series of posts and journal replies through 2018.

Those early defenses addressed challenges like:

  • Whether excluding p > .05 distorts results
  • How p-curve handles effect size heterogeneity
  • The method’s robustness to extreme (“ambitious”) p-hacking

By late 2018, the authors had answered their critics in technical detail and released p-curve 4.0.

Then… silence.

From 2019 onward, the Data Colada blog shifted almost entirely to fraud detection, replication case studies, and other bias-related topics. In the same period, criticism of p-curve intensified:

  • Brunner & Schimmack published simulation studies questioning its statistical properties.
  • Montoya (2024) showed that p-curve often yields unreliable or irreproducible conclusions in applied case studies.
  • Morey & Davis-Stober (2025) dissected the statistical foundations of p-curve’s evidential value tests and average power estimator, arguing that the method is flawed in principle.

Despite this, Nelson, Simmons, and Simonsohn have issued no substantive public defense of p-curve since 2018. The only post-2018 “activity” was a 2024 bug fix to the p-curve app after Richard Morey reported an error in the half-p-curve test.

Given their earlier pattern of responding quickly to criticism, the long silence is telling. While the authors have never publicly conceded that p-curve has serious problems, the lack of engagement with major post-2018 critiques suggests they have effectively abandoned the method as an active research program.

When eminent psychologists die, their passing is announced in obituaries in the American Psychologist. Not so for psychological theories or methods: they never die; they simply fade away. Maybe that should change. I say, “p-curve is dead. Long live z-curve.”


A Decade After False Positive Psychology: What Have We Learned About P-Hacking and the False Positive Risk in Psychology?

Keywords: False Positive Psychology, p-hacking, false discovery rate, preregistration, Registered Reports, p-curve, z-curve


False Positive Psychology: Then and Now

In 2011, Simmons, Nelson, and Simonsohn shook the field of psychology with their False Positive Psychology article in Psychological Science.
Their simulations demonstrated that undisclosed researcher flexibility — now widely known as p-hacking — could inflate the nominal 5% Type I error rate to as high as 60% in worst-case scenarios.

The message was simple: a large share of “statistically significant” findings could be false positives, especially if researchers exploit analytical flexibility.

But over a decade later, important questions remain:

  • What is the actual false discovery rate (FDR) in psychology?
  • Have reforms like preregistration badges, Registered Reports, or bias-detection tools such as p-curve and z-curve reduced that risk?

1. The false positive risk: still uncertain

Since 2011, several statistical methods have tried to estimate psychology’s false discovery rate (FDR):

  • P-curve analysis (Simonsohn et al.) examines the distribution of significant p-values to detect evidential value.
  • Z-curve (Schimmack & Brunner) models the distribution of test statistics to estimate average power, selection bias, and the maximum FDR.
  • Jager & Leek (2014) used p-curve-like ideas in medicine, estimating an FDR around 14%.

Despite these advances:

  • No consensus “official” FDR exists for psychology.
  • Estimates vary widely across subfields, methods, and datasets.
  • The potential for high FDR is clear, but the actual FDR remains debated.

2. Preregistration badges: more transparency, unclear effect on p-hacking

Badges for preregistration and open data — popularized in journals like Psychological Science — were intended to reduce p-hacking and improve reproducibility.

  • van den Akker et al. (2024) found that preregistered studies more often included power analyses and had slightly larger sample sizes.
  • However, preregistration quality was inconsistent. Many left flexibility in variables, exclusion rules, or analyses — meaning p-hacking could still occur.
  • Importantly, the rate of significant results did not drop substantially, suggesting no clear reduction in FDR.

Badges may change norms and signal transparency, but without enforcement, they are no guarantee against p-hacking.


3. Registered Reports: stronger evidence for reducing bias

By contrast, Registered Reports provide strong bias control:

  • The study plan is peer-reviewed and accepted before data collection.
  • Publication is guaranteed regardless of results.
  • Scheel et al. (2021) showed that Registered Reports in psychology had roughly half the proportion of significant findings compared to standard articles (44% vs. 96%), consistent with a large drop in publication bias and inflated FDR.

4. Why p-hacking still matters even if the null is rarely strictly true

Some argue that because the null hypothesis is never exactly true, “false positives” are rare.
However:

  • Psychology often tests effects so small they have no practical significance.
  • In these cases, p-hacking can still fill the literature with results that reject the null but have trivial effect sizes — misleading theory and application.
  • The real cost of p-hacking is inflated effect size estimates and unreliable directional claims.

5. Moving forward: z-curve, p-curve, and better publication models

  • P-curve can still be useful for detecting evidential value, but it performs poorly when there’s heterogeneity in effect sizes or sample sizes.
  • Z-curve extends p-curve logic, handling heterogeneity and providing estimates of average power, selection bias, and maximum FDR.
  • Expanding Registered Reports could provide structural protection against p-hacking, making badges more than a symbolic reform.

Bottom line

Fourteen years after False Positive Psychology, we know:

  1. P-hacking can dramatically inflate false positive risk.
  2. The actual FDR in psychology remains uncertain.
  3. Badges improve transparency but have not clearly reduced FDR.
  4. Registered Reports are the most effective reform so far.
  5. Tools like z-curve can quantify bias and FDR more accurately than earlier methods like p-curve.

The conversation must now shift from detecting p-hacking to changing the incentive structure so researchers have no reason to engage in it.


Z-Curve.3.0 Tutorial: Chapter 7

You can find links to the other chapters on the post with Chapter 1.

This chapter compares z-curve 3.0 to the Bacon mixture model using both unbiased replication data and a 66-scenario simulation study. We evaluate accuracy for key meta-analytic metrics (EDR, ERR) under varying heterogeneity conditions. The results show that the default z-curve method performs better than the Bacon mixture model. The chapter explains why z-curve is performing better.

Chapter 7: Z-Curve with Bacon

Z-curve is one of the few mixture models that model selection bias. Many mixture models have been developed to analyze significant and non-significant z-values under the assumption that there is no selection bias. This assumption makes these models not very useful for meta-analysis of published results because selection bias can distort these results. Nevertheless, it is interesting to compare these mixture models with z-curve.

Z-curve was specified in four different ways using the density approach. First, the default model with seven components (ncz = 0, 1, 2, 3, 4, 5, 6) was used. Second, a model with three components with free population means and a fixed SD of 1 was used. Third, a model with two components with free population means and standard deviations was used; this is a frequentist alternative to the Bayesian bacon mixture model. Fourth, the weights, means, and standard deviations estimated by the bacon model were used as fixed values in a z-curve model (see Figure 4).

The population means, population standard deviations, and weights were used to compute the expected discovery rate (EDR); that is, the average power of all studies, whether their results are significant or not. Computing this probability is not really useful here because, in the absence of selection bias, it simply matches the observed percentage of significant results, so the estimate of average power tells us nothing new. Instead, the results help us evaluate the performance of the different mixture models.

The code for this chapter can be found on GitHub.
zcurve3.0/Tutorial.R.Script.Chapter7.R at main · UlrichSchimmack/zcurve3.0

Illustration With Open Science Collaboration Data

A one-component model with a normal distribution runs into problems when the true distribution is not normal. In a simulation with just three components centered at 0, 2, and 4, this happens when 50% of the studies have z = 0 (the null hypothesis is true) and 50% have high power (z = 4, power = 98%).

For a two-component model like bacon this is not a problem. It can fit this scenario with two components that have means of 0 and 4 and standard deviations of 1 for the sampling error.

The real problem for models with two components and normal distributions is that real data often have a long tail of studies with large z-values (z > 6). Models like p-curve cannot handle these long tails and produce greatly inflated estimates of power (see Chapter 5). Z-curve solves this problem by fitting the model to z-values below 6 and assigning a power of 1 to z-values greater than 6. Remember, even a z-value of 4 already corresponds to 98% power.

I am using the data from the replication studies of the reproducibility project (Open Science Collaboration, 2015) to illustrate z-curve and bacon (see Chapter 3 for a detailed examination of these data with z-curve). The advantage of using the replication data is that they have no publication bias.

Figure 1 shows the results with the default model with seven components and fixed variances of the sampling error at 1. I used the density method to make the results more comparable.

The results confirm that there is no publication bias. The observed discovery rate (the percentage of significant results) is 33%, and the z-curve estimate of the expected discovery rate is also 33%. The fitted line matches the histogram fairly well. The expected replication rate is 74%; that is, 74% of the studies that produced a significant result are expected to be significant again in a second replication attempt. Based on the EDR of 33%, we can estimate the false discovery risk (FDR), that is, the maximum percentage of significant results that could still be false positives. The FDR is 11%.
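The FDR reported here can be reproduced with Sorić’s bound, which z-curve uses to convert the EDR into a maximum false discovery risk:

edr   <- .33
alpha <- .05
(1 / edr - 1) * alpha / (1 - alpha)    # ~ .11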

Figure 2 shows the results for a model with three components with free means and SD = 1. The results are within 2 percentage points of the previous results, and the EDR is overestimated by 1 percentage point. This shows how flexible mixture models are: they can fit data even if the model does not match the data-generating model. The means and SDs of the components are not interpretable, but they can be used to estimate average power (Brunner & Schimmack, 2020).

Figure 3 shows the results for a model with two components with free means and SDs. These results are directly comparable to the results for the bacon model which also uses two free components, but with a Bayesian estimation method. The results with the z-curve density method are similar to the previous results. This shows that a model with two free components is flexible enough to approximate the distribution of these data.

Figure 4 shows the results for the bacon model. Here the bacon model was used to fit the data, and the estimated weights, means, and standard deviations were used as fixed values for the z-curve model. Importantly, the z-curve method still treated the 5 extreme values (z > 6) as studies with 100% power.

The bacon model assigned a weight of .04 to negative population means. The weight, mean, and standard deviation of the first component were w = .82, M = 0.06, and SD = 1.21. The weight for the positive component was .14. The standard interpretation of these results would be that only 14% of these results reflect a true effect in the predicted direction.

When we use the components to estimate a z-curve, we see similar results to the analysis with the z-curve density method. We also see that the false positive risk for significant results is 9%. With 33% significant results, this suggests that 33*(1 – .09) = 30% of the results are true positive results. This estimate is higher than we would predict based on the bacon method, 14%.

The problem with the 14% estimate is that it rests on interpreting the components as if they corresponded directly to false and true hypotheses. However, some studies with z-values close to zero may have tested a true effect with low power. The z-curve method of estimating the false positive risk avoids this problem.

Figure 5 shows the bacon results without truncating z-values at 6. Although there are only 5 z-values greater than 6, the bacon estimate now underestimates the EDR and ERR because a model that assumes a normal distribution does not capture the tail of extreme values. The low estimate of the EDR also implies an inflated estimate of the false discovery risk.

In conclusion, the frequentist z-curve method without priors performs as well as or better than the Bayesian bacon method with priors. Moreover, the bacon mixture model’s assumption that there are only two components limits its ability to fit data with a long tail of strong findings. This problem can be fixed by excluding the tail from the range of values fitted by the model, but there is no need to fix bacon because z-curve already performs well.

The Simulation

The setup of the simulation study is described in more detail in Chapter 4. The most important information is that the original studies are assumed to be z-tests with three levels of power. Z0 assumes a test of a true null hypothesis, with the z-distribution centered at 0; this gives only a 2.5% probability of obtaining a significant result (p < .05, two-sided) with an effect in the predicted direction. Z2 simulates a distribution centered at 2, slightly above the critical value for a significant result (z = 1.96); this corresponds to moderate power, slightly above 50%. The third component, Z4, is centered at 4, where power is 98%. Mixtures of these three components produce sets of studies whose average power covers the full range, with few extreme values (z > 6) that might be considered outliers and are not used for fitting z-curve. The simulation program samples from the three distributions. To examine systematic biases, the simulation created 10,000 significant z-values. These simulations do not have a long tail of extreme values, which makes it possible for bacon to perform as well as or better than z-curve.
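A minimal R sketch of this data-generating setup (simplified for illustration; the full simulation code is in the linked GitHub script, and the mixture weights below are arbitrary):

set.seed(1)
k       <- 10000
means   <- c(0, 2, 4)                  # Z0, Z2, Z4 components
weights <- c(1, 1, 1) / 3              # arbitrary illustrative mixture
ncp <- sample(means, k, replace = TRUE, prob = weights)
z   <- rnorm(k, mean = ncp, sd = 1)
mean(z > qnorm(.975))                  # share significant (p < .05, two-sided) in the predicted direction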

Results

Expected Discovery Rate

To compare the overall performance of all five models, I computed the root mean square error (RMSE). A value of .01 implies that estimates are on average within 1 percentage point of the true value.

RMSE
1. Default z-curve : .007
2. Three components : .007
3. Two components : .012
4. truncated bacon : .013
5. full bacon : .015

Figure 6 visually compares the default z-curve method with the full bacon method.

Both models produce good estimates, but these estimates are not useful in real data because we can simply compute the percentage of significant results to estimate average power. This method produces smaller confidence intervals and makes fewer assumptions. The only problem is that it assumes no publication bias, but that is also true for bacon.

Expected Replication Rate

The default z-curve method and the three-component model work well. The two-component z-curve model is better than the truncated bacon model, and the truncated bacon model is better than the full bacon model.

RMSE
1. Default z-curve : .007
2. Three components : .010
3. Two components : .018
4. truncated bacon : .034
5. full bacon : .011

Conclusion

This chapter compared z-curve to another mixture model that is used to analyze data with large numbers of statistical tests in large samples. The results show that z-curve performs better than this model because a model with two normal components does not capture the distribution of real data well, especially when they have a long tail of strong results. Brunner and Schimmack (2020) avoided this problem by excluding the tail of extreme z-values from the range of values to be fitted by the model. This does not mean that these values are ignored. Rather, extreme values have little bias and high power, so we can simply adjust the model estimates by assuming that the studies in the long tail have 100% power.

The bacon method is also not very useful for applied research purposes. First, we do not need to estimate the EDR when there is no publication bias, because we can simply compute the percentage of significant results to get an estimate of average power. This estimate can then be used to estimate the false discovery risk. The whole point of creating z-curve was to estimate power when publication bias is present. Z-curve is one of the few methods that have been developed and validated to detect and correct for publication bias. Another method was introduced earlier by Jager and Leek (2013), but Schimmack and Bartoš showed with simulation studies that z-curve performs better. Therefore, z-curve is currently the best method when publication bias is present.

P-Curve vs. Z-Curve: Why Meta-Analysts Are Moving On

Summary (Meta-Description)

P-curve can detect evidential value, but fails under heterogeneity. Z-curve offers accurate replicability estimates with realistic assumptions.

Keywords:

  • p-curve
  • z-curve
  • p-curve vs z-curve
  • heterogeneity in meta-analysis


P-Curve vs. Z-Curve: Why Meta-Analysts Are Moving On

If you’ve been following developments in meta-analysis, you’ve probably heard of p-curve—a method introduced by Simonsohn, Nelson, and Simmons to assess whether a set of statistically significant results shows evidential value (i.e., that not all results are false positives). P-curve became popular because it was simple: take the significant p-values from a set of studies, plot their distribution, and see if there are more very small p-values (e.g., < .01) than you’d expect by chance.

But here’s the catch: p-curve makes a strong assumption that all studies have the same underlying statistical power (homogeneity). In the real world—where effect sizes and sample sizes vary—this assumption rarely holds. And when it doesn’t, p-curve’s “average power” estimate can be severely biased, often overestimating the replicability of a research area.


The Heterogeneity Problem

In practice, studies differ in:

  • Sample size
  • Effect size
  • Measurement reliability
  • Design quality

These differences produce heterogeneity in statistical power. Under heterogeneity, p-curve’s single-parameter model is misspecified—it treats all studies as if they were equally powered, leading to misleadingly high and overconfident estimates.

This problem has been demonstrated in multiple simulation studies (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022) and in Chapter 5 of the Z-Curve 3.0 Tutorial (2025). The conclusion is consistent: p-curve works for detecting evidential value, but it’s unreliable for quantifying it when power varies.


Enter Z-Curve

Z-curve was developed by Ulrich Schimmack and Jerry Brunner to directly address the heterogeneity issue. Instead of assuming one “true” power level, z-curve models the distribution of significant z-values as a mixture of components, each with its own noncentrality parameter.

This approach:

  • Accurately recovers the expected replication rate (ERR) and expected discovery rate (EDR) under heterogeneity.
  • Produces calibrated confidence intervals that reflect the true uncertainty.
  • Works with the same type of input data as p-curve (significant test statistics), so researchers can easily compare both.
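In practice this means a z-curve analysis can start from the same significant test statistics that would go into a p-curve. A minimal sketch, assuming the CRAN zcurve package (Bartoš & Schimmack) and a vector of two-sided p-values from significant results (p_values is a placeholder for your own data):

# install.packages("zcurve")
library(zcurve)
z   <- qnorm(1 - p_values / 2)   # convert two-sided p-values to absolute z-scores
fit <- zcurve(z)                 # fit the mixture model
summary(fit)                     # ERR and EDR estimates with confidence intervals
plot(fit)                        # histogram of z-values with the fitted curve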

Simulation Comparisons

The Z-Curve 3.0 Tutorial, Chapter 5 ran head-to-head simulations:

Feature | p-curve | z-curve
Model of power | Single (homogeneous) | Mixture (heterogeneous)
Accuracy under heterogeneity | Overestimates power | Accurate estimates
Confidence intervals | Too narrow, misleading | Proper coverage
Practical use | OK for “is there any evidential value?” | Best for “how much?” and “how replicable?”

Results were striking: in realistic mixed-power scenarios, z-curve’s points clustered along the 45° line (accurate estimates), while p-curve’s often overshot, giving an illusion of higher replicability than was actually present.


When to Use Each Method

  • If your only question is: Are all these studies just false positives? — p-curve can still be useful for a quick test of evidential value.
  • If you care about: How much evidential value? How replicable are these findings? — use z-curve, especially when power is likely to vary across studies.

Bottom Line

P-curve was an important step forward, but it’s been surpassed by methods that reflect the complexity of real research literatures. Z-curve is the better choice for quantifying evidential value and replicability, particularly under realistic heterogeneity.


Further Reading & Resources

  • Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4.
  • Schimmack, U. (2025). Z-Curve 3.0 Tutorial, Chapter 5: P-Curve vs Z-Curve. replicationindex.com
  • Simonsohn, U., Nelson, L., & Simmons, J. (2014). P-curve: A key to the file-drawer. JEP: General.

Review of “On the Poor Statistical Properties of the P-Curve Meta-Analytic Procedure”

This post-publication peer-review was created in collaboration with ChatGPT5.0.


“On the Poor Statistical Properties of the P-Curve Meta-Analytic Procedure” by Richard D. Morey & Clintin P. Davis-Stober https://doi.org/10.1080/01621459.2025.2544397

Overview

Morey & Davis-Stober present a detailed critique of p-curve, a set of statistical tests proposed by Simonsohn et al. (2014, 2015) to assess evidential value in sets of published results. They identify conceptual and statistical issues, particularly with the formal tests in the p-curve app and its average power estimation feature. Their central conclusion is that p-curve, in its current form, should not be used for either hypothesis testing or power estimation.

The paper is well-written, technically competent, and thorough in its exploration of certain flaws. However, the review also has significant omissions and imbalances. It fails to situate p-curve within the broader trajectory of methodological development in forensic meta-analysis, particularly the work by Brunner and Schimmack on z-curve, which directly addresses and resolves the main statistical weakness Morey & Davis-Stober highlight. Moreover, in some places the critique overstates p-curve’s practical shortcomings and risks giving the impression that no viable alternative exists.


Strengths of the manuscript

  1. Detailed breakdown of p-curve’s formal tests – The authors provide an in-depth explanation of EV, LEV, and related tests, clarifying the exact null and alternative hypotheses, and showing the disconnect between these hypotheses and the informal “evidential value” narrative.
  2. Identification of statistical weaknesses – The analysis of inadmissibility, non-monotonicity, and sensitivity to p-values near the significance threshold is technically sound and important for understanding the limitations of the 2015 probit-based tests.
  3. Critique of the average-power estimator – The authors convincingly show that p-curve’s current estimator is inconsistent under heterogeneity in effect sizes or sample sizes, with potential for substantial upward bias.

Limitations and omissions

  1. Lack of historical context on heterogeneity problem
    The authors treat the discovery that p-curve’s power estimation fails under heterogeneity as novel to their work. In fact, heterogeneity bias in p-curve was already documented in Van Aert et al. (2016) and more systematically addressed by Brunner & Schimmack when they introduced z-curve (Brunner & Schimmack, 2020, 2021).
    • Z-curve models the full distribution of significant results as a mixture of components with different noncentrality parameters, eliminating the unrealistic homogeneity assumption that underlies p-curve’s bias.
    • By omitting this work, the authors under-represent the state of the field and miss an opportunity to frame their critique as part of a solved problem.
  2. Overstatement of p-curve’s practical failings
    While the identified statistical properties are real, they can sound more damaging to applied use than they are:
    • For detection of evidential value, the original 2014 log-based EV test remains a valid way to reject the “all effects are zero” null in the absence of severe violations (e.g., extreme p-value dependence).
    • The main unreliability lies in quantifying evidential value with the app’s average-power feature, not in detecting it.
    • Readers may leave the paper believing that p-curve is useless in all forms, which is an overgeneralization.
  3. No acknowledgment of improved alternatives
    The authors conclude by recommending against the use of p-curve in any form, but do not inform readers that more robust successors exist:
    • Z-curve resolves the inconsistency under heterogeneity, produces bias-corrected average-power estimates with confidence intervals, and retains evidential-value detection capability.
    • Bayesian hierarchical selection models offer another path, explicitly modeling the distribution of λ and the selection process.
    • Presenting these options would give readers a constructive way forward instead of ending with a methodological dead end.
  4. Unclear separation between coding issues and statistical flaws
    For example, the Montoya et al. (2024) criticism about instability from multiple candidate p-values per study is cited without noting that Simonsohn et al.’s own guidelines already require choosing the most focal test, which largely avoids this issue. This blurs the line between problems inherent to the method and problems arising from poor implementation.

Balanced interpretation

  • What the paper gets right:
    • The p-curve app’s 2015 probit-based tests (*EV, *LEV) and its average-power feature have real, demonstrable weaknesses.
    • These include inadmissibility, high sensitivity to near-threshold p-values, and inconsistency under heterogeneity.
  • What the paper underplays or omits:
    • The fact that detection of evidential value via the 2014 log-based EV test is less affected by these problems.
    • That heterogeneity bias was recognized years earlier and has already been addressed in z-curve and related mixture-model approaches.
    • That practical alternatives exist which preserve the core logic of p-curve but remove the statistical defects the authors highlight.

Conclusion

Morey & Davis-Stober’s paper is valuable as a technical audit of the p-curve app’s internal workings, but it is incomplete as a guide for applied researchers. The heterogeneity problem they identify is neither new nor insurmountable: it has been solved in the z-curve framework, which replaces the single-λ assumption with a mixture model of noncentrality parameters and yields reliable average-power estimates under realistic conditions.

A balanced takeaway is:

  • P-curve is serviceable for detecting the presence of evidential value in relatively homogeneous sets of studies, but unreliable for quantifying it in the presence of heterogeneity.
  • Z-curve (or similar mixture-model approaches) should be preferred when heterogeneity is suspected, as it retains the interpretive clarity of p-curve while avoiding its statistical weaknesses.

Table – Comparison of P-curve and Z-curve

Feature | P-curve | Z-curve
Primary Goal | Test for “evidential value” (reject all-H₀ null) and estimate “average power” | Estimate average power / replicability while accounting for heterogeneity and selection
Key Assumption | Single underlying noncentrality parameter (homogeneous power) | Multiple underlying noncentrality parameters (heterogeneous power) via mixture model
Selection Model | Implicit truncation at α (e.g., p < .05) | Explicitly models truncation and the selection process
Heterogeneity Handling | Not modeled; leads to biased power estimates | Modeled via a mixture of truncated distributions
Evidential Value Detection | Works reasonably when assumptions hold; EV test valid under independence | Also provides detection, via model fit and comparison to the null
Average Power Estimation | Biased under heterogeneity; CIs unreliable | Unbiased under heterogeneity (within sampling error); CIs valid
Outputs | EV test p-value; average power estimate; CI (invalid if heterogeneity present) | Expected replication rate (ERR), expected discovery rate (EDR), average power, valid CIs
Best Use Case | Homogeneous, independent, significant p-values | Realistic, heterogeneous, possibly dependent significant p-values
Key References | Simonsohn, Nelson, & Simmons (2014, 2015) | Brunner & Schimmack (2020, 2021)