I have posted several blog posts about Pek et al.’s (2024) flawed discussion of power as a tool to evaluate published studies. Some readers have questioned my interpretation of the article: “Surely, this is not what they are saying.” Indeed, their claims are so ridiculous that it is hard to believe that a respected journal published the article. To make clear that they are saying exactly what I am saying they are saying (oh boy), I found relevant quotes and asked ChatGPT to translate them into plain language, to avoid injecting my own biases into the interpretation of their writing. ChatGPT also provided a point-by-point comparison of their key claims and my rebuttals.
I sent this summary to Pek to check whether it represents her position accurately. I will let you know if there is a response. Feel free to express your own views in the comments section.
1. Core Claim #1 — Power Is a Pre-Data Property, Not a Post-Data Measure
Plain language:
They argue that statistical power is defined as the probability of rejecting H₀ before you see the data, given a specified α, effect size, and sample size. Once the study is completed and the data are fixed, that probability no longer applies; you can’t “retroactively” assign a probability to an event that has already happened.
Quote:
“Thus, power is a property of a test (procedure) for some value of α, N, and μ and not a property of observed data.” (p. 7)
Quote:
“Applying a probability over random data to fixed (observed) data is a fatal ontological error.” (p. 8)
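To make the pre-data definition concrete, here is a minimal sketch of the calculation they have in mind, assuming a one-sided one-sample z-test; the values of μ, σ, N, and α are illustrative choices of mine, not theirs.

```python
from scipy.stats import norm

def prospective_power(mu, sigma, n, alpha=0.05):
    """Power of a one-sided one-sample z-test of H0: mu <= 0,
    computed before any data are collected."""
    z_crit = norm.ppf(1 - alpha)        # critical value under H0
    ncp = mu / (sigma / n ** 0.5)       # true mean in standard-error units
    return 1 - norm.cdf(z_crit - ncp)   # P(reject H0 | mu, sigma, n, alpha)

# Illustrative planning values: a true effect of 0.5 SD and N = 25
print(prospective_power(mu=0.5, sigma=1.0, n=25))  # ~0.80
```

Nothing in this calculation refers to observed data; that is the sense in which power is a property of the test procedure.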
2. Core Claim #2 — Using Observed Power Is Mathematically Redundant with the p-value
Plain language:
If you calculate “observed power” from the observed effect size and sample size, you are just transforming the p-value into another scale. This means observed power adds no new information beyond what you already get from the p-value.
Quote:
“Because observed power is a transformation of the p-value, it provides information about the completed study that is mathematically redundant with the p-value.” (p. 6)
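This redundancy is easy to verify. For a one-sided z-test, observed power is a monotone transformation of the p-value alone; here is a minimal sketch (the one-sided z-test setup is my illustrative assumption):

```python
from scipy.stats import norm

def observed_power_from_p(p, alpha=0.05):
    """'Observed power' of a one-sided z-test, computed from the
    p-value alone -- no other information about the study is needed."""
    z_obs = norm.ppf(1 - p)             # recover the test statistic from p
    z_crit = norm.ppf(1 - alpha)
    return 1 - norm.cdf(z_crit - z_obs)

print(observed_power_from_p(0.05))   # exactly 0.5 when p == alpha
print(observed_power_from_p(0.005))  # ~0.82
```

Because p = α maps to an observed power of exactly .50, every non-significant result comes with observed power below 50%; that is the sense in which single-study observed power adds nothing beyond the p-value.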
3. Core Claim #3 — Using Power to Evaluate Completed Studies Is Counterproductive
Plain language:
They argue that using power as a “credibility check” for published research can be misleading and harmful. They frame this as “power for evaluation,” meaning applying power calculations to completed studies to make judgments about them (either singly or in aggregate). They say this conflicts with the definition of power and encourages the misuse of probability for fixed outcomes.
Quote:
“We also describe why using power for evaluating completed studies can be counterproductive.” (Abstract)
Quote:
“Power for evaluation…conflicts with definitional properties of statistical power and should be discouraged.” (p. 5)
4. Core Claim #4 — Average Power Estimates Still Inherit the Same Problems
Plain language:
Even if you aggregate many studies and calculate “average observed power” from meta-analytic effect sizes or benchmarks (e.g., Cohen’s “t-shirt” effect sizes), you’re still plugging in effect size estimates from data. They say this is just a more complicated version of the same post-hoc problem — the estimate is imprecise, biased, and conceptually inappropriate for evaluating completed studies.
Quote:
“Analytical and simulation evidence…demonstrates that average observed power is relatively uninformative because its estimate is highly variable and imprecise…using observed power calculations to make statements about the power of tests in those completed studies is problematic because such an application of power is an ontological impossibility of frequentist probability.” (p. 13)
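Both the variability they describe and the value of aggregation are easy to demonstrate. A toy simulation of mine, assuming a one-sided z-test and a true power of .50 for every study:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)
alpha = 0.05
z_crit = norm.ppf(1 - alpha)

# Every simulated study has true power .50 (noncentrality == z_crit)
ncp = z_crit
z_obs = rng.normal(loc=ncp, scale=1.0, size=100_000)
obs_power = 1 - norm.cdf(z_crit - z_obs)   # observed power of each study

print(obs_power.mean())                    # ~0.50: stable in aggregate
print(np.quantile(obs_power, [0.1, 0.9]))  # ~[0.10, 0.90]: wild for one study
```

A single study’s observed power falls almost anywhere between .10 and .90, which supports their imprecision point; but in this scenario the average across many studies recovers the true value, which is the logic that bias-corrected aggregate methods such as z-curve build on.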
5. Core Claim #5 — Estimates Based on “T-Shirt” Effect Sizes Are Especially Unreliable
Plain language:
When researchers take a benchmark effect size (e.g., “medium = 0.5”) and combine it with the actual sample sizes in published studies to estimate “average power,” Pek et al. say this is disconnected from the actual effects in the literature. They argue that these benchmarks are hypothetical and therefore do not reflect the true power of the completed studies.
Quote:
“Taken together, power calculated from hypothetical t-shirt effect sizes μ and N̂ cannot be deemed to reasonably reflect the power of completed studies.” (p. 12)
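For reference, this is the kind of calculation at issue: plugging a benchmark d and the sample sizes of published studies into a power formula. A sketch using a two-sided two-sample normal approximation and an assumed typical n of 30 per group (both choices are mine, for illustration):

```python
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for a
    standardized mean difference d (normal approximation to the t-test)."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5   # noncentrality for equal group sizes
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)

# Cohen's benchmarks combined with an assumed typical n of 30 per group
for d in (0.2, 0.5, 0.8):
    print(d, round(power_two_sample(d, n_per_group=30), 2))  # 0.12, 0.49, 0.87
```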
6. Supporting Argument — Appeal to Authority & Prior Warnings
Plain language:
They cite multiple influential statisticians and methodologists who have criticized post-hoc power. This is meant to reinforce that the problems they’re pointing to are widely recognized, not their invention.
Quote:
“…unsurprisingly, this ontological error has prompted several influential statisticians and methodologists to object to poststudy applications of power.” (p. 9)
7. Supporting Argument — Power Calculations Should Be Reserved for Planning Future Studies
Plain language:
Their preferred framing is that power is for design, not evaluation. They suggest that if you want to assess completed research, you should use other tools (though they don’t strongly recommend specific alternatives).
Quote:
“Uses of Uncertain Statistical Power: Designing Future Studies, Not Evaluating Completed Studies” (title)
“We emphasize that the most useful role for power is in designing studies, not in evaluating completed research.” (paraphrased from discussion)
Pek et al. vs. Rebuttal — Point-by-Point
| Pek et al. Claim / Quote | Rebuttal |
|---|---|
| Power is a property of a test (α, N, μ) and not a property of observed data. “Applying a probability over random data to fixed (observed) data is a fatal ontological error.” | Power is indeed defined for population parameters, but we routinely estimate unknown population parameters from data. Estimating μ from observed data and then computing power from it is no more an “ontological error” than estimating μ itself. Both are subject to sampling error, but neither is conceptually invalid. |
| Observed power is a transformation of the p-value and adds no new information | True for a single study without bias correction, but irrelevant when aggregating many studies, modeling publication bias, or estimating the power distribution. In z-curve, p-values are used collectively to estimate the underlying distribution of true effects and selection, which yields information about replicability that is not in any single p-value. |
| Power for evaluation is counterproductive and should be discouraged | This overgeneralizes from the misuse of single-study observed power. Field-level evaluations (e.g., z-curve, excess significance tests, BUCSS) use many studies and account for bias, providing valuable diagnostics of research credibility. Discouraging this removes a key tool for detecting systemic issues. |
| Average observed power from meta-analytic effect sizes is imprecise, biased, and conceptually inappropriate for completed studies | All statistical estimates are uncertain — precision improves with larger datasets. Meta-analytic average power, especially when bias-corrected, is an interpretable measure of expected replicability. Critiquing its imprecision without considering dataset size or bias correction ignores advances like z-curve that address these limitations. |
| “T-shirt” effect sizes (e.g., Cohen’s benchmarks) do not reflect actual effects in the literature | Benchmarks are crude, but many “low power in psychology” studies use empirically derived effect sizes from actual published results — not just hypothetical 0.2/0.5/0.8. Even if initial estimates are imperfect, they provide a baseline showing the mismatch between typical sample sizes and plausible effect sizes. |
| Appeal to authority: influential statisticians warn against post-hoc power | Most of these warnings target naïve single-study calculations, not sophisticated bias-corrected meta-analytic methods. Citing these warnings to dismiss all uses of power for evaluation is a straw man. |
| Power should be reserved for study planning, not evaluating completed research | Evaluation is essential for identifying systemic underpowering and bias. Without it, we cannot quantify the gap between claimed and actual evidential value in a literature. Tools like z-curve and BUCSS bridge the gap between evaluation and design by using completed studies to inform better future research. |
| Footnote on Brunner & Schimmack (2020) and Simonsohn et al. (2014): labeling average power a “replicability estimate” is misleading | This criticism ignores that average bias-corrected power does correspond to the expected proportion of significant replications under similar conditions — which is the definition of replicability in this context. The label is appropriate when the estimation method accounts for selection and heterogeneity, as in z-curve. |
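The claim in the last row is an identity rather than an interpretation, and a toy simulation makes it visible. To be clear, this is not z-curve itself (z-curve has to estimate the power distribution from observed p-values); here the true noncentralities are simply assumed known, to show that the mean true power of studies selected for significance equals the expected success rate of exact replications:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=2)
alpha = 0.05
z_crit = norm.ppf(1 - alpha)              # one-sided test for simplicity

# A heterogeneous literature: each study has its own true noncentrality
ncp = rng.uniform(0.0, 3.0, size=200_000)
z_orig = rng.normal(ncp, 1.0)
sig = z_orig > z_crit                     # selection for significance

# Mean TRUE power of the selected (significant) studies ...
mean_power_selected = (1 - norm.cdf(z_crit - ncp[sig])).mean()

# ... matches the success rate of exact replications of those studies
z_rep = rng.normal(ncp[sig], 1.0)
replication_rate = (z_rep > z_crit).mean()

print(round(mean_power_selected, 3), round(replication_rate, 3))  # nearly equal
```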