The Irony of Testing for Bias with the Same Tools That Created It

Meta-psychology was born from a simple observation: the way psychologists used significance testing created a distorted literature. Researchers treated p < .05 as a license to publish and p > .05 as a reason to abandon a finding. Journals rewarded significant results, reviewers demanded them, and authors learned to find them. The result was predictable: literatures stuffed with too many significant results, exaggerated effect sizes, and too few honest failures (Sterling, 1959; Sterling et al., 1995).

This critique is now familiar. Null-hypothesis significance testing, reduced to a dichotomous decision rule, encourages bad scientific behavior. It turns evidence into a yes-or-no ritual, treats p = .049 as a discovery and p = .051 as a non-event, and rewards selective reporting. Meta-psychologists have made this point repeatedly, and largely correctly.

But there is an irony. Having criticized psychologists for using a dichotomous significance test to decide which original findings count, meta-psychologists often reach for the same logic to decide whether a literature is biased.

The original sin was this: p < .05 means the effect is real.

The meta-analytic version becomes this: p < .05 means publication bias is present.

The form of the reasoning has not changed. Only the target has moved up one level.

Why this fails is clearest if we ask what a significance test can ever legitimately buy us. The most charitable defense of significance testing is that a significant result may carry information about the sign of an effect: it can tell us which direction is more plausible, even when it says little about magnitude (Jones & Tukey, 2000). That defense collapses for publication bias, because the sign is known before we collect a single study. Selective reporting favors significant results; it does not run the other way. A test whose only defensible output is a direction we already know contributes little.

What it contributes instead is a verdict that is uninformative in both directions. A significant bias test conflates magnitude with detectability: in a large literature, a trivial and harmless amount of selection can reject the null. A nonsignificant bias test conflates small bias with low power: in a small literature, severe selection can easily fail to reach significance. Either way, the binary outcome tells us little about the quantity we care about. “There is publication bias, p < .05” is, to borrow Cohen’s (1994) famous example, about as useful as “The earth is round, p < .05.” And “There was no evidence of publication bias, p > .05” is akin to “The earth is flat, p > .05.”
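To make the two failure modes concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the true probability of a significant result (the average power) is known to be 40%; in practice that quantity has to be estimated. The study counts and the function name are hypothetical, and the exact one-sided binomial test simply stands in for whatever bias test a meta-analyst might run.

```python
from math import exp, lgamma, log


def excess_p_value(n_sig, n_studies, true_rate):
    """Exact one-sided binomial p-value: P(X >= n_sig) when each of
    n_studies is significant with probability true_rate (0 < true_rate < 1)."""
    def log_pmf(i):
        # Log of the binomial probability mass, computed in log space
        # so that large study counts do not overflow.
        return (lgamma(n_studies + 1) - lgamma(i + 1) - lgamma(n_studies - i + 1)
                + i * log(true_rate) + (n_studies - i) * log(1 - true_rate))
    return sum(exp(log_pmf(i)) for i in range(n_sig, n_studies + 1))


ASSUMED_POWER = 0.40  # assumption for illustration, not an estimate from real data

# Large literature, small excess: 44% significant instead of 40%.
p_large = excess_p_value(440, 1000, ASSUMED_POWER)

# Small literature, large excess: 80% significant instead of 40%.
p_small = excess_p_value(4, 5, ASSUMED_POWER)

print(f"1000 studies, excess of  4 points: p = {p_large:.4f}")
print(f"   5 studies, excess of 40 points: p = {p_small:.4f}")
```

Under these assumed numbers, the four-point excess in 1,000 studies is significant at .05 while the forty-point excess in five studies is not. The verdict tracks the number of studies, not the amount of selection.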

The deeper irony is that meta-psychologists have relocated the mistake they diagnosed. Original researchers treated significance as a discovery machine. Bias researchers sometimes treat significance as a bias-detection machine (Siegel et al., 2021). The error is identical: a difficult inferential problem is compressed into a binary decision.

Some have pushed the argument further, claiming that tests for publication bias are useless (e.g., Simonsohn, 2014). But the folly of nil-hypothesis testing, which incidentally undermines p-curve as much as it undermines many other significance-based methods, is not a reason to ignore publication bias. We do not abandon original research because the significance ritual is empty (Cohen, 1994). We replace the ritual with something more informative.

The reform for original research was to report effect-size estimates with confidence intervals that express uncertainty. The reform for bias detection should be the same. The goal is not to decide whether bias exists, but to estimate how much is present, how uncertain that estimate is, and whether the amount of bias consistent with the data changes the substantive conclusion.

Some publication-bias methods already estimate quantities of this kind, or carry the information needed to. Yet in practice that information is discarded, and the result is reduced to whether a test was significant or a method “detected bias” (Siegel et al., 2021). And no common metric for the amount of bias has been widely adopted.

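The most natural metric is the excess of significant results: the gap between the percentage of significant results a literature reports and the percentage its studies could be expected to produce. If a literature reports significant findings 80% of the time but the true probability of producing significant results is between 20% and 40%, the excess is 40 to 60 percentage points, which is clear evidence of substantial bias.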
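The arithmetic behind this example can be sketched in a few lines of Python. The 80% and the 20% to 40% range come from the example above; the count of 100 studies is a hypothetical number added so that the observed rate has some sampling uncertainty. The Wilson score interval is one standard choice for a proportion, and combining its limits with the limits of the expected range is a deliberately conservative shortcut, not a formal confidence interval for the excess.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(n_sig, n_studies, level=0.95):
    """Wilson score interval for the proportion n_sig / n_studies."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = n_sig / n_studies
    denom = 1 + z**2 / n_studies
    center = (p_hat + z**2 / (2 * n_studies)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n_studies
                    + z**2 / (4 * n_studies**2)) / denom
    return center - half, center + half


n_studies = 100                          # hypothetical number of studies
n_sig = 80                               # 80% of them report significance
expected_low, expected_high = 0.20, 0.40  # plausible range for the true rate

observed = n_sig / n_studies
obs_low, obs_high = wilson_interval(n_sig, n_studies)

# Excess of significant results: observed rate minus expected rate.
excess_point = (observed - expected_high, observed - expected_low)
# Conservative bounds: pair the most extreme limits of both ranges.
excess_bounds = (obs_low - expected_high, obs_high - expected_low)

print(f"observed rate: {observed:.2f} ({obs_low:.2f} to {obs_high:.2f})")
print(f"excess, point range: {excess_point[0]:.2f} to {excess_point[1]:.2f}")
print(f"excess, conservative bounds: {excess_bounds[0]:.2f} to {excess_bounds[1]:.2f}")
```

The output is a range for the amount of bias rather than a verdict about its existence, which is the kind of quantity that can then be weighed against the substantive conclusion.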

This is why the amount matters more than its presence. Bias can be easy to detect yet too small to change any conclusion, or large enough to overturn a conclusion yet impossible to detect in a small set of studies, where these tests have the least power (Renkewitz & Keiner, 2019). A binary test cannot tell these cases apart; an estimate with an interval can.

In short, meta-analysis needs the same methodological reform that original research needed. It is time to abandon the nil-hypothesis ritual and replace it with estimation: estimate the amount of publication bias, quantify the uncertainty with confidence intervals, and evaluate whether conclusions remain credible after adjusting for the plausible levels of selection.

Fortunately, unlike unpublished primary studies hidden in file drawers, the data behind published meta-analyses are often available or recoverable. That makes it possible to reexamine decades of meta-analytic conclusions and ask the question that matters: not whether publication bias can be detected, but whether the amount of bias compatible with the data changes what we should believe.

Replicability-Index

Improving the replicability of empirical research
