Replicability Index Encyclopedia: Caliper Test
Caliper Test of Publication Bias
The caliper test is a statistical method, introduced by Gerber and Malhotra (2008a, 2008b), for detecting publication bias. It tests whether the distribution of test statistics is continuous and approximately locally symmetric around a significance threshold, typically z = 1.96, corresponding to a two-sided p = .05. The key assumption is that, in the absence of publication bias or p-hacking, the expected density of z-scores in a narrow band just above the threshold should be approximately equal to the expected density just below it. A significant excess of results just above the threshold suggests that researchers or publication processes have shifted results across the boundary, either through selective reporting or analytical flexibility.
Procedure
Published two-sided p-values are converted to z-scores (z = Φ⁻¹(1 − p/2)). A caliper of width w is placed symmetrically around the threshold, creating two bins: one from 1.96 to 1.96 + w (just significant) and one from 1.96 − w to 1.96 (just nonsignificant). Under the null hypothesis of no bias, a result that falls within the caliper is equally likely to land in either bin, so the test is conducted as a one-sided binomial test with success probability 0.50. Gerber and Malhotra (2008a) recommended caliper widths of 5%, 10%, 15%, and 20% of the threshold value. A 10% caliper around z = 1.96, for example, compares counts in the intervals [1.764, 1.96) and [1.96, 2.156).
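A minimal sketch of this procedure in Python (the function name and SciPy usage are illustrative, not the code used in the original analyses):

```python
import numpy as np
from scipy import stats

def caliper_test(p_values, threshold=1.96, caliper=0.10):
    """One-sided binomial caliper test on two-sided p-values.

    The caliper is expressed as a fraction of the threshold,
    following Gerber and Malhotra (2008a).
    """
    z = stats.norm.ppf(1 - np.asarray(p_values) / 2)  # p -> |z|
    w = caliper * threshold                           # caliper width
    over = int(np.sum((z >= threshold) & (z < threshold + w)))   # just significant
    under = int(np.sum((z >= threshold - w) & (z < threshold)))  # just nonsignificant
    # Under the null of no bias, each result in the caliper is equally
    # likely to fall on either side: Binomial(over + under, 0.5).
    test = stats.binomtest(over, over + under, p=0.5, alternative='greater')
    return over, under, test.pvalue
```

With the default 10% caliper, `over` counts z-values in [1.96, 2.156) and `under` counts z-values in [1.764, 1.96), matching the intervals above.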
Applications
Gerber and Malhotra applied the caliper test to leading political science journals (APSR, AJPS) and sociology journals (ASR, AJS) and found strong evidence of publication bias (Gerber & Malhotra, 2008a; Gerber & Malhotra, 2008b). The test was subsequently adopted in economics, most notably by Brodeur, Lé, Sangnier, and Zylberberg (2016) and Brodeur, Cook, and Heyes (2020), who documented significant bunching of test statistics just above conventional thresholds across top economics journals. Berning and Weiß (2016) applied the caliper test to German social science journals, again finding evidence of bias. The test has become a standard tool in the meta-science toolkit for discipline-wide assessments of publication practices.
Strengths
The caliper test has several practical advantages. The logic is intuitive and easy to communicate. It requires only test statistics or p-values, not standardized effect sizes, making it applicable to heterogeneous literatures where effect-size metrics vary across studies and designs. For discipline-wide analyses where studies address different research questions with different effects, the caliper test avoids the strong assumptions about comparability or homogeneity required by many other methods.
Limitations
The caliper test’s local-symmetry assumption is exact for normally distributed z-values only when the noncentrality parameter equals the critical value. For the conventional threshold z = 1.96, this corresponds to a study with approximately 50% power. If power is lower, the expected distribution slopes downward across the threshold, producing more just-nonsignificant than just-significant results. If power is higher, the distribution slopes upward across the threshold, producing more just-significant than just-nonsignificant results even in the absence of publication bias. Thus, deviations from caliper symmetry can reflect the power distribution of studies rather than selective publication or p-hacking.
This vulnerability grows with the width of the caliper. With a negative slope near the threshold, as in low-powered settings, the excess of just-nonsignificant results works against the test and reduces its power to detect publication bias. With a positive slope near the threshold, as in high-powered settings, there are more observations in the interval above the criterion value than below it even without bias. The caliper test can therefore falsely indicate publication bias when the literature has high power or when the mixture distribution slopes upward around the significance threshold, and it is unclear whether positive caliper-test results in some applications reflect bias or the expected shape of the z-value distribution.
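The direction of the slope can be computed directly from the normal model. A short sketch, assuming z-values are normal with unit variance around a noncentrality parameter (the three ncp values are illustrative):

```python
from scipy import stats

threshold = 1.96
w = 0.10 * threshold  # 10% caliper

# Expected probability mass just below vs. just above the threshold
# when z ~ N(ncp, 1); ncp = 1.96 corresponds to roughly 50% power.
for ncp, label in [(1.0, "low power"), (1.96, "~50% power"), (2.8, "high power")]:
    below = stats.norm.cdf(threshold, loc=ncp) - stats.norm.cdf(threshold - w, loc=ncp)
    above = stats.norm.cdf(threshold + w, loc=ncp) - stats.norm.cdf(threshold, loc=ncp)
    share = above / (below + above)  # expected share of just-significant results
    print(f"{label:>11}: P(just-significant | in caliper) = {share:.3f}")
```

With ncp = 1.0 the expected just-significant share drops to roughly .45, and with ncp = 2.8 it rises to roughly .54, even though nothing has been selected or p-hacked.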
Schneck (2017) conducted a Monte Carlo simulation comparing the caliper test to Egger’s test, p-uniform, and the test for excess significance (TES). He found that the 5% caliper maintained acceptable false-positive rates but had low power with fewer than 1,000 studies. The 10% and 15% calipers showed inflated false-positive rates when the number of studies (K) was large, because wider calipers span a larger portion of the density curve, where the local-uniformity assumption can break down. Schneck recommended the 5% caliper for discipline-wide analyses with large K. A small caliper does not, however, solve the problem of truly asymmetric distributions: with large K, even small departures from local symmetry are estimated precisely, and the caliper test can become significant even when there is no publication bias.
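A simplified Monte Carlo illustrating this large-K failure mode (a sketch with illustrative parameters, not a reproduction of Schneck's design):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def caliper_rejects(z, threshold=1.96, caliper=0.10, alpha=0.05):
    """True if the one-sided caliper test flags 'bias' in the z-values."""
    w = caliper * threshold
    over = int(np.sum((z >= threshold) & (z < threshold + w)))
    under = int(np.sum((z >= threshold - w) & (z < threshold)))
    if over + under == 0:
        return False
    p = stats.binomtest(over, over + under, alternative='greater').pvalue
    return p < alpha

# No selection and no p-hacking: every simulated z-value is 'published'.
# With high power (ncp = 2.8, about 80%), the density slopes upward
# through 1.96, so false positives accumulate as K grows.
reps = 500
for k in (500, 5_000, 50_000):
    hits = sum(caliper_rejects(rng.normal(2.8, 1.0, size=k)) for _ in range(reps))
    print(f"K = {k:>6}: false-positive rate = {hits / reps:.2f}")
```

In this setup the rejection rate climbs far above the nominal 5% as K increases, even though the data-generating process involves no selection at all.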
Simulation studies using z-curve’s heterogeneous effect-size framework reveal the problem more starkly. In a simulation with high average power, fewer than 200 studies, and no bias, the caliper test detected bias 100% of the time. Thus, a significant caliper-test result should not be interpreted as evidence of publication bias without inspecting the expected or observed shape of the z-value distribution.
This is not merely a calibration problem that can be fixed by adjusting the significance level or caliper width. Narrower calipers can reduce curvature-induced artifacts, but they cannot remove the conceptual mismatch between the test’s assumption of local symmetry and the actual distribution of z-values when the density slopes across the threshold.
This limitation is not shared by all bias-detection methods. Methods that model the full distribution of z-scores, such as z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022), can estimate the expected shape of the z-value distribution under heterogeneous power and selection. The advantage of the caliper test is that it can have high power to detect threshold-related discontinuities in some conditions. Its disadvantage is that it can also provide false evidence of bias when the expected distribution is asymmetric. Therefore, the caliper test should be used together with a plot of the z-value distribution. A positive slope in the distribution of significant z-values is a red flag because it violates the local-symmetry assumption of the caliper test.
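A minimal sketch of such a diagnostic plot, assuming matplotlib is available (bin width and axis limits are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_z_distribution(z, threshold=1.96, z_max=6.0, bin_width=0.1):
    """Histogram of z-values with the significance threshold marked.

    An upward slope to the right of the threshold is a warning that
    the caliper test's local-symmetry assumption may be violated.
    """
    bins = np.arange(0.0, z_max + bin_width, bin_width)
    plt.hist(np.clip(z, 0.0, z_max), bins=bins, edgecolor="white")
    plt.axvline(threshold, color="red", linestyle="--", label="z = 1.96")
    plt.xlabel("z-value")
    plt.ylabel("count")
    plt.legend()
    plt.show()
```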
Summary
The caliper test is a simple, widely used tool for detecting threshold-related publication bias in large literatures. It is most reliable when the expected distribution of test statistics is approximately locally symmetric around the significance threshold in the absence of bias. In literatures where the z-value distribution slopes across the threshold — whether because of high power, low power, or heterogeneous true effects — the test can mistake the expected shape of the distribution for evidence of selective publication or p-hacking. This problem is especially relevant in discipline-wide analyses in the social sciences, where studies often address different hypotheses, use different designs, and have heterogeneous statistical power. Researchers using the caliper test in such settings should interpret positive results with caution and consider model-based alternatives that account for the expected shape of the z-score distribution.
References
Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology, 6, MP.2020.2720.
Berning, C. C., & Weiß, B. (2016). Publication bias in the German social sciences: An application of the caliper test to three top-tier German social science journals. Quality & Quantity, 50, 901–917.
Brodeur, A., Cook, N., & Heyes, A. (2020). Methods matter: p-hacking and publication bias in causal analysis in economics. American Economic Review, 110(11), 3634–3660.
Brodeur, A., Lé, M., Sangnier, M., & Zylberberg, Y. (2016). Star Wars: The empirics strike back. American Economic Journal: Applied Economics, 8(1), 1–32.
Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874.
Gerber, A. S., & Malhotra, N. (2008a). Do statistical reporting standards affect what is published? Publication bias in two leading political science journals. Quarterly Journal of Political Science, 3(3), 313–326.
Gerber, A. S., & Malhotra, N. (2008b). Publication bias in empirical sociological research: Do arbitrary significance levels distort published results? Sociological Methods & Research, 37(1), 3–30.
Gerber, A. S., Malhotra, N., Dowling, C. M., & Doherty, D. (2010). Publication bias in two political behavior literatures. American Politics Research, 38(4), 591–613.
Schneck, A. (2017). Examining publication bias — a simulation-based evaluation of statistical tests on publication bias. PeerJ, 5, e4115.