Summary: Statistical Inference, Publication Bias, and Meta-Analysis—Key Themes and Methodological Tensions
This summary outlines central themes in ongoing discussions about the limitations of null hypothesis significance testing (NHST), the role of publication bias, and the use of statistical tools—especially meta-analytic models—to evaluate the credibility of scientific findings.
1. The Limits of NHST and Single-Study Inference
Many scholars, including Gelman and collaborators, have criticized the overreliance on NHST in single studies. Key concerns include:
- The implausibility of exact null effects in many domains;
- The tendency for NHST to produce overconfident claims from noisy data;
- The misuse of p-values as binary decision rules for significance.
Despite these limitations, NHST is sometimes defended in exploratory contexts, particularly when directional conclusions (e.g., whether an effect is positive or negative) are all that can reasonably be inferred. Under this logic, the Type S (sign) error risk can be bounded by α/2, making NHST a conservative tool for directional claims.
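A small simulation can illustrate the α/2 bound: with a two-sided test, the probability of a significant result in the wrong direction cannot exceed α/2. This is only a sketch; the one-sample z-test setup, the true effect of 0.05 SD, and the sample sizes below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, n, n_sims = 0.05, 50, 200_000
true_effect = 0.05   # small positive true effect in SD units (illustrative assumption)

# One-sample z-tests: each estimate is the mean of n draws with unit SD,
# so the standard error is 1 / sqrt(n)
estimates = rng.normal(true_effect, 1 / np.sqrt(n), size=n_sims)
z = estimates * np.sqrt(n)

significant = np.abs(z) > 1.96                     # two-sided test at alpha = .05
wrong_sign = significant & (np.sign(estimates) != np.sign(true_effect))

# Rate of significant results with the wrong sign stays below alpha/2 = .025
print("P(significant and wrong sign):", wrong_sign.mean())
```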
2. The Role of Replication and Meta-Analysis
Researchers studying replicable phenomena often turn to meta-analysis to overcome the noise and uncertainty of individual studies. Meta-analytic techniques allow for more precise estimation of effect sizes and assessment of robustness across studies.
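As a reminder of why pooling helps, here is a minimal sketch of fixed-effect inverse-variance pooling; the effect sizes and standard errors are hypothetical numbers chosen only to illustrate the arithmetic, not data from any study.

```python
import numpy as np

# Hypothetical study-level effect sizes (e.g., standardized mean differences) and standard errors
effects = np.array([0.42, 0.25, 0.61, 0.10, 0.33])
ses     = np.array([0.20, 0.15, 0.30, 0.12, 0.18])

weights   = 1 / ses**2                                  # inverse-variance weights
pooled    = np.sum(weights * effects) / np.sum(weights) # fixed-effect pooled estimate
pooled_se = np.sqrt(1 / np.sum(weights))                # its standard error

print(f"pooled estimate = {pooled:.3f}, SE = {pooled_se:.3f}")
print(f"95% CI = [{pooled - 1.96*pooled_se:.3f}, {pooled + 1.96*pooled_se:.3f}]")
```

The pooled standard error is smaller than any single study's, which is the precision gain meta-analysis promises.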
However, meta-analysis itself is vulnerable to distortion when the underlying literature is affected by publication bias—the selective reporting of statistically significant results.
3. Publication Bias and Its Detection
Bias in the scientific literature can lead to inflated effect sizes and distorted conclusions. Various tools have been developed to detect and correct for this bias, including:
- Selection models (e.g., Hedges, Vevea, Iyengar), which model the probability that a study is published as a function of its p-value or effect size;
- Regression-based approaches (e.g., Egger's regression), which test for an association between effect sizes and their standard errors or sampling variances (see the sketch below);
- Power-based methods, such as z-curve, which use the distribution of test statistics to estimate the expected discovery rate (EDR) and the false discovery risk (FDR).
These methods differ in assumptions, strengths, and sensitivity to heterogeneity.
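To make the regression-based idea concrete, here is a minimal sketch in the spirit of Egger's regression. The effect sizes and standard errors are hypothetical values invented for illustration, and the snippet only sketches the logic rather than replacing a dedicated meta-analysis package.

```python
import numpy as np
from scipy import stats

# Hypothetical study-level effect sizes and standard errors (made up for illustration)
effects = np.array([0.55, 0.48, 0.30, 0.62, 0.20, 0.41])
ses     = np.array([0.25, 0.22, 0.10, 0.30, 0.08, 0.18])

# Egger-style regression: standardized effect (effect / SE) on precision (1 / SE).
# An intercept clearly different from zero suggests funnel-plot asymmetry,
# i.e., small-study effects consistent with publication bias.
res = stats.linregress(1 / ses, effects / ses)
print(f"intercept = {res.intercept:.2f} (SE = {res.intercept_stderr:.2f})")
print(f"slope = {res.slope:.2f}  # approximates the underlying effect under this model")
```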
4. Power-Based Methods and the False Discovery Risk
Z-curve estimates the expected discovery rate (i.e., how often a set of studies should be significant based on their power) and compares this to the observed discovery rate to estimate the degree of selection bias. This enables:
- Estimation of the false discovery risk (FDR);
- Estimation of the sign error risk (SER), typically approximated as FDR/2 under the assumption that the null is never strictly true but many effects are close to zero.
Such approaches offer an alternative to selection models and are particularly useful in exploring the credibility of broad literatures.
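The arithmetic linking the EDR to a maximum FDR is short. The sketch below uses Soric's (1989) upper bound, which z-curve-style analyses commonly adopt; the EDR of 30% is an illustrative assumption, not an estimate from any particular literature.

```python
def soric_max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the false discovery rate given a discovery rate:
    FDR_max = (1/EDR - 1) * alpha / (1 - alpha)."""
    return (1 / edr - 1) * alpha / (1 - alpha)

edr = 0.30                      # illustrative expected discovery rate at alpha = .05
fdr_max = soric_max_fdr(edr)
print(f"maximum FDR = {fdr_max:.3f}")                      # ~0.123
print(f"maximum sign error risk ~ FDR/2 = {fdr_max/2:.3f}")  # under the SER approximation above
```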
5. Concerns About Heterogeneity and Interpretation
One common objection, raised for example by Gelman, is that some of these methods (e.g., z-curve or tests based on FDR) may be misapplied when studies are highly heterogeneous or only weakly related. In such cases, the meaning of aggregate statistics like FDR or average power becomes harder to interpret.
Nevertheless, even in heterogeneous collections, knowing that the overall false discovery risk is low (e.g., <5%) can still support the general claim that most statistically significant findings likely reflect true directional effects.
6. Meta-Analytic Practice and Inconsistency in Application
A point of methodological tension arises when researchers who have written extensively about publication bias and model-based corrections (including Gelman) sometimes omit such adjustments in their own empirical work. For example, in a recent paper analyzing Cochrane reviews, Gelman and collaborators used a multilevel model to estimate effect distributions but did not include bias detection or correction procedures.
While Cochrane reviews are widely regarded as high-quality, some argue that the absence of publication bias testing is inconsistent with earlier methodological positions emphasizing its importance.
Conclusion
The methodological landscape for evaluating research credibility continues to evolve. There is broad agreement on the limitations of NHST and the need for replication and meta-analysis. However, the best methods for addressing publication bias—whether via selection models, power-based approaches, or other tools—remain subject to debate.
In practice, transparency about assumptions, robustness checks across models, and clarity about the limits of inference are essential for meaningful meta-scientific progress.
I didn’t get an answer to the title. Why is Gelman Wrong About Null Hypothesis Testing?
Well, to get an answer you have to start with an objection to it. The post goes through his usual complaints and shows why they do not hold. For example, "the null of NHST is never true" is not a valid objection, because alpha also controls the sign errors he talks about so much. NHST is actually more conservative than a sign test that assumes effect sizes are never zero (PSI, anybody?).
I think you are misstating the claim. The objection is that the null hypothesis is a point hypothesis, so you are testing a hypothesis that is known to be false. One is rarely interested in whether a correlation or a mean difference is exactly zero in the population. We want to know if the parameter is meaningfully different from zero. Besides, testing this hypothesis doesn’t address the question you want answered. We want to know the probability of the hypothesis given the data, not the probability of the data given the hypothesis.
1. Nobody cares about the nil-hypothesis and so many say rejecting it is silly, especially when it is unlikely to be true.
I have heard this so many times, but many people have not heard the reply. The nil-hypothesis is interesting because it separates positive and negative effects. When we reject it, we do so in favor of a directional effect. We do not know the direction of effects in advance, so we are learning something.
2. The next argument is that we can make sign errors, but those are covered by the Type I error: the sign error probability is alpha/2. So that is not a solid argument either.
3. And finally, we also do not care about the exact p-value because it is meaningless. So criticizing it for expressing the wrong probability is also silly. We use the p-value only to control a specified error rate with alpha: p < alpha is important; p = .0002 is not, for inferences based on a single study.
4. Bayesians promise that they can do more with the uncertainty in small samples by using priors. The problem is that all conclusions are contingent on how good the priors are.
5. Pick your poison, but don’t tell others that your poison is better than theirs.
That’s only true if we are using a one-tailed test. But most people use two-tailed tests, so finding a significant positive result doesn’t rule out a negative effect. I understand that a two-sided test with alpha = .05 is equivalent to a one-tailed test with alpha = .025. But what if the results come out in the direction opposite to the hypothesized one? Many cheat and revert to a two-tailed test. The proper thing is to conduct another study, which is why one-tailed tests are not generally recommended.
Let me educate you, if you are willing to learn.
A two-tailed test with alpha = .05 is really just two one-tailed tests in both directions with alpha = .025.
So, if you agree with me for one-tailed tests, you also agree with me for two-tailed tests; you just didn’t realize it because the logic of two-sided testing is often not explained well. This is especially problematic when Bayesians start using two-sided tests to provide evidence FOR the nil-hypothesis. But let us ignore this insanity.
If I tested one-tailed with alpha = .05, I would have to do another study or make claims about the other tail at a combined alpha of .10. That is the reason why we test one-tailed with alpha = .025 in each direction, so that we can make claims in both directions with alpha = .05. 2 * .025 = .05 🙂
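The critical-value arithmetic behind this equivalence is easy to check; here is a minimal sketch with a standard normal test statistic (the choice of a z-test is just for illustration):

```python
from scipy.stats import norm

alpha = 0.05

# The two-tailed critical value at alpha = .05 ...
two_tailed_crit = norm.ppf(1 - alpha / 2)   # 1.96
# ... equals the one-tailed critical value at alpha = .025
one_tailed_crit = norm.ppf(1 - 0.025)       # 1.96
print(two_tailed_crit, one_tailed_crit)

# Under the null, each tail ("significant and positive", "significant and negative")
# has probability alpha/2 = .025
print(1 - norm.cdf(two_tailed_crit), norm.cdf(-two_tailed_crit))
```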
I don’t agree that you can change the rules of the game after the game has been played. You have to set the alpha and number of tails before collecting data. You can’t change anything afterwards because you don’t like the results.
That is why I start two-tailed, with alpha = .025 in each tail, which is the normal approach. However, sequential testing also allows you to look at the data again if you adjust your error rate. The error rate doesn’t care about your prior or post-hoc predictions. It only cares about the long-run probability given alpha.
Another way to say what Ulrich is saying is that a two-tailed test is just two one-tailed tests with a Bonferroni correction (for a case where the two conclusions “parameter > H0” and “parameter < H0” are really completely disjoint).
Nicely said.