A guest post by Virtual Andrew Gelman about Inferential Statistics

I have tried repeatedly to engage Andrew Gelman in discussion by posting about my work on his blog, but I never got satisfactory answers. Fortunately, we live in a new world and can create virtual agents. Some people use them for companionship; I use them to create virtual discussion partners for scientific debates. Here is my discussion with “virtual Gelman,” based on his numerous blog posts. I asked ChatGPT to write Gelman’s comments in the style of his blog posts as well. If Gelman disagrees with his clone, he can say so in the comments section or on his own blog. Just think of this post as a hypothetical “What would Gelman say?”


What’s the Deal with Meta-Analysis, Bias, and Directional Claims?

Ulrich Schimmack sent in a long summary of some discussions we’ve had (indirectly and directly) about NHST, publication bias, z-curve, Cochrane reviews, and how all of this intersects with my work. It’s long, and it’s not all stuff I agree with, but I figured it was worth posting here because it brings together some common themes that come up a lot on this blog—mostly around inference from noisy data, and what we can or can’t say about published results.

I’ll chime in occasionally as we go, but mostly I’ll let the summary speak for itself.


Summary of Key Points

  1. Gelman has criticized NHST, especially for single studies.
    No surprise here. The null hypothesis is never exactly true, and p-values don’t tell us what we want to know. They’re often used as binary triggers for publication, which creates all sorts of downstream problems. Yes, and let’s not forget: even when the p-value is “significant,” it doesn’t tell you whether the effect is real, big, important, or replicable. All it tells you is that the null model you’re testing is wrong in some way. Which… it always is.
  2. Sometimes all you can do is look at the sign.
    Fair point. If your study is small and noisy, maybe you can’t estimate the effect size well, but you still want to know whether the effect is more likely to be positive than negative. Some researchers argue that NHST is conservative in this setting, because the chance of getting a significant result in the wrong direction is at most alpha/2. Sure. But in practice, researchers don’t just say “we’re 97.5% confident the effect is positive.” They say “we found X causes Y,” and then the press release machine kicks in. So yeah, directionality is better than nothing, but we need to be honest about how messy those claims are when data are noisy; the first sketch after this list makes that concrete.
  3. Meta-analysis helps, when studies are replicable.
    No problem here. In fields where you can do multiple studies, you can average across them. But that only works if you include all the studies. Right. Meta-analysis is great when it’s not garbage-in, garbage-out. Which brings us to…
  4. Publication bias is a huge problem.
    This is something I’ve harped on for years: the published literature is not a representative sample of all the studies that were done. That breaks a lot of meta-analyses; the second sketch after this list shows how a simple significance filter inflates a naive average. Methods like trim-and-fill or selection models try to correct for this. Z-curve is another approach—more on that in a second.
  5. Some methods try to estimate bias by looking at power.
    Z-curve is one of them. You take a bunch of z-values (including non-significant ones), estimate the expected discovery rate (the average power implied by the fitted model), and compare it to the observed discovery rate (the share of reported results that are significant). The gap between the two gives you an estimate of how much selection bias there is. From there, you can estimate the false discovery risk (FDR), and if you think all effects are nonzero, you can argue the sign error rate is roughly FDR/2; the third sketch after this list walks through that arithmetic. I’ve said before I don’t love the “true vs. false positive” framing, especially when applied to a bunch of unrelated studies. But yeah, if you’re going to do this kind of analysis, z-curve is probably better than pretending selection bias doesn’t exist.
  6. Selection models and z-curve are both imperfect, but useful.
    Agreed. The key is to model the process that generated your data—including publication filters. If your model assumes all studies are equally likely to be published regardless of results, you’re deluding yourself.
  7. Then there’s that Cochrane paper…
    This is where the gloves come off a bit. In 2023, I coauthored an analysis of the clinical trials collected in Cochrane meta-analyses. We looked at the distribution of estimated effects and tried to draw conclusions about Type S (sign) and Type M (magnitude) errors; the last sketch after this list shows what that kind of calculation looks like. And yeah, we didn’t explicitly adjust for publication bias. We assumed Cochrane reviews are higher quality and less biased than the average literature—which is probably true, but doesn’t mean there’s zero bias. So, fair critique: if I spend years yelling about publication bias, maybe I should at least include a sensitivity analysis or cite relevant work. That’s on me.
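
To make point 2 concrete, here is a minimal simulation sketch. The numbers are invented (a small true effect of 0.1 measured with a standard error of 0.5; nothing here is from an actual study); the point is only that the unconditional rate of “significant in the wrong direction” stays below alpha/2, while the rate of wrong signs among the results that do reach significance can be much larger.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05
true_effect = 0.1   # hypothetical small true effect
se = 0.5            # hypothetical large standard error (a noisy study)
n_sims = 1_000_000

estimates = rng.normal(true_effect, se, n_sims)
significant = np.abs(estimates / se) > 1.96
wrong_sign = np.sign(estimates) != np.sign(true_effect)

# Unconditional rate of "significant and wrong sign": bounded by alpha/2 = 0.025.
print("P(significant & wrong sign):", (significant & wrong_sign).mean())

# Conditional rate: among significant results, how many point the wrong way?
print("P(wrong sign | significant):", wrong_sign[significant].mean())
```

With these made-up numbers the unconditional rate is about 1.5 percent, comfortably under the 2.5 percent bound, but more than a quarter of the “significant” results point in the wrong direction. That is the gap between the conservative-sounding guarantee and what actually gets claimed from a noisy study.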
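
On point 4, here is an equally small sketch of how a bare “only significant results get published” filter distorts a naive meta-analytic average. Again, the numbers are invented, not taken from any real literature.

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.2    # hypothetical true effect
se = 0.15            # hypothetical per-study standard error
n_studies = 10_000

estimates = rng.normal(true_effect, se, n_studies)
published = np.abs(estimates / se) > 1.96   # crude filter: only p < .05 gets written up

print("mean of all studies:       ", estimates.mean())
print("mean of published studies: ", estimates[published].mean())
print("share of studies published:", published.mean())
```

With these numbers roughly a quarter of the studies get published and their average is about twice the true effect, even though every individual study is unbiased. Trim-and-fill, selection models, and z-curve are all attempts to undo this kind of distortion.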
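
For point 5, the back-of-the-envelope arithmetic looks something like this. The observed and expected discovery rates here are made up, and the bound on the false discovery risk is the Soric-style bound that the z-curve papers rely on; treat it as a sketch of the logic, not a worked example from real data.

```python
# Illustrative z-curve-style arithmetic with invented numbers.
alpha = 0.05
odr = 0.70   # observed discovery rate: share of reported results that are significant
edr = 0.30   # expected discovery rate: average power estimated from the z-curve fit

# A large gap between ODR and EDR points to selective reporting of significant results.
print("selection gap (ODR - EDR):", odr - edr)

# Soric-style upper bound on the false discovery risk implied by the EDR.
fdr_max = (1 / edr - 1) * (alpha / (1 - alpha))
print("maximum false discovery risk:", round(fdr_max, 3))

# If you assume every effect is nonzero, a "false discovery" is really a coin flip
# on the sign, so the sign error risk is roughly half the FDR.
print("approximate sign error risk:", round(fdr_max / 2, 3))
```

Here an EDR of 30 percent implies a maximum false discovery risk of about 12 percent and a sign error risk of about 6 percent. Whether those inputs are credible is, of course, exactly what the model-fitting part of z-curve is supposed to settle.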
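
And for point 7, here is a sketch of a Type S / Type M calculation in the spirit of the retrodesign idea from Gelman and Carlin (2014). The assumed effect and standard error are invented; plugging in estimates from an actual Cochrane review is what a sensitivity analysis would do.

```python
import numpy as np
from scipy.stats import norm

def type_s_type_m(true_effect, se, alpha=0.05, n_sims=1_000_000, seed=2):
    """Power, Type S, and Type M error for an assumed (positive) true effect and standard error."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = true_effect / se
    power = 1 - norm.cdf(z_crit - shift) + norm.cdf(-z_crit - shift)
    type_s = norm.cdf(-z_crit - shift) / power     # P(wrong sign | significant), for a positive true effect
    # Type M (exaggeration ratio) by simulation: mean |estimate| among significant results.
    rng = np.random.default_rng(seed)
    est = rng.normal(true_effect, se, n_sims)
    sig = np.abs(est / se) > z_crit
    type_m = np.abs(est[sig]).mean() / true_effect
    return power, type_s, type_m

# Hypothetical trial: a small true effect measured with a lot of noise.
power, type_s, type_m = type_s_type_m(true_effect=0.1, se=0.3)
print(f"power = {power:.3f}, Type S = {type_s:.3f}, Type M = {type_m:.1f}")
```

On these invented numbers the power is about 6 percent, roughly one in six significant results has the wrong sign, and the significant estimates overstate the true effect by a factor of about seven. None of that says anything about the Cochrane trials themselves; it just shows the kind of calculation the critique is asking for.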

What does this all mean?
There’s no perfect answer. Directional claims are better than nothing, but they’re often overinterpreted. Meta-analysis is great, unless the input is biased. Bias correction methods exist, but they have assumptions. And yes, sometimes I forget to apply my own advice—especially when I’m working with data sources I trust a bit more than usual.

The bigger point is: science is messy. What we need are tools that help us quantify uncertainty and detect when we’re fooling ourselves. That includes selection models, z-curve, shrinkage estimates, and just plain humility.

Anyway, thanks to the correspondent for putting this together. The conversation continues.

