The Problem of Transforming Test Statistics into Z-Values for Z-Curve Analysis

One of the distinctive features of z-curve is that it models the distribution of z-scores, even when the original studies report a wide range of test statistics — t, F, χ², correlations, and more. To make this possible, all reported test statistics are converted into two-sided p-values, and then into equivalent z-scores:

z = Φ⁻¹(1 − p/2)

Don’t be scared by the Greek symbol. The formula just tells us to divide the two-sided p-value by 2, subtract that value from 1, and find the z-value corresponding to this cumulative probability (R code: qnorm(1 - p/2)).
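In practice the conversion is a one-liner. Here is a minimal Python sketch mirroring the R call above, using only the standard library (the function name p_to_z is my own label, not part of the z-curve package):

```python
from statistics import NormalDist

def p_to_z(p: float) -> float:
    """Convert a two-sided p-value into the equivalent positive z-score."""
    # Halve the two-sided p, subtract from 1, and invert the standard
    # normal CDF -- the same computation as R's qnorm(1 - p/2).
    return NormalDist().inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 2))  # → 1.96, the familiar two-sided 5% cutoff
```

Because the input p-value is two-sided, the output is always positive, which is exactly the property z-curve relies on.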

Starting with two-sided p-values implies that the z-values are all positive. The sign is ignored because z-curve is designed for heterogeneous sets of studies in which effects may run in different directions. The sign only matters in meta-analyses, where negative effects can be interpreted as effects opposite to the predicted direction.

The Theoretical Concern

This transformation makes a big approximation: it treats every test statistic as if it came from a two-sided z-test with standard error 1. An anonymous reviewer on Replication Index put it bluntly:

“There is no justification that these z-scores will follow a z-distribution … it’s weird to assume that p-values transformed to z-scores might have the standard error of 1.” (Questionable Reviewer Practices: Dishonest Simulations, Replicability-Index)

The criticism is that real test statistics can differ meaningfully from the standard normal distribution, especially in small samples, where t and F distributions have heavier tails. Converting them all to z-values may therefore introduce bias into the estimated power distribution.

What the Developers Say

Jerry Brunner, co-author of z-curve, has acknowledged this is indeed an approximation:

“We pretend that all the tests were actually two-sided z-tests with the results in the predicted direction.”

Simulation studies show that the transformation works well under standard conditions, because the t-distribution converges to the normal distribution as sample size increases. The transformation problem is therefore limited to small samples, but what counts as small?
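The approximation can be made concrete with a short Python sketch (assuming scipy is available; t_to_z is my own illustrative name): take a t statistic with few degrees of freedom, compute its two-sided p-value under the t-distribution, and re-express that p-value as a z-score.

```python
from scipy.stats import t as t_dist, norm

def t_to_z(t_value: float, df: int) -> float:
    """Re-express a t statistic as a z-score via its two-sided p-value."""
    p = 2 * t_dist.sf(abs(t_value), df)  # two-sided p under the t distribution
    return norm.ppf(1 - p / 2)           # equivalent two-sided z-test statistic

# With only 10 degrees of freedom the heavy t tails matter:
t_obs = 2.5
z_equiv = t_to_z(t_obs, df=10)
print(t_obs, round(z_equiv, 2))  # z is noticeably smaller than t
```

The same t statistic corresponds to a smaller z-value precisely because the heavy tails of the t-distribution make a given t less surprising than the same value would be under the normal.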

When It Becomes a Problem

The trouble starts when two things are small at the same time:

  • Per-study sample size (N) is small — especially per-cell N ≲ 20–30.
  • Number of studies (k) is small — ≲ 20–30.

In this low-N, low-k regime, simulation results show that the t-to-z approximation can bias estimates of mean power, typically underestimating it when true power is moderate to high (> .50).

As N increases, the t distribution converges toward the standard normal, and by N ≈ 80 per study the bias is negligible. Likewise, having more studies (k ≥ 100) smooths out the effect of the approximation.
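This convergence can be checked numerically. A sketch, again assuming scipy and reusing the illustrative t-to-z conversion from above, tracks the gap between a fixed t statistic and its z equivalent as the degrees of freedom grow:

```python
from scipy.stats import t as t_dist, norm

def t_to_z(t_value: float, df: int) -> float:
    """Re-express a t statistic as a z-score via its two-sided p-value."""
    p = 2 * t_dist.sf(abs(t_value), df)
    return norm.ppf(1 - p / 2)

# Gap between t = 2.0 and its z equivalent shrinks as df grows:
gaps = {df: 2.0 - t_to_z(2.0, df) for df in (10, 30, 80, 1000)}
for df, gap in gaps.items():
    print(df, round(gap, 3))
```

The gap is largest at df = 10 and is already small by df around 80, consistent with the claim that the bias becomes negligible at larger per-study sample sizes.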

The Practical Takeaway

For large, diverse meta-analytic datasets, especially with moderate or large per-study N, the p→z transformation appears to work well enough in practice. But for small-sample, small-k applications, the approximation may not hold, and z-curve estimates could be biased.

If you’re working in that small-N/small-k territory, you might consider:

  • Direct modeling of effect sizes with a selection model (e.g., weightr).
  • Sensitivity checks comparing z-curve results to alternative bias-correction methods.
  • Using t-values directly as z-values. This introduces the opposite bias, because t-values can be much larger than z-values in small samples.
  • Fitting the data with t-curve: instead of fitting a mixture model with a set of normal distributions, the model can be fitted with non-central t-distributions if all studies are small and have similar degrees of freedom.

In conclusion, converting test statistics from different tests into z-values is an approximation that can introduce bias. In practice, this is only a problem when sample sizes are small (N < 30) and the number of studies is limited.
