Hypothesis Testing and Estimation with Standardized Standard Errors

Inferential statistics in psychology lacks a unified practice. Debates about statistical inference tend to organize around a 2 × 2 structure: one dimension distinguishes frequentist from Bayesian approaches, the other distinguishes hypothesis testing from effect-size estimation. This produces four familiar schools: frequentist hypothesis testing (t tests, ANOVAs, p < .05), Bayesian hypothesis testing (Bayes factors, as advocated by Wagenmakers and Rouder), frequentist effect-size estimation (the New Statistics, confidence intervals, meta-analysis), and Bayesian effect-size estimation (posterior distributions, ROPE). The categories are not mutually exclusive in statistical theory, but as advocacy traditions in psychology they are real and consequential.

	Hypothesis testing	Effect-size estimation
Frequentist	NHST / Fisher-Neyman-Pearson hybrid	New Statistics / estimation statistics
Bayesian	Bayes factors / Bayesian model comparison	Bayesian estimation / posterior intervals / ROPE

Each school has genuine advantages. Frequentist hypothesis testing provides a transparent decision procedure. Bayes factors can quantify evidence for a null hypothesis, which became attractive when failed replications made it plausible that some effects were exactly zero. Frequentist estimation, following Cohen’s critique that the nil hypothesis is almost never literally true, redirects attention from binary decisions to effect magnitudes. Bayesian estimation adds prior information, which can improve estimates in small samples if the prior is well-calibrated.

The schools also have genuine conflicts — about the meaning of probability, the role of prior information, and whether hypothesis testing is a coherent goal at all. These conflicts have generated decades of methodological debate and a substantial literature of mutual criticism.

What the four schools share, however, is more important than what divides them. All four presuppose that the data contain enough information to support the inferences being drawn. A Bayes factor, a p-value, a confidence interval, and a posterior distribution are all answers to the question of what the data say. None of them first asks whether the data are precise enough to say anything useful. That prior question — how much sampling error is in this study? — is the topic of this article.

How Noisy Is the Study?

Inferential statistics generalizes from observations to broader conclusions. Larger samples reduce sampling error and make inductive inference more reliable. This is well known. What is less appreciated is that sampling error is typically reported in raw units, which makes it difficult to judge whether a study is precise enough to support its conclusions.

The solution is to standardize sampling error the same way Cohen standardized effect sizes. For mean differences between two groups, Cohen’s d divides the raw mean difference by the pooled standard deviation, yielding a unit-free effect size. We can apply the same logic to the standard error. The result, the standard error of the standardized mean difference, can be called the standardized standard error (SSE). For two equal-sized groups with a total sample size of N, it is approximately:

SSE = 2 / √N

This single number captures how noisy a study is, on the same scale as d itself. For hypothesis testing, the ratio d/SSE approximates a z-score; an absolute value above 2 rejects the nil hypothesis at α = .05. For confidence intervals, the approximate 95% CI is simply d ± 2·SSE, giving a total width of 4 SSEs. Both testing and estimation, in other words, are direct functions of SSE.
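The relationships in this paragraph can be written as a short Python sketch. It assumes two equal-sized groups and uses the rounded multiplier of 2 from the text instead of 1.96; the function names are illustrative only.

```python
import math

def sse(n_total):
    """Standardized standard error for two equal groups: 2 / sqrt(N)."""
    return 2 / math.sqrt(n_total)

def z_ratio(d, n_total):
    """Signal-to-noise ratio d / SSE, which approximates a z-score."""
    return d / sse(n_total)

def approx_ci(d, n_total):
    """Approximate 95% confidence interval: d plus or minus 2 SSE."""
    half_width = 2 * sse(n_total)
    return d - half_width, d + half_width

# Illustrative values: an observed d of .5 in a study with N = 100.
d_obs, n_total = 0.5, 100
lo, hi = approx_ci(d_obs, n_total)
print(round(sse(n_total), 2))             # 0.2
print(round(z_ratio(d_obs, n_total), 2))  # 2.5
print(round(lo, 2), round(hi, 2))         # 0.1 0.9
```

All three quantities come from the same input, the total sample size, which is why testing and estimation cannot diverge in what they say about precision.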

The practical implications are immediate. Common sample sizes in psychology produce large SSEs:

N (total)	n per group	SSE
40	20	.32
100	50	.20
200	100	.14
1000	500	.06
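The table can be reproduced with a few lines of Python. The sketch below again assumes equal group sizes. It adds one column that is not in the table: the smallest observed d that would reach |d / SSE| > 2, which follows directly from the testing rule stated above.

```python
import math

def sse_table(totals):
    """Return (N, n per group, SSE, smallest significant d) for each total N."""
    rows = []
    for n_total in totals:
        s = 2 / math.sqrt(n_total)
        # An observed d must exceed 2 * SSE in absolute value to be significant.
        rows.append((n_total, n_total // 2, round(s, 2), round(2 * s, 2)))
    return rows

print("{:>6} {:>6} {:>5} {:>6}".format("N", "n", "SSE", "min d"))
for row in sse_table([40, 100, 200, 1000]):
    print("{:>6} {:>6} {:>5.2f} {:>6.2f}".format(*row))
```

With N = 40, only observed effects above roughly d = .63 can be significant, which is one way to see why significant results from small studies tend to look large.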

A study with SSE = .32 produces a 95% confidence interval roughly 1.28 d-units wide. If the observed estimate is small (d = .2), the interval runs from approximately −.44 to .84, spanning moderately negative to large positive effects. The data cannot determine even the sign of the effect, let alone its magnitude.

No statistical method removes this limitation. Bayesian estimation can incorporate prior information, but unless that prior comes from genuinely trustworthy external evidence, it cannot substitute for data that are not there. A weakly informative prior applied to a study with SSE = .32 will produce a posterior that is dominated by the prior, not the data — which is not estimation, it is prior retrieval. The study is simply not informative about small effects.

What researchers typically do in this situation is report the point estimate without its SSE, or omit confidence intervals when results are non-significant. This conceals rather than communicates the study’s imprecision. A d = .4 reported without its SSE tells readers nothing about the true population effect size.

Fool’s Gold: Effect Size Estimation in Small Samples

Small samples are occasionally defensible. If an effect is expected to be large and resources are limited, a modest goal — demonstrating that the effect is positive — may be worth pursuing. But this is a directional claim, not a magnitude claim, and the distinction matters.

Consider a study with N = 40 participants (n = 20 per group), where SSE = .32, and suppose the true effect is Cohen’s large effect, d = .8. The signal-to-noise ratio is .8 / .32 = 2.5, giving approximately 70% power at α = .05 — slightly below Cohen’s recommended 80%, but perhaps acceptable under resource constraints.
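The power figure can be checked with a normal approximation. This is a sketch under that assumption, not an exact calculation: it treats the estimate as normally distributed around the true d with standard deviation SSE, and a t-based calculation would give a slightly lower value. The function name is illustrative.

```python
import math
from statistics import NormalDist

def approx_power(true_d, n_total, alpha=0.05):
    """Approximate two-sided power, treating d / SSE as a z-score."""
    sse = 2 / math.sqrt(n_total)
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    signal_to_noise = true_d / sse
    # Probability of landing beyond either critical value.
    upper = 1 - std_normal.cdf(z_crit - signal_to_noise)
    lower = std_normal.cdf(-z_crit - signal_to_noise)
    return upper + lower

print(round(approx_power(0.8, 40), 2))
```

The result lands close to the roughly 70% quoted above; the small difference comes from using the unrounded SSE (.316 instead of .32).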

Now suppose the observed estimate is d = .8. The approximate 95% confidence interval is d = .8 ± 2(.32), or d = .16 to d = 1.44. The interval excludes zero, so the study supports a positive direction. But it simultaneously includes a small effect (d = .2), a medium effect (d = .5), a large effect (d = .8), and an extremely large effect (d > 1.0). The data say almost nothing about magnitude.

The most misleading way to report this result is to write d = .8, p < .05 — as if the point estimate were a precise measurement. It is not. It is one noisy draw from a distribution with SSE = .32. A large library of statistical methods exists for correcting, shrinking, or reweighting such estimates. These methods can help when they incorporate trustworthy external information. But they cannot manufacture precision that is not in the data. A posterior distribution is not narrower than a likelihood unless the prior carries real information — and a generic weakly-informative prior does not.

A statistically significant result from a small study may support a directional claim. It does not support a magnitude claim. Reporting d = .8 as if it were a reliable effect-size estimate, rather than a noisy estimate compatible with effects ranging from small to very large, is confusing fool’s gold for real gold.

Estimation and Hypothesis Testing with Confidence Intervals

A point estimate has little standalone value if it is compatible with qualitatively different population effect sizes. An estimate of d = .5 is uninformative if the true effect could plausibly be small, medium, or large. The confidence interval makes this uncertainty explicit — but it does more than that.

Confidence intervals are widely understood as a test of the nil hypothesis: if the interval excludes zero, the effect is statistically significant. This is the least interesting thing a confidence interval does. An interval also rejects every parameter value outside it. A confidence interval ranging from d = .2 to d = .6 rejects not only d = 0, but also d = −.3, d = −.5, and d = .8. It tells us the effect is neither absent nor large. As intervals become narrower, they reject more values and carry more information about magnitude, not just direction.
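The claim that narrower intervals reject more values can be illustrated with a short sketch. It holds a hypothetical observed estimate of d = .4 fixed and varies only the sample size; the list of candidate values is an arbitrary choice for illustration.

```python
import math

CANDIDATES = [-0.5, -0.2, 0.0, 0.2, 0.5, 0.8]

def rejected(d_obs, n_total, candidates=CANDIDATES):
    """Candidate effect sizes that fall outside the approximate 95% CI."""
    sse = 2 / math.sqrt(n_total)
    lo, hi = d_obs - 2 * sse, d_obs + 2 * sse
    return [c for c in candidates if c < lo or c > hi]

for n_total in (40, 200, 1000):
    print(n_total, rejected(0.4, n_total))
```

With N = 40 the interval rejects only d = −.5. With N = 200 it also rejects −.2, 0, and .8. With N = 1,000 it rejects every candidate except .5.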

This reveals that estimation and hypothesis testing are not competing approaches — they are two descriptions of the same inferential act. The apparent conflict arises because most discussions treat “hypothesis testing” as synonymous with nil-hypothesis testing. Rejecting the nil hypothesis answers only the directional question: is the effect positive or negative? Rejecting a broader range of values on both sides of a narrow interval answers the magnitude question: how big is the effect, approximately?

The practical implication is direct. Replacing p-values with confidence intervals does not by itself improve psychological science. A wide confidence interval is more honest than a p-value, but it is not more informative. It displays uncertainty rather than hiding it — which is progress — but it does not resolve the uncertainty. Resolution requires narrow intervals, and narrow intervals require small SSE. The reform agenda that simply substitutes confidence intervals for p-values has stopped one step short of the real argument: what psychologists need are studies with SSE small enough that the interval, wherever it lands, actually constrains the answer.

The same logic applies in reverse. A confidence interval that ranges from d = −.05 to d = .05 includes zero — so the traditional nil-hypothesis test is non-significant, and the result is typically filed away as uninformative. But this interval rejects d = .2, d = .5, d = −.2, and d = −.5. It provides strong evidence that the effect, whatever its sign, is too small to matter. Non-significance from a precise study is not a failure to find an answer. It is an answer: the effect is negligible. The asymmetry that treats only significant results as informative is a consequence of fixating on the nil hypothesis rather than reading the full interval.

Interpreting Standardized Sampling Error

Cohen’s lasting contribution was not just the standardized effect size but the norms that made it interpretable: .2 is small, .5 is medium, .8 is large. These benchmarks gave psychologists a shared vocabulary for thinking about effect magnitudes. Standardized effect-size estimates are now routinely reported in psychology articles.

No equivalent norms exist for standardized sampling error. Psychologists report d and r, but rarely the SSE that determines how much those estimates can be trusted. This is the gap Cohen’s framework left open. Just as he provided benchmarks for effect sizes, we can provide benchmarks for sampling error:

SSE = .10: small. The estimate is precise enough to support magnitude claims.
SSE = .20: moderate. Directional claims are reliable; magnitude claims require caution.
SSE = .30: large. Only directional claims about large effects are supportable.

In a two-group between-subjects design, these values correspond to total sample sizes of approximately N = 400, N = 100, and N = 44, respectively. Studies with SSE above .30 should generally be treated as exploratory. Statistically significant results from such studies are likely to produce inflated effect-size estimates — a direct consequence of the winner’s curse, where only the noisiest overestimates clear the significance threshold.
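Both claims in this paragraph can be checked with a short sketch. The first function inverts SSE = 2 / √N. The second is a simulation under a simplifying assumption that is not part of the text: observed estimates are drawn from a normal distribution centered on the true d with standard deviation SSE, and only positive significant results are kept. The true effect of d = .2, the number of simulated studies, and the seed are arbitrary illustrative choices.

```python
import math
import random

def n_for_sse(target_sse):
    """Total N (two equal groups) needed to reach a target SSE."""
    return 4 / target_sse ** 2

def simulate_selection(true_d, n_total, n_studies=20000, seed=1):
    """Mean estimate across all studies and across significant ones only."""
    rng = random.Random(seed)
    sse = 2 / math.sqrt(n_total)
    estimates = [rng.gauss(true_d, sse) for _ in range(n_studies)]
    # Keep only estimates that are significant in the positive direction.
    significant = [d for d in estimates if d / sse > 2]
    mean_all = sum(estimates) / len(estimates)
    mean_sig = sum(significant) / len(significant) if significant else float("nan")
    return mean_all, mean_sig

for target in (0.10, 0.20, 0.30):
    print(target, round(n_for_sse(target)))  # 400, 100, 44

mean_all, mean_sig = simulate_selection(true_d=0.2, n_total=40)
print(round(mean_all, 2), round(mean_sig, 2))
```

Across all simulated studies the average estimate stays near the true value, while the average among the significant ones is several times larger. That gap is the inflation described above.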

The advantage of SSE over raw sample size as a reporting standard is that sampling error depends on more than N. Repeated-measures designs, within-person contrasts, reliable outcome measures, and strong covariates all reduce SSE independently of sample size. A study with 30 participants and 50 repeated reaction-time trials per condition may have smaller SSE than a between-subjects study with N = 200. Sample-size rules of thumb cannot capture this. SSE can. By asking “how noisy is this estimate?” rather than “how many participants were there?”, researchers focus on the quantity that actually governs what inferences the data can support.
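A rough sketch shows how design can matter more than N. It relies on a formula that is not given in the text and should be read as an assumption: for a paired (within-person) difference standardized by the single-condition standard deviation, the sampling error is approximately √(2(1 − r) / n), where r is the correlation between the two conditions. The value r = .9 is purely hypothetical; averaging many trials per condition is one way such high correlations can arise.

```python
import math

def sse_between(n_total):
    """SSE for two equal independent groups: 2 / sqrt(N)."""
    return 2 / math.sqrt(n_total)

def sse_within(n_participants, r):
    """Approximate SSE of a paired difference, in single-condition SD units.

    r is the correlation between the two conditions (an assumed input).
    """
    return math.sqrt(2 * (1 - r) / n_participants)

print(round(sse_between(200), 2))     # 0.14
print(round(sse_within(30, 0.9), 2))  # 0.08
```

Under these assumptions, 30 participants measured twice yield a smaller SSE than 200 participants split into two groups. With r = 0 the within-person formula reduces to the between-subjects one for the same number of observations.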

There is also a point beyond which SSE stops being a practical concern. For very large studies (N above roughly 1,000, where SSE falls to about .06 or below), the confidence interval remains formally correct but becomes substantively less important. The point estimate is already a close approximation of the population value, and the margin of error is small enough to be reported as a footnote rather than foregrounded in the interpretation.

Large public opinion surveys work this way: the ±2 percentage point margin appears in fine print because it rarely changes the conclusion. At this scale, studies are genuinely estimating population parameters rather than noisily gesturing at them. The SSE framework matters most in the range where psychology actually operates — studies with N between 40 and 500, where sampling error is large enough to be consequential but small enough to be routinely ignored.

This perspective also discourages “sample size bragging” and the false claim that a study with N = 100,000 participants is automatically superior to a study with N = 1,000 participants. Once sample sizes exceed 1,000, other characteristics of a study matter much more than sample size.

Conclusion

The debates reviewed at the outset — frequentist versus Bayesian, testing versus estimation — are genuine. But they share a prior question that none of them fully answers: how much information does this study contain? That question has a direct, interpretable answer: SSE.

SSE answers the first question every researcher should ask about their data. The confidence interval answers the second. Together, they reduce the essential inferential task to two numbers and one formula:

d ± 2 × SSE

If the interval falls entirely above zero, the data support a positive effect. If it falls entirely below zero, the data support a negative effect. If the interval is narrow enough to exclude effects too small or too large to matter, the data constrain the answer usefully. If the interval is wide, the data are too noisy — and no statistical method, frequentist or Bayesian, changes that.
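These reading rules can be collected into one small function. This is a sketch, not a standard procedure: the cutoff for supporting a magnitude claim borrows the SSE = .10 benchmark proposed earlier, and the threshold for an effect that is "too small to matter" (d = .2 here) is a hypothetical default that researchers would need to set for their own field.

```python
def interpret(d_obs, sse, precise_sse=0.10, smallest_effect=0.20):
    """Summarize what the interval d plus or minus 2 SSE supports."""
    lo, hi = d_obs - 2 * sse, d_obs + 2 * sse
    if lo > 0:
        direction = "positive"
    elif hi < 0:
        direction = "negative"
    else:
        direction = "undetermined"
    return {
        "ci": (round(lo, 2), round(hi, 2)),
        "direction": direction,
        # Precise enough for magnitude claims (assumed cutoff).
        "magnitude_claim": sse <= precise_sse,
        # Whole interval lies inside the band of effects too small to matter.
        "negligible": -smallest_effect < lo and hi < smallest_effect,
    }

print(interpret(0.8, 0.32))   # significant, but magnitude is unconstrained
print(interpret(0.0, 0.025))  # non-significant, but informative: negligible
```

The first call describes the small-sample example from earlier sections: a positive direction with no support for a magnitude claim. The second describes the precise null result: the direction is undetermined, yet the interval rules out every effect as large as d = .2 in either direction.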

The word estimate matters throughout. A d-value is not the effect size. It is an effect-size estimate, and its standalone value depends entirely on the SSE around it. A study that produces d = .8 with SSE = .32 has not demonstrated a large effect. It has demonstrated that a large effect is one of many values consistent with the data.

Real studies are often more complex, and SSE can be harder to compute in multilevel or multivariate designs. But the principle is simple and general: interpretation requires an effect-size estimate and its sampling error. Without both, researchers are not interpreting evidence.

They are reading meaning into noise.

Replicability-Index

Improving the replicability of empirical research
