Losing Sight of the Sign: ANOVA and Significance Testing


When Fisher developed the F-test at Rothamsted Experimental Station in the 1920s, he was solving a real problem. Agricultural field trials had multiple treatment conditions — different fertilizers, different watering regimes, different crop varieties — and the question was whether any of these treatments affected yield. The F-test answered exactly that question: is there more variation between treatments than within them? Any departure from the null was practically interesting because farmers don’t care about direction — they care about which treatment produces the most wheat.

The F-test does this by squaring the differences. The test statistic is always positive. Direction disappears. This is a feature, not a bug, when you have five fertilizers and want to know whether they differ. The omnibus test screens for something worth following up. Fisher then followed up with his Least Significant Difference — ordinary pairwise comparisons gated by the significant F. The omnibus test was a screening step, not the conclusion.
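
Fisher's two-step procedure is easy to sketch in code. The following is a simplified illustration with simulated group data, using plain pairwise t-tests in place of the pooled-error LSD statistic (the group means and sample sizes are illustrative assumptions, not Fisher's data):

```python
# Omnibus F as a screening step, followed by pairwise comparisons.
# Simplified sketch: pairwise t-tests stand in for the pooled-error LSD.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Three hypothetical "fertilizer" groups; the third has a higher mean yield.
groups = [rng.normal(10, 2, 30), rng.normal(10, 2, 30), rng.normal(12, 2, 30)]

f_stat, f_p = stats.f_oneway(*groups)
if f_p < .05:  # omnibus screen passed; only now follow up pairwise
    for i, j in itertools.combinations(range(len(groups)), 2):
        t, p = stats.ttest_ind(groups[i], groups[j])
        print(f"group {i} vs {j}: t = {t:.2f}, p = {p:.4f}")
```

The structure mirrors Fisher's logic: the unsigned omnibus test gates entry to the signed pairwise comparisons, where direction finally appears.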

Psychology imported this machinery wholesale, starting in the 1940s. Gigerenzer has told the story of how Fisher’s methods were “cleansed of their agricultural odor” by textbook writers who created the null ritual — mechanical significance testing at p < .05 without specifying alternatives or computing power. But there is a more specific problem that has received less attention: the F-test, by squaring away the sign, trained psychologists to think about hypotheses in unsigned terms. “Is there an effect?” replaced “What is the effect and in which direction?”

This matters less than you might think for multi-group designs, where the omnibus F is doing real work. Testing whether five conditions differ before examining pairwise comparisons is not a ritual — it is a principled gating procedure for multiplicity control. MANOVA before univariate ANOVAs follows the same logic. These are legitimate steps in a testing hierarchy.

But psychology was not mostly running five-group designs. The workhorse experiment had two conditions: treatment versus control. With two groups totaling n observations, F(1, n−2) = t(n−2)². The tests are mathematically identical, but they present differently. The t-test has a sign: you can see whether the treatment group scored higher or lower. The F-test strips that away. Psychologists reported F(1, 58) = 4.12, p < .05 when they could have reported t(58) = 2.03, p < .05, and every reader would have seen the direction at a glance.
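
The equivalence is easy to verify numerically. A minimal sketch with simulated two-group data (the means and sample sizes are illustrative assumptions):

```python
# With two groups, one-way ANOVA's F equals the squared t statistic,
# and the p-values coincide; only the t-test carries the sign.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.normal(12.0, 3.0, 30)
control = rng.normal(10.0, 3.0, 30)

t, p_t = stats.ttest_ind(treatment, control)  # signed statistic
f, p_f = stats.f_oneway(treatment, control)   # unsigned statistic

df = len(treatment) + len(control) - 2
print(f"t({df}) = {t:.3f}, p = {p_t:.4f}")
print(f"F(1, {df}) = {f:.3f}, p = {p_f:.4f}")
assert np.isclose(f, t**2) and np.isclose(p_t, p_f)
```

Both tests reject in exactly the same cases; the only information lost in the F report is whether t was positive or negative.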

Things got silly. Mark Rubin (2022) argued that it is illegitimate to infer the direction of an effect from a two-sided test — that a significant F or two-sided t only licenses the claim that the means differ, not which is larger. Formally, he is correct that the F-test does not output a direction. But the means are part of the same analysis, and pretending you cannot look at them is a confusion of the test statistic with the inference. Observing that the treatment mean is 12.4 and the control mean is 8.7, and reporting F(1, 58) = 4.12, p < .05, does tell you the treatment increased the outcome. The test confirms the difference is unlikely under the null; the means tell you which way it goes.

This is where the critique of nil-hypothesis testing enters, and where it went partly wrong. Meehl (1967) and Cohen (1994) argued that rejecting the nil hypothesis is scientifically uninformative. They were right that “the means differ somewhere” is a weak conclusion — but this criticism applied most forcefully to the multi-group omnibus F, where Fisher intended it as a screening step. For two-group comparisons, the F-test was always testing a directional effect with the sign obscured.

Here is the point that seems to have been missed. With two groups, a significant F(1, df) at α = .05 is identical to a significant two-sided t(df) at α = .05. And a two-sided test at α = .05 is equivalent to two one-sided tests at α = .025 each. So when you reject H₀: μ₁ = μ₂ with a two-sided test and observe that the treatment mean is higher, you have also rejected the directional null H₀: μ₁ ≤ μ₂, with a one-sided p-value of p/2. The sign was always being tested; the F-test just made it invisible.
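
The halving relationship can be checked directly. A small sketch with fixed illustrative data (the values are assumptions for demonstration):

```python
# A significant two-sided t-test plus the observed sign is equivalent to a
# one-sided test at half the p-value: p_one_sided = p_two_sided / 2.
import numpy as np
from scipy import stats

treatment = np.array([12.1, 11.4, 13.0, 12.6, 10.9, 12.2])
control = np.array([10.2, 9.8, 11.1, 10.5, 9.4, 10.0])

t, p_two = stats.ttest_ind(treatment, control)
_, p_greater = stats.ttest_ind(treatment, control, alternative="greater")

print(f"t = {t:.3f}, two-sided p = {p_two:.5f}, one-sided p = {p_greater:.5f}")
# Because t > 0 here, the one-sided p is exactly half the two-sided p.
assert t > 0 and np.isclose(p_greater, p_two / 2)
```

This is why observing the sign after a two-sided rejection is not a free lunch bolted on afterwards: the directional conclusion was already being tested, at α/2 in each tail.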

This means that much of the criticism of NHST — including Gelman and Carlin’s concern about Type S (sign) errors — rests on a misunderstanding. The worry is that a significant result might have the wrong sign, especially in underpowered studies. But a two-sided test that rejects the nil hypothesis does test the sign. The alternative hypothesis is not just “the means differ” — it is partitioned into “the treatment mean is higher” and “the treatment mean is lower,” and the data tell you which one. The sign error problem is real in the sense that underpowered studies can produce unreliable estimates, but it is not a gap in the logic of the test itself. The F-test merely hid this from view.
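
The underpowered-study caveat can be made concrete with a simulation in the spirit of Gelman and Carlin's Type S error: when the true effect is small and samples are small, a nontrivial share of significant results point the wrong way. All parameters below are illustrative assumptions:

```python
# Type S error sketch: count significant two-group results whose sign
# contradicts the (small, positive) true effect. Parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_effect, n, sims = 0.1, 10, 20_000

sig = wrong_sign = 0
for _ in range(sims):
    treatment = rng.normal(true_effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    t, p = stats.ttest_ind(treatment, control)
    if p < .05:
        sig += 1
        wrong_sign += t < 0  # significant, but in the wrong direction

print(f"significant results: {sig}; wrong sign among them: {wrong_sign}")
```

The test's logic is intact in every run; what the simulation shows is an estimation problem under low power, which is exactly the distinction the paragraph above draws.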

The real lesson is that statistical tools shape how scientists think. The F-test did not just analyze data; it structured how psychologists formulated hypotheses. By squaring away the sign, it turned every research question into "is there an effect?" rather than "how big, and in which direction?" Two generations of methodologists then criticized significance testing for answering a trivial question, when the real problem was that the wrong test statistic was being used for the wrong design. Using t-tests preserves the sign. A positive, significant t-value means the treatment mean exceeds the control mean and that a difference this large is unlikely under the null hypothesis of no effect; a negative t-value implies the opposite. Even when the F-test obscures the direction, the observed means still show it, and the significant test licenses the directional inference.

All of these problems dissolve with confidence intervals. A CI that excludes zero and is entirely positive tells you the sign, the magnitude, and the uncertainty in a single object. The F-test, the t-test, and the one-sided versus two-sided debate all become unnecessary.

