The Earth is Flat: p less than .05

In 1994, my hero Jacob Cohen wrote a scathing critique of the null-hypothesis significance testing (NHST) ritual with the funny title “The Earth is round, p < .05.” The article is a classic and has been cited over 2,000 times.

As much as I admire Cohen, and as much as I like the article, the article failed to have an impact on the way psychologists conduct research.

I believe Cohen’s critique had little impact because he was concerned about the wrong problem. As the title indicates, Cohen was concerned that psychologists use the NHST ritual to test hypotheses that are known to be true. As a result, significance tests can produce only two results. Either they produce a significant result that merely confirms something that is already known, or, even worse, the statistical test fails to produce support for a correct hypothesis and the data are discarded (The Earth is Round, p > .05). No real information is gained from this ritual.

With the benefit of hindsight and scandals like Bem’s (2011) ESP article and the failure to replicate hundreds of ego-depletion and social priming studies, I think it is time to revisit Cohen’s article. The real problem with the NHST ritual is not that obviously true hypotheses are published with significant p-values. The real problem is that many false claims are published with p < .05. That is, in psychology the Earth is flat, p < .05. For example, erotic stimuli have time-reversed effects on extraverts’ feelings, p < .05; subliminal primes make people walk slower, p < .05; not eating a chocolate cookie makes people give up on a hard problem, p < .05; and showing a picture of Einstein makes people solve fewer problems, p < .05.

In addition, Cohen’s criticism of NHST was unfair and surprising, given his understanding of the proper use of NHST. As he suggested himself, the article was written more out of frustration that psychologists had failed to listen to his advice on how they could improve psychological science.

“Like many men my age, I mostly grouse.”

The article is valuable because it documents Cohen’s frustration with psychologists’ inability to improve their science. It is not a valid criticism of NHST.

The Nil-Hypothesis

I believe Cohen was the first to introduce the term nil-hypothesis to distinguish the default null-hypothesis that there is no effect from other null-hypotheses that specify a particular effect size (e.g., the minimal effect size that is theoretically or practically relevant).

“As if things were not bad enough in the interpretation, or misinterpretation, of NHST in this general sense, things get downright ridiculous when Ho is to the effect that the effect size (ES) is 0—that the population mean difference is 0, that the correlation is 0, that the proportion of males is .50, that the raters’ reliability is 0 (an Ho that can almost always be rejected, even with a small sample—Heaven help us!)” (p. 1000).

Cohen himself shows that this criticism of NHST is decades old.

More recently I discovered that in 1938, Berkson wrote, “It would be agreed by statisticians that a large sample is always better than a small sample. If, then, we know in advance the P that will result from an application of the Chi-square test to a large sample, there would seem to be no use in doing it on a smaller one. But since the result of the former test is known, it is no test at all” (p. 526).

Tukey (1991) wrote that “It is foolish to ask ‘Are the effects of A and B different?’ They are always different— for some decimal place” (p. 100).

This criticism of NHST is flawed because most of the time, NHST is used to determine whether an effect is positive or negative. Researchers rarely conduct a statistical test and then merely claim that there is a difference in either direction, p < .05. Just imagine a study of height differences between men and women whose conclusion is that there is a gender difference in height, without any mention of the direction of the difference. This is outright ridiculous, and so is the criticism that NHST merely confirms something we already know (the Earth is round, p < .05).

While it is true that we know a priori that the nil-hypothesis is false in the trivial sense that any point prediction is likely to be false, we do not know a priori whether most people are left-handed or right-handed, whether IQ correlates positively or negatively with happiness, or whether psychology is more or less replicable than economics.
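To put rough numbers on this point, here is a minimal Python sketch. It assumes a two-sided, two-sample z-test with known unit variance and a hypothetical true effect of d = 0.01; both the test and the effect size are my illustrative choices, not values from Cohen's article. It computes the probability of rejecting the nil-hypothesis with the correct sign and with the wrong sign as the sample grows.

```python
from math import sqrt
from statistics import NormalDist


def rejection_probabilities(d, n_per_group, alpha=0.05):
    """Two-sided, two-sample z-test with known unit variance (an assumption).

    d is a hypothetical true standardized mean difference (d >= 0).
    Returns (P(significant with the correct sign), P(significant with the wrong sign)).
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)      # about 1.96 for alpha = .05
    shift = d * sqrt(n_per_group / 2)    # expected value of the z statistic
    correct_sign = 1 - z.cdf(crit - shift)
    wrong_sign = z.cdf(-crit - shift)
    return correct_sign, wrong_sign


for n in (100, 10_000, 1_000_000):
    correct, wrong = rejection_probabilities(d=0.01, n_per_group=n)
    print(f"n per group = {n:>9,}: correct sign {correct:.3f}, wrong sign {wrong:.3f}")
```

Under these assumptions, rejection of the nil-hypothesis becomes nearly certain once the sample is large enough, which is the trivial sense in which the outcome is known in advance. The sign of a significant result, however, is close to a coin flip in small samples and becomes trustworthy only as the sample grows, so the direction is the part that carries information.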

A simple way to drive home this point is to take a step back and consider one-tailed significance tests, which are the most appropriate tests for a directional hypothesis. For example, we could test the directional hypothesis that psychology is more replicable than economics. Nobody would say that confirming this hypothesis is trivial because the answer is known. The problem with a one-tailed test is that we are only allowed to draw inferences from the result if it allows us to reject the null-hypothesis (NOT THE NIL-HYPOTHESIS) that economics is more replicable than psychology. If, however, the data support the hypothesis that economics is more replicable than psychology, we are not allowed to accept the null-hypothesis. This is awkward if we have no real a priori reason to favor one discipline over the other.

A solution to this problem is to conduct two one-tailed tests. We first test H0: economics is more replicable than psychology, and then we test H0: psychology is more replicable than economics. The problem with this approach is that conducting two tests creates a multiple comparison problem. If we conduct both tests with alpha = .05, the overall type-I error risk is .05 * 2 = .10.

There is a simple solution to the multiple comparison problem. We simply lower the alpha of our one-sided tests by dividing it by the number of tests. Thus, we can conduct both one-sided tests with alpha = .025. Now we can draw inferences with a long-run error rate of .025 * 2 = .05.
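The arithmetic can be checked with a small simulation. This is only a sketch under assumptions of my own: the test statistic is a standard normal z-score, and the true difference is exactly zero, which is the boundary case where both one-sided null-hypotheses are at their most vulnerable. The function name and the number of simulated studies are arbitrary.

```python
import random
from statistics import NormalDist


def overall_error_rate(alpha_per_test, n_sims=100_000, seed=1):
    """Share of simulated studies in which at least one of the two
    opposite one-sided z-tests rejects, when the true difference is zero."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha_per_test)
    rejections = 0
    for _ in range(n_sims):
        z = rng.gauss(0.0, 1.0)       # z statistic of one simulated study
        if z > crit or z < -crit:     # either one-sided test is significant
            rejections += 1
    return rejections / n_sims


print("two one-sided tests at alpha = .05: ", overall_error_rate(0.05))
print("two one-sided tests at alpha = .025:", overall_error_rate(0.025))
```

Because the two rejection regions lie in opposite tails, they cannot both occur in the same study, so the two error rates simply add up: roughly .10 in the first case and roughly .05 in the second, up to simulation noise.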

As a two-sided test of the nil-hypothesis allocates half of the error probability to each of the two tails, we can easily see that we are essentially conducting two one-sided tests with alpha = .025 when we test the nil-hypothesis with alpha = .05.
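The equivalence can also be shown directly on the decision rule. The sketch below again assumes a z-test, and the function names are hypothetical. It compares the decision of a two-sided test at alpha = .05 with the decision of two one-sided tests at alpha = .025 for a few example z-scores.

```python
from statistics import NormalDist

Z = NormalDist()


def two_sided_reject(z, alpha=0.05):
    """Two-sided test of the nil-hypothesis: is the two-tailed p-value below alpha?"""
    p_two_sided = 2 * (1 - Z.cdf(abs(z)))
    return p_two_sided < alpha


def directional_claim(z, alpha=0.05):
    """Two one-sided tests, each conducted at alpha / 2.

    Returns the direction we may claim, or None if neither test is significant.
    """
    p_positive = 1 - Z.cdf(z)   # test of H0: effect <= 0
    p_negative = Z.cdf(z)       # test of H0: effect >= 0
    if p_positive < alpha / 2:
        return "positive"
    if p_negative < alpha / 2:
        return "negative"
    return None


for z in (-2.5, -1.0, 0.3, 1.9, 2.1):
    print(z, two_sided_reject(z), directional_claim(z))
```

Whenever the two-sided test rejects the nil-hypothesis, exactly one of the one-sided tests rejects as well, and it tells us which sign to claim. The two procedures share the same critical value and differ only in how the conclusion is worded.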

In conclusion, the NHST ritual is not necessarily ridiculous. When we can think of the nil-hypothesis as the boundary between two regions of interest (all positive values vs. all negative values), nil-hypothesis testing is essentially a test of the direction of an effect.

Whether it is possible to build a science on directional effects without information about the size of an effect is another question. However, for much low-budget, exploratory research, merely demonstrating the direction of an effect is difficult, as the replication crisis shows. For many textbook findings, we do not even know what the real sign of an effect is. Thus, the real problem was not that nil-hypothesis testing cannot produce meaningful results. The real problem was that nil-hypothesis testing was used to publish only confirming results with p < .05, which is clearly an abuse of NHST. This obvious fact was stated by Anthony Greenwald in 1975.

“First, it is a truly gross ethical violation for a researcher to suppress reporting of difficult-to-explain or embarrassing data in order to present a neat and attractive package to a journal editor.”

“Second, it is to be hoped that journal editors will base publication decisions on criteria of importance and methodological soundness, uninfluenced by whether a result supports or rejects a null hypothesis.”

Since 2011, it has been clear that social psychology in particular has violated these simple rules of good science. Thus, it is wrong to blame NHST for the crisis in psychological science. No method is immune to blatant abuse that hides the truth that many attempts to demonstrate an effect failed.
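A small simulation illustrates how this abuse, rather than NHST itself, produces a misleading literature. It is a deliberately crude sketch: the true effect is set to zero, each study has a hypothetical 20 participants per group, the variance is treated as known, and a study is "published" only if it is significant in the predicted direction. None of these numbers come from a real dataset.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_literature(true_d=0.0, n_per_group=20, n_studies=20_000, seed=1):
    """Simulate effect-size estimates and a caricature of selective publication.

    Returns (all_estimates, published_estimates), where a study is published
    only if it is significant (two-sided alpha = .05) in the predicted,
    positive direction.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.975)
    se = sqrt(2 / n_per_group)          # standard error of d with unit variance
    all_estimates, published = [], []
    for _ in range(n_studies):
        d_hat = rng.gauss(true_d, se)   # observed effect size in one study
        all_estimates.append(d_hat)
        if d_hat / se > crit:           # the selection filter
            published.append(d_hat)
    return all_estimates, published


all_d, pub_d = simulate_literature()
print(f"share of studies published:  {len(pub_d) / len(all_d):.3f}")
print(f"mean effect, all studies:    {mean(all_d):.3f}")
print(f"mean effect, published only: {mean(pub_d):.3f}")
```

In this setup every published study reports a significant effect in the predicted direction, although the true effect is zero. The complete set of studies would tell the truth; it is the filter that turns a valid method into a source of false claims.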

Cohen’s contribution to psychological science should not be judged by his 1994 article. I don’t know how frustrated he was by psychologists’ ignorance of his attempts to improve psychological science, but his 1994 article shows this frustration, and when we are frustrated, we do not think that clearly. If psychologists had listened to Cohen (1962, 1988), we would not have a crisis of confidence. We owe it to him to listen, at least now.

Replicability-Index

Improving the replicability of empirical research
