Two decades ago, Wagenmakers (2007) started his crusade against p-values. His article “A practical solution to the pervasive problems of p-values” (PPPV) has been cited over 800 times, and it is Wagenmakers’ most cited original article (he also contributed to the OSC, 2015, reproducibility project, which has already garnered over 1,000 citations).
In PPPV, Wagenmakers claims that statisticians have identified three fundamental problems of p-values: (a) p-values do not quantify statistical evidence, (b) p-values depend on hypothetical data, and (c) p-values depend on researchers’ unknown intentions.
When I read the article many years ago, statistics was a side-interest for me, and I didn’t fully understand the article. Since the replication crisis started in 2011, I have learned a lot about statistics, and I am ready to share my thoughts about Wagenmakers’ critique of p-values. In short, I think Wagenmakers’ arguments are a load of rubbish and the proposed solution to use Bayesian model comparisons is likely to make matters worse.
P-Values Depend on Hypothetical Data
Most readers of this blog post are familiar with the way p-values are computed. Some data are observed. Based on these observed data, an effect size is estimated. In addition, sampling error is computed either based on sample size alone or based on observed information about the distribution of observations (variance). The ratio of the effect size and the sampling error is used to compute a test statistic. To be clear, the same test statistics are used in frequentist statistics with p-values as in Bayesian statistics. So, any problems that occur during these steps are the same for p-values and Bayesian statistics.
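The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the scores are made up, the function name is hypothetical, and the p-value uses a normal approximation (a t-distribution would be the usual choice for a small sample, but it is not in the standard library).

```python
import math
from statistics import NormalDist, mean, stdev

def z_test_p_value(sample, null_mean=0.0):
    """Two-sided p-value from the ratio of effect estimate to sampling error."""
    n = len(sample)
    effect = mean(sample) - null_mean              # estimated effect
    sampling_error = stdev(sample) / math.sqrt(n)  # standard error of the mean
    z = effect / sampling_error                    # test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))         # normal approximation
    return z, p

# Made-up observations, for illustration only.
scores = [0.8, -0.2, 1.1, 0.4, 0.9, -0.1, 0.6, 1.3, 0.2, 0.7]
z, p = z_test_p_value(scores)
print(f"z = {z:.2f}, p = {p:.5f}")
```

The only step that involves H0 is the last one, where the observed test statistic is compared to the distribution expected under the null-hypothesis.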
What are the hypothetical data that Wagenmakers sees as a problem?
“These hypothetical data are data expected under H0, without which it is impossible to construct the sampling distribution of the test statistic t(xrep | H0).”
Two things should be immediately obvious. First, the hypothetical data are no more or less hypothetical than the null-hypothesis. The null-hypothesis is hypothetical (hypothesis – hypothetical, see the connection) and, based on the null-hypothesis, predictions about the distribution of a test statistic are made. The actual data are then compared to this prediction. There are no hypothetical data. There is a hypothetical distribution and an actual test statistic. Inferences are based on the comparison. Second, the “hypothetical data” that are expected under H0 are also expected in a Bayesian statistical framework because the same sampling distribution is used to compute the Bayesian Information Criterion or a Bayes Factor.
In short, it is easy to see that Wagenmakers’ problem is not a problem at all. Theories and hypotheses are abstractions. To use inferential statistics, the predictions have to be translated into a sampling distribution of a test statistic.
Wagenmakers presents an example from Pratt (1962) in full to drive home his point; and I reproduce this example again in full.
An engineer draws a random sample of electron tubes and measures the plate voltage under certain conditions with a very accurate volt-meter, accurate enough so that measurement error is negligible compared with the variability of the tubes. A statistician examines the measurements, which look normally distributed and vary from 75 to 99 volts with a mean of 87 and a standard deviation of 4. He makes the ordinary normal analysis, giving a confidence interval for the true mean. Later he visits the engineer’s laboratory, and notices that the volt meter used reads only as far as 100, so the population appears to be “censored.” This necessitates a new analysis, if the statistician is orthodox. However, the engineer says he has another meter, equally accurate and reading to 1000 volts, which he would have used if any voltage had been over 100. This is a relief to the orthodox statistician, because it means the population was effectively uncensored after all. But the next day the engineer telephones and says: “I just discovered my high-range volt-meter was not working the day I did the experiment you analyzed for me.” The statistician ascertains that the engineer would not have held up the experiment until the meter was fixed, and informs him that a new analysis will be required. The engineer is astounded. He says: “But the experiment turned out just the same as if the high-range meter had been working. I obtained the precise voltages of my sample anyway, so I learned exactly what I would have learned if the high-range meter had been available. Next you’ll be asking me about my
What is the problem here? Truncating the measure at 100 changes the statistical model. If we have to suspect that the data are truncated, we cannot use a statistical model that assumes a normal distribution. We could use a non-parametric test to get a p-value or a more sophisticated model that models the truncation process. This model would notice that there is little truncation in these hypothetical data because there are actually no values greater than 100.
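To make the last point concrete, here is a rough sketch of what a model of the measurement ceiling could look like. It is my own illustration, not part of Pratt's or Wagenmakers' argument: I assume a normal model in which any reading at the 100-volt limit only tells us that the true value is 100 or more, and the voltages are made-up values in the 75 to 99 range of the example.

```python
import math
from statistics import NormalDist

def plain_loglik(values, mu, sigma):
    """Ordinary normal log-likelihood, ignoring the meter's ceiling."""
    dist = NormalDist(mu, sigma)
    return sum(math.log(dist.pdf(v)) for v in values)

def censored_loglik(values, mu, sigma, limit=100.0):
    """Normal log-likelihood when readings at `limit` mean 'limit or more'."""
    dist = NormalDist(mu, sigma)
    total = 0.0
    for v in values:
        if v >= limit:
            total += math.log(1 - dist.cdf(limit))  # only know the value is >= limit
        else:
            total += math.log(dist.pdf(v))          # exact reading
    return total

# Hypothetical voltages; none reaches the 100-volt ceiling.
voltages = [75, 81, 84, 85, 86, 87, 87, 88, 89, 90, 93, 99]
print(plain_loglik(voltages, 87, 4), censored_loglik(voltages, 87, 4))
```

When no reading hits the ceiling, the two log-likelihoods are identical. They only diverge once a reading sits at the limit, for example after appending a value of 100 to the list.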
Thus, this example merely illustrates that statistical inferences depend on the proper modeling of the sampling distribution of a test statistic. All statistical inferences are only valid if the assumptions of the statistical model hold. Otherwise, all bets are off. Most important, this is also true for Bayesian statistics because they rely on the same test statistics and distribution assumptions as p-values. There is nothing magical about Bayes Factors that would allow them to produce valid inferences when distribution assumptions are violated.
P-Values Depend on Researchers’ Intentions
The second alleged problem of p-values is that they depend on researchers’ intentions.
“The same data may yield quite different p values, depending on the intention with which the experiment was carried out.”
This fact is illustrated with several examples like this one.
Imagine that you answered 9 out of 12 questions about statistics correctly (if it were possible to say what is correct and what is false), and I wanted to compute the p-value for the hypothesis that you were simply guessing. The two-sided p-value is p = .146 if we assume that the test was planned to have 12 questions in total. However, if the plan was to keep asking questions until the third incorrect answer, the p-value is .033.
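Both numbers can be checked with a few lines of Python. The sketch assumes that the two sampling plans are the ones from Wagenmakers' example: a fixed test of 12 questions versus asking questions until the third incorrect answer. The function names are my own.

```python
from math import comb

def fixed_n_p_value(correct=9, total=12):
    """Two-sided binomial p-value under guessing (theta = .5), fixed test length."""
    upper_tail = sum(comb(total, k) for k in range(correct, total + 1)) / 2 ** total
    return 2 * upper_tail  # the distribution is symmetric at theta = .5

def stop_at_third_error_p_value(total=12, errors=3):
    """Probability of needing `total` or more questions to reach the third error.

    This happens when the first total - 1 questions contain at most
    errors - 1 incorrect answers.
    """
    first = total - 1
    return sum(comb(first, k) for k in range(errors)) / 2 ** first

print(round(fixed_n_p_value(), 3))              # 0.146
print(round(stop_at_third_error_p_value(), 3))  # 0.033
```

The data are the same in both cases. What changes is the statistical model of how the data were generated, and with it the sampling distribution.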
Since 2011, it has been well known that data peeking alters the statistical model and that optional stopping alters p-values. If the decision to terminate data collection was in any way systematically influenced by some previous results, a p-value that assumes no data peeking occurred is wrong because it is based on the wrong statistical model. Undisclosed checking of data is now known as a questionable research practice (John et al., 2012). Thus, Wagenmakers’ example merely shows that p-values cannot be trusted when researchers engage in questionable research practices. It does not show that p-values are inherently flawed.
How do Bayesian statistics avoid this problem? They avoid the problem only partially. Bayes Factors always express information as a comparison between two models. As long as researchers peek at the data and continue because the data do not favor either model, data peeking does not introduce a bias. However, if they peek and continue data collection until the data favor one model, Bayesian statistics would be just as biased by data peeking as the use of p-values. Even data peeking with inconclusive data can be biased if one of the models is implausible and would never receive support. In this case, the data can only produce evidence for one model or be undecided, which leads to the same problem that Wagenmakers sees with p-values. For example, testing the null-hypothesis against Wagenmakers’ prior that assumes large effects of 1 SD or more would eventually produce evidence for the null-hypothesis, even if it were false, because the data can never produce support for the implausible alternative hypothesis.
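A small simulation shows how much undisclosed peeking can inflate the false positive rate. Everything in it is an assumption chosen for illustration: the null-hypothesis is true, the standard deviation is known to be 1, the researcher looks at the data after every 10 observations up to N = 100, and a study counts as significant if any look produces p < .05.

```python
import math
import random
from statistics import NormalDist

def false_positive_rate(peek, n_sims=2000, max_n=100, step=10, alpha=0.05, seed=1):
    """Share of simulated null studies with at least one significant z-test."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        significant = False
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)       # H0 is true: mean 0, SD 1
            is_look = (peek and n % step == 0) or n == max_n
            if is_look and abs(total / math.sqrt(n)) > crit:
                significant = True             # would have stopped and reported here
        hits += significant
    return hits / n_sims

print("fixed N:", false_positive_rate(peek=False))
print("peeking:", false_positive_rate(peek=True))
```

With a fixed sample size the rate stays close to the nominal 5%, whereas ten looks push it far above that. The p-value is only wrong in the second case because it is computed from a model that ignores the looks.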
In conclusion, the second argument is a good reason for preregistration and against the use of questionable research practices, but not a good argument against p-values.
P-Values Do Not Quantify Statistical Evidence
The third claim is probably the most surprising for users of p-values. The main reason for computing p-values is that they are considered to be a common metric that can be used across different types of studies. Everything else being equal, a lower p-value is assumed to provide stronger evidence against the null-hypothesis.
In the Fisherian framework of statistical hypothesis testing, a p value is meant to indicate “the strength of the evidence against the hypothesis” (Fisher, 1958, p. 80).
What are the chances that all textbook writers got this wrong?
To make his point, Wagenmakers uses the ambiguity of everyday language and decides that “the most common and well-worked-out definition is the Bayesian definition.”
Nobody is surprised that p-values do not provide evidence given a Bayesian definition of evidence, just like nobody would be surprised that Bayes Factors do not provide information about the long-run probability of false positive discoveries.
What is surprising is that Wagenmakers provides no argument. Instead, he reviews some surveys of statisticians and psychologists that examined the influence of sample size on the evaluation of identical p-values.
For example, which study produces stronger evidence against the null-hypothesis: a study with N = 300 and p = .01 or a study with N = 30 and p = .01? Most statisticians favor the larger study. A quick survey in the Psychological Methods Discussion Group confirmed this finding: 37 respondents favored the larger sample, 7 said no difference, and 4 favored the smaller sample.
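It helps to see what the two studies imply about the effect size. The following sketch is a simplification of my own: it assumes a one-sample design with a known standard deviation of 1, so that the standardized effect is d = z / sqrt(N), and it uses a normal approximation for the 95% confidence interval.

```python
import math
from statistics import NormalDist

def implied_effect(p_two_sided, n):
    """Standardized effect and 95% CI implied by a two-sided p-value and sample size."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)  # test statistic behind the p-value
    se = 1 / math.sqrt(n)                          # sampling error with SD = 1
    d = z * se
    return d, (d - 1.96 * se, d + 1.96 * se)

for n in (30, 300):
    d, (low, high) = implied_effect(0.01, n)
    print(f"N = {n}: d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The test statistic is the same in both studies. The smaller study pairs a larger effect size estimate with a wider interval, and the larger study pairs a smaller estimate with a narrower one.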
Although this is interesting, it does not answer the question whether a p-value of .0001 provides stronger evidence against the null-hypothesis than a p-value of .10, which is the question at hand.
So, Wagenmakers’ strongest argument against p-values, that they are misinterpreted as a measure of strength of evidence, is not an argument at all.
In short, Wagenmakers has been successful in casting doubt on the use of p-values amongst psychologists. He was able to do so because statistics training in psychology is poor and most users of p-values have only a vague understanding of the underlying statistical theory. As a result, they are swayed by strong claims that they cannot evaluate. It took me some time, and time away from my original research, to understand these issues. In my opinion, Wagenmakers’ critique falls apart under closer scrutiny.
The main problem of p-values is that they are not Bayesian, but that is only a problem if you like Bayesian statistics. For most practical purposes, p-values and Bayes Factors lead to the same conclusions regarding the rejection of the null-hypothesis. In addition, Bayes Factors offer the false promise that they can provide evidence for the nil-hypothesis, but that is the topic of another blog post.
The real problem in psychological science is not the use of p-values, but the abuse of p-values. That is, a study with N = 30 participants and p = .01 would produce just as much evidence as a study with N = 300 and p = .01, if we did not have to worry that the researcher with N = 30 also ran 300 participants, but only presented the results of one study that produced a significant result by chance. For this reason, I have invested my time and energy in studying the real power of studies to produce significant results and in detecting the use of questionable research practices. It does not matter to me whether effect size estimates and sampling error are reported as confidence intervals, converted into p-values, or reported as Bayes Factors. What matters is that the results are credible and strong claims are supported by strong evidence, no matter how it is reported.
Related Blog Posts
Why Wagenmakers is wrong (about Bayesian Analysis of Bem, 2011)