Psychology is in a crisis. The crisis is ongoing and at least 60 years old (Sterling, 1959). Since 2011, psychologists have become increasingly aware that there is a crisis, and there are some signs of improvement. However, there is also a lot of confusion about the extent of the crisis and its cause.
The root cause of the crisis is poor training and understanding of statistics. Most psychologists, especially in North America, receive about four introductory courses on statistics, taught by instructors with equally poor training in statistics. As statistics is essential to analyzing the noisy data psychologists tend to produce in their studies, many things can go wrong when psychologists compute, report, or interpret statistical results.
The most important statistic that psychologists compute and report is the p-value, which is compared to a criterion value, to claim that a hypothesis was supported; as predicted, yadi yadi yada, p < .05.
The reporting of p-values without understanding what they mean has been called a ritual. Bla, bla bla, p < .05. Also, bla bla bla, p < .05. Claims about statistical significance are used to give the allure of rigorous science to claims about human nature (willpower is like a muscle).
Some psychologists like to blame statisticians for the crisis in psychology. After all, they invented p-values and significance testing. So, if psychologists used these tools and are in a crisis, statisticians are to blame for the crisis. This is utter nonsense. Fisher, Neyman, Pearson, and living statisticians would be appalled by the p-value ritual performed by psychologists.
Proper Interpretation of P-Values
Gigerenzer (2002) pointed out that humans, especially those with little training in math, are better at processing frequency information than probabilities. Using frequency tables, it is very simple to understand significance testing with p-values.
Most important, exact p-values are difficult to interpret, but that does not mean they are useless. In fact, the exact p-value is of no particular importance for significance testing. What matters is the comparison of p-values to a criterion value. This criterion value is called alpha. If the p-value is below alpha, a result is said to be statistically significant. If the p-value is above alpha, the result is said to be non-significant. We can interpret the outcome that p < alpha or that p > alpha. We cannot interpret the outcome that p = x.
So, the real question is not what p-values mean, but what it means that p is less than alpha.
To answer this question, we have to use hypothetical scenarios because we never truly know whether a hypothesis is true or false.
Scenario A assumes that a researcher tests 100 hypotheses with alpha = .05 as the criterion. Importantly, unbeknownst to the researcher, all of the hypotheses are true. Thus, none of the statistical tests can produce a false positive result. The percentage of significant results depends on the power of the test. To simplify things, I assume 50% power. Given these assumptions, we can fill in the 2 x 2 frequency table that crosses the truth of a hypothesis, true vs. false, with the statistical result, significant (SIG) vs. non-significant (NS): TRUE & SIG = 50, TRUE & NS = 50, FALSE & SIG = 0, FALSE & NS = 0.
The percentage of false positive results (FALSE & SIG) is 0/100 = 0%. In frequentist statistics, percentages are equivalent to probabilities. So, the probability of a false positive result in this set of studies is 0.
Scenario B is the other extreme, where unbeknownst to the researcher all of the hypotheses are false (e.g., Bem, 2011). We do not need to make assumptions about power because there are no true hypotheses. With alpha = .05, we expect FALSE & SIG = 5 and FALSE & NS = 95.
In this scenario, the percentage of false positive results (FALSE & SIG) is 5/100 = 5%. This is not a coincidence. Alpha sets the upper limit of the percentage of false positive results among all tests.
Maximum Number of False Positives = k * alpha, with k = number of studies conducted
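If the percentage of false hypotheses is somewhere between 0 (Scenario A) and 100 (Scenario B), the percentage of false positives will be between 0 and alpha. For example, if there were 60 false hypotheses and 40 true hypotheses (Scenario C), 60 * .05 = 3 tests, or 3% of all tests, would be false positives. With 50% power, the 40 true hypotheses add 20 true positives, for a total of 23 significant results.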
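The three scenarios can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the function name `expected_table` and its parameters are hypothetical, and it assumes the values used in the text (alpha = .05, 50% power, 100 tests). It returns expected frequencies, not simulated data.

```python
def expected_table(n_true, n_false, alpha=0.05, power=0.5):
    """Expected 2 x 2 frequencies: truth of the hypothesis by test outcome.

    Assumes every test of a true hypothesis has the same power and every
    test of a false hypothesis is significant with probability alpha.
    """
    return {
        ("TRUE", "SIG"): n_true * power,          # true positives
        ("TRUE", "NS"): n_true * (1 - power),     # false negatives
        ("FALSE", "SIG"): n_false * alpha,        # false positives
        ("FALSE", "NS"): n_false * (1 - alpha),   # true negatives
    }


# (number of true hypotheses, number of false hypotheses) per scenario
scenarios = {"A": (100, 0), "B": (0, 100), "C": (40, 60)}

for name, (n_true, n_false) in scenarios.items():
    table = expected_table(n_true, n_false)
    n_sig = table[("TRUE", "SIG")] + table[("FALSE", "SIG")]
    print(f"Scenario {name}: false positives = {table[('FALSE', 'SIG')]:.0f}, "
          f"significant results = {n_sig:.0f}")
```

In all three scenarios the FALSE & SIG cell stays at or below k * alpha = 5, which is the point of the argument: the share of false hypotheses is unknown, but the ceiling is not.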
In reality, the true percentage of false positives is never known because this would require knowledge about the truth of hypotheses. However, the maximum number of false positives is known and is set by alpha. This makes interpretation of p-values in relation to alpha easy. We can say that the long-run frequency of false positives is no greater than alpha. It could be less than alpha, but it cannot be more than alpha. Thus, the probability of a false positive result is at most alpha.
In scenario A, 50 significant results were observed. Given k = 100 and alpha = .05, no more than 5 of these significant results can be false positives. Thus, at least 45 (45/50 = 90%) of the significant results are true positives.
In scenario B, only 5 significant results were observed. As we would expect 5 significant results if all hypotheses were false, none of the significant results is likely to be a true positive.
In scenario C (60 false and 40 true hypotheses), 23 significant results were observed. As up to 5 results could be false positives, at least 18 significant results (18/23 = 78%) are true positives.
In conclusion, the comparison of p-values to a criterion value alpha makes it possible to create a worst-case scenario with a maximum number of false positive results, which is set by the criterion value alpha. Credible results would have a high proportion of true positives among the significant results.
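The worst-case calculation for the three scenarios can be written as a small function. This is an illustrative sketch under the assumptions of the text (all k tests are known and alpha = .05). The name `min_true_positive_share` is hypothetical.

```python
def min_true_positive_share(k, n_sig, alpha=0.05):
    """Lower bound on the share of true positives among significant results.

    k      -- number of tests conducted (all of them, not only the published ones)
    n_sig  -- number of significant results observed
    alpha  -- significance criterion
    """
    if n_sig <= 0:
        raise ValueError("need at least one significant result")
    # Worst case: every hypothesis is false, so k * alpha false positives
    # are expected, capped at the number of significant results observed.
    max_false_positives = min(k * alpha, n_sig)
    min_true_positives = n_sig - max_false_positives
    return min_true_positives / n_sig


for name, n_sig in {"A": 50, "B": 5, "C": 23}.items():
    share = min_true_positive_share(k=100, n_sig=n_sig)
    print(f"Scenario {name}: at least {share:.0%} true positives")
```

The bound depends on k, the number of all tests conducted. That dependence is what makes selective publication a problem for interpreting significant results.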
The Real Problem with Psychologists' Use of P-Values
Exactly 60 years ago, Sterling (1959) published his survey of statistical tests in psychology journals. The article should be mandatory reading for every aspiring psychological scientist. Sterling’s article starts with a debate about the use of .05 as an “industry standard” to evaluate statistical results. More important, he points out that using such a standard to publish only significant results is extremely problematic.
General adherence to such a rigid strategy is interesting by itself but might have no further consequences on the decisions reached. However, when a fixed level of significance is used as a critical criterion for selecting reports for dissemination in professional journals it may result in embarrassing and unanticipated results (p. 31).
The problem is that we can no longer quantify the percentage of significant results. Once only significant results are published, the percentage of significant results is by definition 100%. It would be wrong to assume that only 5% of these significant results can be false positives because the 5% value applies to all studies that were conducted, not to the set of studies with significant results.
For scenario A, we would falsely assume that only 50 studies were conducted and expect no more than 50*.05 = 2.5 false positives, when the real maximum is 5.
For scenario B, we would falsely assume that only 5 studies were conducted and expect no more than 5*.05 = 0.25 false positives, when the real maximum is 5.
For scenario C, we would falsely assume that only 23 studies were conducted and expect no more than 23*.05 = 1.15 false positives, when the real maximum is 5.
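The gap between the naive bound and the real one is easy to tabulate. The sketch below uses a hypothetical helper, `false_positive_bounds`, and assumes the same k = 100 and alpha = .05 as the scenarios above.

```python
def false_positive_bounds(k, n_sig, alpha=0.05):
    """Return (naive, actual) upper bounds on the number of false positives.

    naive  -- bound a reader computes when only the n_sig published
              significant results are visible
    actual -- bound based on all k studies that were conducted
    """
    naive = n_sig * alpha
    actual = k * alpha
    return naive, actual


for name, n_sig in {"A": 50, "B": 5, "C": 23}.items():
    naive, actual = false_positive_bounds(k=100, n_sig=n_sig)
    print(f"Scenario {name}: naive bound = {naive:.2f}, actual bound = {actual:.2f}")
```

The naive bound shrinks with the number of published results while the actual bound does not, so the reader's underestimate is largest when few results reach significance (Scenario B).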
Without knowing the percentage of studies that produced significant results, p < .05 is meaningless. Maybe this explains why psychologists have such a hard time interpreting p-values. Statistical significance has been used as a criterion to write up a result or to publish results for so long that it may seem as if the main purpose of significance testing is to find publishable results. This is wrong. Publishing only significant results makes p-values and claims about statistical significance meaningless.
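A small simulation makes the same point. It is a sketch under simplifying assumptions that are not in the original text: each test is reduced to a coin flip that comes up significant with probability alpha (false hypothesis) or power (true hypothesis), and the "journal" publishes every significant result and nothing else. The function names and the seed are arbitrary choices.

```python
import random


def simulate_literature(n_true, n_false, alpha=0.05, power=0.5, seed=1):
    """Simulate study outcomes and a journal that prints only significant ones.

    Returns (all_studies, published); each study is a tuple
    (hypothesis_is_true, result_is_significant).
    """
    rng = random.Random(seed)
    studies = []
    for is_true, n in ((True, n_true), (False, n_false)):
        p_sig = power if is_true else alpha
        for _ in range(n):
            studies.append((is_true, rng.random() < p_sig))
    published = [study for study in studies if study[1]]
    return studies, published


def success_rate(studies):
    """Share of significant results in a list of studies."""
    if not studies:
        raise ValueError("no studies to summarize")
    return sum(sig for _, sig in studies) / len(studies)


# Scenario B at a larger scale: every hypothesis is false.
all_studies, published = simulate_literature(n_true=0, n_false=10_000)
false_share = sum(not is_true for is_true, _ in published) / len(published)
print(f"success rate, all studies: {success_rate(all_studies):.3f}")
print(f"success rate, published:   {success_rate(published):.3f}")
print(f"false positives among published results: {false_share:.0%}")
```

Every published result in this run is significant and every one is a false positive, yet each carries the label p < .05. Only the success rate across all studies, which the reader never sees, reveals that nothing was found.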
Sterling (1959) found a success rate of over 90% in psychology journals. In 1995, he replicated this finding (Sterling et al., 1995). He points out that these published significant results provide no information about the credibility and replicability of the published results because (a) many more studies are conducted than reported and (b) replication studies that test whether a published result can be trusted are lacking. As a result, statistical results reported in psychology journals have no meaning because they provide no information about the risk that a significant result is a false positive result.
The reader’s expectations are that Ho will be rejected. What risks does he take in making a Type I error by rejecting Ho with the author? The author intended to indicate the probability of such a risk by stating a level of significance. On the other hand, the reader has to consider the selection that may have taken place among a set of similar experiments for which the one that obtained large differences by chance had the better opportunity to come under his scrutiny. The problem simply is that a Type I error (rejecting the null hypothesis when it is true) has a fair opportunity to end up in print when the correct decision is the acceptance of Ho for a particular set of experimental variables. Before the reader can make an intelligent decision he must have some information concerning the distribution of outcomes of similar experiments or at least the assurance that a similar experiment has never been performed. Since the latter information is unobtainable he is in a dilemma. One thing is clear however. The risk stated by the author cannot be accepted at its face value once the author’s conclusions appear in print. It may be safe to conclude that pursuing statistical analyses under the conditions outlined here may have considerable less merit than psychologists like to ascribe to statistics in experimental design. (p. 34)
Sixty years later, only a minority of psychologists seem to understand that journals filled with results that were selected based on p < alpha provide no statistical information, just like counting only correct answers on exams doesn’t tell us anything about students’ learning. While this is probably obvious to most readers who are teaching, it is less clear why it is so hard for them to understand that only reporting significant results doesn’t tell us anything about the quality of psychological science.
P-values provide useful information that can be compared to a significance criterion to make educated guesses about the proportion of false positive results in the long-run. In the absence of any other information, the long-run frequency is also the best guess for a randomly drawn single study from the set of all studies.
However, for p-values to be meaningful, it is important that all tests are published. If only significant results are published, p-values no longer provide any meaningful information about the risk of false positives. Thus, the questionable practice of publishing only significant results that confirm hypotheses has to end. As long as this practice persists, psychology is not a science, no matter whether psychologists call themselves scientists or not.
Sterling (1959) would not have been surprised by the massive replication failures, especially in social psychology, that led to a crisis of confidence in recent years. P-values selected for significance provide only illusory assurances that false positives are rare and power is high. It is time to wake up to the reality that much progress in psychology is built on a mountain of non-significant results that remain hidden. This is the time for true psychological scientists to be humble, to admit to mistakes, and to reform psychological science. And these reforms have to start with better and more rigorous teaching of statistics and mandatory reading of Sterling’s classic article that predicted the replication crisis.