# Statistics Wars: Don't change alpha. Change the null-hypothesis!

The statistics wars go back all the way to Fisher, Pearson, and Neyman-Pearson(Jr), and there is no end in sight. I have no illusion that I will be able to end these debates, but at least I can offer a fresh perspective. Lately, statisticians and empirical researchers like me who dabble in statistics have been debating whether p-values should be banned and if they are not banned outright whether they should be compared to a criterion value of .05 or .005 or be chosen on an individual basis. Others have advocated the use of Bayes-Factors.

However, most of these proposals have focused on the traditional approach to test the null-hypothesis that the effect size is zero. Cohen (1994) called this the nil-hypothesis to emphasize that this is only one of many ways to specify the hypothesis that is to be rejected in order to provide evidence for a hypothesis.

For example, a nil-hypothesis is that the difference in the average height of men and women is exactly zero). Many statisticians have pointed out that a precise null-hypothesis is often wrong a priori and that little information is provided by rejecting it. The only way to make nil-hypothesis testing meaningful is to think about the nil-hypothesis as a boundary value that distinguishes two opposing hypothesis. One hypothesis is that men are taller than women and the other is that women are taller than men. When data allow rejecting the nil-hypothesis, the direction of the mean difference in the sample makes it possible to reject one of the two directional hypotheses. That is, if the sample mean height of men is higher than the sample mean height of women, the hypothesis that women are taller than men can be rejected.

However, the use of the nil-hypothesis as a boundary value does not solve another problem of nil-hypothesis testing. Namely, specifying the null-hypothesis as a point value makes it impossible to find evidence for it. That is, we could never show that men and women have the same height or the same intelligence or the same life-satisfaction. The reason is that the population difference will always be different from zero, even if this difference is too small to be practically meaningful. A related problem is that rejecting the nil-hypothesis provides no information about effect sizes. A significant result can be obtained with a large effect size and with a small effect size.

In conclusion, nil-hypothesis testing has a number of problems, and many criticism of null-hypothesis testing are really criticism of nil-hypothesis testing. A simple solution to the problem of nil-hypothesis testing is to change the null-hypothesis by specifying a minimal effect size that makes a finding theoretically or practically useful. Although this effect size can vary from research question to research question, Cohen’s criteria for standardized effect sizes can give some guidance about reasonable values for a minimal effect size. Using the example of mean differences, Cohen considered an effect size of d = .2 small, but meaningful. So, it makes sense to set a criterion for a minimum effect size somewhere between 0 and .2, and d = .1 seems a reasonable value.

We can even apply this criterion retrospectively to published studies with some interesting implications for the interpretation of published results. Shifting the null-hypothesis from d = 0 to d < abs(.1), we are essentially raising the criterion value that a test statistic has to meet in order to be significant. Let me illustrate this first with a simple one-sample t-test with N = 100.

Conveniently, the sampling error for N = 100 is 1/sqrt(100) = .1. To achieve significance with alpha = .05 (two-tailed) and H0:d = 0, the test statistic has to be greater than t.crit = 1.98. However, if we change H0 to d > abs(.1), the t-distribution is now centered at the t-value that is expected for an effect size of d = .1. The criterion value to get significance is now t.crit = 3.01. Thus, some published results that were able to reject the nil-hypothesis would be non-significant when the null-hypothesis specifies a range of values between d = -.1 to .1.

If the null-hypothesis is specified in terms of standardized effect sizes, the critical values vary as a function of sample size. For example, with N = 10 the critical t-value is 2.67, with N = 100 it is 3.01, and with N = 1,000 it is 5.14. An alternative approach is to specify H0 in terms of a fixed test statistic which implies different effect sizes for the boundary value. For example, with t = 2.5, the effect sizes would be d = .06 with N = 10, d = .05 with N = 100, and d = .02 with N = 1000. This makes sense because researchers should use larger samples to test weaker effects. The example also shows that a t-value of 2.5 specifies a very narrow range of values around zero. However, the example was based on one-sample t-tests. For the typical comparison of two groups, a criterion value of 2.5 corresponds to an effect size of d = .1 with N = 100. So, while t = 2.5 is arbitrary, it is a meaningful value to test for statistical significance. With N = 100, t(98) = 2.5 corresponds to an alpha criterion of .014, which is a bit more stringent than .05, but not as strict as a criterion value of .005. With N = 100, alpha = .005 corresponds to a criterion value of t.crit = 2.87, which implies a boundary value of d = .17.

In conclusion, statistical significance depends on the specification of the null-hypothesis. While it is common to specify the null-hypothesis as an effect size of zero, this is neither necessary, nor ideal. An alternative approach is to (re)specify the null-hypothesis in terms of a minimum effect size that makes a finding theoretically interesting or practically important. If the population effect size is below this value, the results could also be used to show that a hypothesis is false. Examination of various effect sizes shows that criterion values in the range between 2 and 3 provide can be used to define reasonable boundary values that vary around a value of d = .1

The problem with t-distributions is that they differ as a function of the degrees of freedom. To create a common metric it is possible to convert t-values into p-values and then to convert the p-values into z-scores. A z-score of 2.5 corresponds to a p-value of .01 (exact .0124) and an effect size of d = .13 with N = 100 in a between-subject design. This seems to be a reasonable criterion value to evaluate statistical significance when the null-hypothesis is defined as a range of smallish values around zero and alpha is .05.

Shifting the significance criterion in this way can dramatically change the evaluation of published results, especially results that are just significant, p < .05 & p > .01. There have been concerns that many of these results have been obtained with questionable research practices that were used to reject the nil-hypothesis. However, these results would not be strong enough to reject the modified hypothesis that the population effect size exceeds a minimum value of theoretical or practical significance. Thus, no debates about the use of questionable research practices are needed. There is also no need to reduce the type-I error rate at the expense of increasing the type-II error rate. It can be simply noted that the evidence is insufficient to reject the hypothesis that the effect size is greater than zero but too small to be important. This would shift any debates towards discussion about effect sizes and proponents of theories would have to make clear which effect sizes they consider to be theoretically important. I believe that this would be more productive than quibbling over alpha levels.

To demonstrate the implications of redefining the null-hypothesis, I use the results of the replicability project (Open Science Collaboration, 2015). The first z-curve shows the traditional analysis for the nil-hypothesis and alpha = .05, which has z = 1.96 as the criterion value for statistical significance (red vertical line).

Figure 1 shows that 86 out of 90 studies reported a test-statistic that exceeded the criterion value of 1.96 for H0:d = 0, alpha = .05 (two-tailed). The other four studies met the criterion for marginal significance (alpha = .10, two-tailed or .05 one-tailed). The figure also shows that the distribution of observed z-scores is not consistent with sampling error. The steep drop at z = 1.96 is inconsistent with random sampling error. A comparison of the observed discovery rate (86/90, 96%) and the expected discovery rate 43% shows evidence that the published results are selected from a larger set of studies/tests with non-significant results. Even the upper limit of the confidence interval around this estimate (71%) is well below the observed discovery rate, showing evidence of publication bias. Z-curve estimates that only 60% of the published results would reproduce a significant result in an actual replication attempt. The actual success rate for these studies was 39%.

Results look different when the null-hypothesis is changed to correspond to a range of effect sizes around zero that correspond to a criterion value of z = 2.5. Along with shifting the significance criterion, z-curve is also only fitted to studies that produced z-scores greater than 2.5. As questionable research practices have a particularly strong effect on the distribution of just significant results, the new estimates are less influenced by these practices.

Figure 2 shows the results. Most important, the observed discovery rate dropped from 96% to 61%, indicating that many of the original results provided just enough evidence to reject the nil-hypothesis, but not enough evidence to rule out even small effect sizes. The observed discovery rate is also more in line with the expected discovery rate. Thus, some of the missing non-significant results may have been published as just significant results. This is also implied by the greater frequency of results with z-scores between 2 and 2.5 than the model predicts (grey curve). However, the expected replication rate of 63% is still much higher than the actual replication rate with a criterion value of 2.5 (33%). Thus, other factors may contribute to the low success rate in the actual replication studies of the replicability project.

Conclusion

In conclusion, statisticians have been arguing about p-values, significance levels, and Bayes-Factors. Proponents of Bayes-Factors have argued that their approach is supreme because Bayes-Factors can provide evidence for the null-hypothesis. I argue that this is wrong because it is theoretically impossible to demonstrate that a population effect size is exactly zero or any other specific value. A better solution is to specify the null-hypothesis as a range of values that are too small to be meaningful. This makes it theoretically possible to demonstrate that a population effect size is above or below the boundary value. This approach can also be applied retrospectively to published studies. I illustrate this by defining the null-hypothesis as the region of effect sizes that is defined by the effect size that corresponds to a z-score of 2.5. While a z-score of 2.5 corresponds to p = .01 (two-tailed) for the nil-hypothesis, I use this criterion value to maintain an error rate of 5% and to change the null-hypothesis to a range of values around zero that becomes smaller as sample sizes increase.

As p-hacking is often used to just reject the nil-hypothesis, changing the null-hypothesis to a range of values around zero makes many ‘significant’ results non-significant. That is, the evidence is too weak to exclude even trivial effect sizes. This does not mean that the hypothesis is wrong or that original authors did p-hack their data. However, it does mean that they can no longer point to their original results as empirical evidence. Rather they have to conduct new studies to demonstrate with larger samples that they can reject the new null-hypothesis that the predicted effect meets some minimal standard of practical or theoretical significance. With a clear criterion value for significance, authors also risk to obtain evidence that positively contradicts their predictions. Thus, the biggest improvement that arises form rethinking null-hypothesis testing is that authors have to specify effect sizes a priori and that that studies can provide evidence for and against a zero. Thus, changing the nil-hypothesis to a null-hypothesis with a non-null value makes it possible to provide evidence for or against a theory. In contrast, computing Bayes-Factors in favor of the nil-hypothesis fails to achieve this goal because the nil-hypothesis is always wrong, the real question is only how wrong.

# Tukey 1991 explains Null-Hypothesis Testing in 8 Paragraphs

1. We need to distinguish regions of effect sizes and precise values. The value 0 is a precise value. All positive values or all negative values are regions of values.

2. The most common use of null-hypothesis testing is to test whether the point-null or nil-hypothesis (Cohen, 1994) is consistent with the data.

3. Tukey explains that this hypothesis is likely to be false all the time. “All we know about the world teaches us that the effect of A and B are always different”. Many critics of NHST have suggested that this makes it useless to test the nil-hypothesis because we already know that it is false (the prior probability of H0 being true is 0, no data can change this).

4. NHST becomes useful when we think about the null-hypothesis (no difference) as the boundary value that distinguishes two regions. We are really testing the direction of the mean difference (or the sign of of a correlation coefficient). Once we can reject the nil-hypothesis (p < alpha) in a two-sided test, we are allowed to interpret the direction of the mean difference in a sample as the mean difference in the population (i.e., if we had studied all people from which the sample was drawn).

5. Some psychologists have criticized NHST because it can never provide evidence for the nil-hypothesis (Rouder, Wagenmakers). This criticism is based on a misunderstanding of NHST. Tukey explains we should never accept the nil-hypothesis because we can never provide empirical support FOR a precise effect size.

6. Once we have evidence that the nil-hypothesis is false and the effect is either positive or negative, we may ask follow-up questions about the size of an effect.

7. A good way to answer these questions is to conduct NHST with confidence intervals. If the confidence interval includes 0, we cannot draw inferences about the direction of the effect. However, if the confidence interval does not include 0, we can make inferences about the direction of an effect and the boundaries of the intervals provide information about plausible values for the smallest and the largest possible effect size.

8. In conclusion, we can think about two-sided tests as an efficient way of conducting two one-sided tests without inflating the type-I error probability. Rejecting the hypothesis that there is no effect is not interesting. Determining the direction of an effect is and NHST is a useful tool to do so.

9. I probably made things worse by paraphrasing Tukey. Therefore I also posted the relevant section of his article below. # What would Cohen say? A comment on p < .005

Most psychologists are trained in Fisherian statistics, which has become known as Null-Hypothesis Significance Testing (NHST).  NHST compares an observed effect size against a hypothetical effect size. The hypothetical effect size is typically zero; that is, the hypothesis is that there is no effect.  The deviation of the observed effect size from zero relative to the amount of sampling error provides a test statistic (test statistic = effect size / sampling error).  The test statistic can then be compared to a criterion value. The criterion value is typically chosen so that only 5% of test statistics would exceed the criterion value by chance alone.  If the test statistic exceeds this value, the null-hypothesis is rejected in favor of the inference that an effect greater than zero was present.

One major problem of NHST is that non-significant results are not considered.  To address this limitation, Neyman and Pearson extended Fisherian statistic and introduced the concepts of type-I (alpha) and type-II (beta) errors.  A type-I error occurs when researchers falsely reject a true null-hypothesis; that is, they infer from a significant result that an effect was present, when there is actually no effect.  The type-I error rate is fixed by the criterion for significance, which is typically p < .05.  This means, that a set of studies cannot produce more than 5% false-positive results.  The maximum of 5% false positive results would only be observed if all studies have no effect. In this case, we would expect 5% significant results and 95% non-significant results.

The important contribution by Neyman and Pearson was to consider the complementary type-II error.  A type-II error occurs when an effect is present, but a study produces a non-significant result.  In this case, researchers fail to detect a true effect.  The type-II error rate depends on the size of the effect and the amount of sampling error.  If effect sizes are small and sampling error is large, test statistics will often be too small to exceed the criterion value.

Neyman-Pearson statistics was popularized in psychology by Jacob Cohen.  In 1962, Cohen examined effect sizes and sample sizes (as a proxy for sampling error) in the Journal of Abnormal and Social Psychology and concluded that there is a high risk of type-II errors because sample sizes are too small to detect even moderate effect sizes and inadequate to detect small effect sizes.  Over the next decades, methodologists have repeatedly pointed out that psychologists often conduct studies with a high risk to fail; that is, to provide empirical evidence for real effects (Sedlemeier & Gigerenzer, 1989).

The concern about type-II errors has been largely ignored by empirical psychologists.  One possible reason is that journals had no problem filling volumes with significant results, while rejecting 80% of submissions that also presented significant results.  Apparently, type-II errors were much less common than methodologists feared.

However, in 2011 it became apparent that the high success rate in journals was illusory. Published results were not representative of studies that were conducted. Instead, researchers used questionable research practices or simply did not report studies with non-significant results.  In other words, the type-II error rate was as high as methodologists suspected, but selection of significant results created the impression that nearly all studies were successful in producing significant results.  The influential “False Positive Psychology” article suggested that it is very easy to produce significant results without an actual effect.  This led to the fear that many published results in psychology may be false positive results.

Doubt about the replicability and credibility of published results has led to numerous recommendations for the improvement of psychological science.  One of the most obvious recommendations is to ensure that published results are representative of the studies that are actually being conducted.  Given the high type-II error rates, this would mean that journals would be filled with many non-significant and inconclusive results.  This is not a very attractive solution because it is not clear what the scientific community can learn from an inconclusive result.  A better solution would be to increase the statistical power of studies. Statistical power is simply the inverse of a type-II error (power = 1 – beta).  As power increases, studies with a true effect have a higher chance of producing a true positive result (e.g., a drug is an effective treatment for a disease). Numerous articles have suggested that researchers should increase power to increase replicability and credibility of published results (e.g., Schimmack, 2012).

In a recent article, a team of 72 authors proposed another solution. They recommended that psychologists should reduce the probability of a type-I error from 5% (1 out of 20 studies) to 0.5% (1 out of 200 studies).  This recommendation is based on the belief that the replication crisis in psychology reflects a large number of type-I errors.  By reducing the alpha criterion, the rate of type-I errors will be reduced from a maximum of 10 out of 200 studies to 1 out of 200 studies.

I believe that this recommendation is misguided because it ignores the consequences of a more stringent significance criterion on type-II errors.  Keeping resources and sampling error constant, reducing the type-I error rate increases the type-II error rate. This is undesirable because the actual type-II error is already large.

For example, a between-subject comparison of two means with a standardized effect size of d = .4 and a sample size of N = 100 (n = 50 per cell) has a 50% risk of a type-II error.  The risk of a type-II error rises to 80%, if alpha is reduced to .005.  It makes no sense to conduct a study with an 80% chance of failure (Tversky & Kahneman, 1971).  Thus, the call for a lower alpha implies that researchers will have to invest more resources to discover true positive results.  Many researchers may simply lack the resources to meet this stringent significance criterion.

My suggestion is exactly opposite to the recommendation of a more stringent criterion.  The main problem for selection bias in journals is that even the existing criterion of p < .05 is too stringent and leads to a high percentage of type-II errors that cannot be published.  This has produced the replication crisis with large file-drawers of studies with p-values greater than .05,  the use of questionable research practices, and publications of inflated effect sizes that cannot be replicated.

To avoid this problem, researchers should use a significance criterion that balances the risk of a type-I and type-II error.  For example, in a between-subject design with an expected effect size of d = .4 and N = 100, researchers should use p < .20 for significance, which reduces the risk of a type -II error to 20%.  In this case, type-I and type-II error are balanced.  If the study produces a p-value of, say, .15, researchers can publish the result with the conclusion that the study provided evidence for the effect. At the same time, readers are warned that they should not interpret this result as strong evidence for the effect because there is a 20% probability of a type-I error.

Given this positive result, researchers can then follow up their initial study with a larger replication study that allows for a stricter type-I error control, while holding power constant.   With d = 4, they now need N = 200 participants to have 80% power and alpha = .05.  Even if the second study does not produce a significant result (the probability that two studies with 80% power are significant is only 64%, Schimmack, 2012), researchers can combine the results of both studies and with N = 300, the combined studies have 80% power with alpha = .01.

The advantage of starting with smaller studies with a higher alpha criterion is that researchers are able to test risky hypothesis with a smaller amount of resources.  In the example, the first study used “only” 100 participants.  In contrast, the proposal to require p < .005 as evidence for an original, risky study implies that researchers need to invest a lot of resources in a risky study that may provide inconclusive results if it fails to produce a significant result.  A power analysis shows that a sample size of N = 338 participants is needed to have 80% power for an effect size of d = .4 and p < .005 as criterion for significance.

Rather than investing 300 participants into a risky study that may produce a non-significant and uninteresting result (eating green jelly beans does not cure cancer), researchers may be better able and willing to start with 100 participants and to follow up an encouraging result with a larger follow-up study.  The evidential value that arises from one study with 300 participants or two studies with 100 and 200 participants is the same, but requiring p < .005 from the start discourages risky studies and puts even more pressure on researchers to produce significant results if all of their resources are used for a single study.  In contrast, lowering alpha reduces the need for questionable research practices and reduces the risk of type-II errors.

In conclusion, it is time to learn Neyman-Pearson statistic and to remember Cohen’s important contribution that many studies in psychology are underpowered.  Low power produces inconclusive results that are not worthwhile publishing.  A study with low power is like a high-jumper that puts the bar too high and fails every time. We learned nothing about the jumpers’ ability. Scientists may learn from high-jump contests where jumpers start with lower and realistic heights and then raise the bar when they succeeded.  In the same manner, researchers should conduct pilot studies or risky exploratory studies with small samples and a high type-I error probability and lower the alpha criterion gradually if the results are encouraging, while maintaining a reasonably low type-II error.

Evidently, a significant result with alpha = .20 does not provide conclusive evidence for an effect.  However, the arbitrary p < .005 criterion also fails short of demonstrating conclusively that an effect exists.  Journals publish thousands of results a year and some of these results may be false positives, even if the error rate is set at 1 out of 200. Thus, p < .005 is neither defensible as a criterion for a first exploratory study, nor conclusive evidence for an effect.  A better criterion for conclusive evidence is that an effect can be replicated across different laboratories and a type-I error probability of less than 1 out of a billion (6 sigma).  This is by no means an unrealistic target.  To achieve this criterion with an effect size of d = .4, a sample size of N = 1,000 is needed.  The combined evidence of 5 labs with N = 200 per lab would be sufficient to produce conclusive evidence for an effect, but only if there is no selection bias.  Thus, the best way to increase the credibility of psychological science is to conduct studies with high power and to minimize selection bias.

This is what I believe Cohen would have said, but even if I am wrong about this, I think it follows from his futile efforts to teach psychologists about type-II errors and statistical power.

# Are Most Published Results in Psychology False? An Empirical Study

Why Most Published Research Findings  are False by John P. A. Ioannidis

In 2005, John P. A. Ioannidis wrote an influential article with the title “Why Most Published Research Findings are False.” The article starts with the observation that “there is increasing concern that most current published research findings are false” (e124). Later on, however, the concern becomes a fact. “It can be proven that most claimed research findings are false” (e124). It is not surprising that an article that claims to have proof for such a stunning claim has received a lot of attention (2,199 citations and 399 citations in 2016 alone in Web of Science).

Most citing articles focus on the possibility that many or even more than half of all published results could be false. Few articles cite Ioannidis to make the factual statement that most published results are false, and there appears to be no critical examination of Ioannidis’s simulations that he used to support his claim.

This blog post shows that these simulations make questionable assumptions and shows with empirical data that Ioannidis’s simulations are inconsistent with actual data.

Critical Examination of Ioannidis’s Simulations

First, it is important to define what a false finding is. In many sciences, a finding is published when a statistical test produced a significant result (p < .05). For example, a drug trial may show a significant difference between a drug and a placebo control condition with a p-value of .02. This finding is then interpreted as evidence for the effectiveness of the drug.

How could this published finding be false? The logic of significance testing makes this clear. The only inference that is being made is that the population effect size (i.e., the effect size that could be obtained if the same experiment were repeated with an infinite number of participants) is different from zero and in the same direction as the one observed in the study. Thus, the claim that most significant results are false implies that in more than 50% of all published significant results the null-hypothesis was true. That is, a false positive result was reported.

Ioannidis then introduces the positive predictive value (PPV). The positive predictive value is the proportion of positive results (p < .05) that are true positives.

(1) PPV = TP/(TP + FP)

PTP = True Positive Results, FP = False Positive Results

The proportion of true positive results (TP) depends on the percentage of true hypothesis (PTH) and the probability of producing a significant result when a hypothesis is true. This probability is known as statistical power. Statistical power is typically defined as 1 minus the type-II error (beta).

(2) TP = PTH * Power = PTH * (1 – beta)

The probability of a false positive result depends on the proportion of false hypotheses (PFH) and the criterion for significance (alpha).

(3) FP = PFH * alpha

This means that the actual proportion of true significant results is a function of the ratio of true and false hypotheses (PTH:PFH), power, and alpha.

(4) PPV = (PTH*power) / ((PTH*power) + (PFH * alpha))

Ioannidis translates his claim that most published findings are false into a PPV below 50%. This would mean that the null-hypothesis is true in more than 50% of published results that falsely rejected it.

(5) (PTH*power) / ((PTH*power) + (PFH * alpha))  < .50

Equation (5) can be simplied to the inequality equation

(6) alpha > PTH/PFH * power

We can rearrange formula (6) and substitute PFH with (1-PHT) to determine the maximum proportion of true hypotheses to produce over 50% false positive results.

(7a)  =  alpha = PTH/(1-PTH) * power

(7b) = alpha*(1-PTH) = PTH * power

(7c) = alpha – PTH*alpha = PTH * power

(7d) =  alpha = PTH*alpha + PTH*power

(7e) = alpha = PTH(alpha + power)

(7f) =  alpha/(power + alpha) = PTH

Table 1 shows the results.

Power                  PTH / PFH
90%                       5  / 95
80%                       6  / 94
70%                       7  / 93
60%                       8  / 92
50%                       9  / 91
40%                      11 / 89
30%                       14 / 86
20%                      20 / 80
10%                       33 / 67

Even if researchers would conduct studies with only 20% power to discover true positive results, we would only obtain more than 50% false positive results if only 20% of hypothesis were true. This makes it rather implausible that most published results could be false.

To justify his bold claim, Ioannidis introduces the notion of bias. Bias can be introduced due to various questionable research practices that help researchers to report significant results. The main effect of these practices is that the probability of a false positive result to become significant increases.

Simmons et al. (2011) showed that massive use several questionable research practices (p-hacking) can increase the risk of a false positive result from the nominal 5% to 60%. If we assume that bias is rampant and substitute the nominal alpha of 5% with an assumed alpha of 50%, fewer false hypotheses are needed to produce more false than true positives (Table 2).

Power                 PTH/PFH
90%                     40 / 60
80%                     43 / 57
70%                     46 / 54
60%                     50 / 50
50%                     55 / 45
40%                     60 / 40
30%                     67 / 33
20%                     75 / 25
10%                      86 / 14

If we assume that bias inflates the risk of type-I errors from 5% to 60%, it is no longer implausible that most research findings are false. In fact, more than 50% of published results would be false if researchers tested hypothesis with 50% power and 50% of tested hypothesis are false.

However, the calculations in Table 2 ignore the fact that questionable research practices that inflate false positives also decrease the rate of false negatives. For example, a researcher who continues testing until a significant result is obtained, increases the chances of obtaining a significant result no matter whether the hypothesis is true or false.

Ioannidis recognizes this, but he assumes that bias has the same effect for true hypothesis and false hypothesis. This assumption is questionable because it is easier to produce a significant result if an effect exists than if no effect exists. Ioannidis’s assumption implies that bias increases the proportion of false positive results a lot more than the proportion of true positive results.

For example, if power is 50%, only 50% of true hypothesis produce a significant result. However, with a bias factor of .4, another 40% of the false negative results will become significant, adding another .4*.5 = 20% true positive results to the number of true positive results. This gives a total of 70% positive results, which is a 40% increase over the number of positive results that would have been obtained without bias. However, this increase in true positive results pales in comparison to the effect that 40% bias has on the rate of false positives. As there are 95% true negatives, 40% bias produces another .95*.40 = 38% of false positive results. So instead of 5% false positive results, bias increases the percentage of false positive results from 5% to 43%, an increase by 760%. Thus, the effect of bias on the PPV is not equal. A 40% increase of false positives has a much stronger impact on the PPV than a 40% increase of true positives. Ioannidis provides no rational for this bias model.

A bigger concern is that Ioannidis makes sweeping claims about the proportion of false published findings based on untested assumptions about the proportion of null-effects, statistical power, and the amount of bias due to questionable research practices.
For example, he suggests that 4 out of 5 discoveries in adequately powered (80% power) exploratory epidemiological studies are false positives (PPV = .20). To arrive at this estimate, he assumes that only 1 out of 11 hypotheses is true and that for every 1000 studies, bias adds only 1000* .30*.10*.20 = 6 true positives results compared to 1000* .30*.90*.95 = 265 false positive results (i.e., 44:1 ratio). The assumed bias turns a PPV of 62% without bias into a PPV of 20% with bias. These untested assumptions are used to support the claim that “simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.” (e124).

Many of these assumptions can be challenged. For example, statisticians have pointed out that the null-hypothesis is unlikely to be true in most studies (Cohen, 1994). This does not mean that all published results are true, but Ioannidis’ claims rest on the opposite assumption that most hypothesis are a priori false. This makes little sense when the a priori hypothesis is specified as a null-effect and even a small effect size is sufficient for a hypothesis to be correct.

Ioannidis also ignores attempts to estimate the typical power of studies (Cohen, 1962). At least in psychology, the typical power is estimated to be around 50%. As shown in Table 2, even massive bias would still produce more true than false positive results, if the null-hypothesis is false in no more than 50% of all statistical tests.

In conclusion, Ioannidis’s claim that most published results are false depends heavily on untested assumptions and cannot be considered a factual assessment of the actual number of false results in published journals.

Testing Ioannidis’s Simulations

10 years after the publication of “Why Most Published Research Findings Are False,”  it is possible to put Ioannidis’s simulations to an empirical test. Powergraphs (Schimmack, 2015) can be used to estimate the average replicability of published test results. For this purpose, each test statistic is converted into a z-value. A powergraph is foremost a histogram of z-values. The distribution of z-values provides information about the average statistical power of published results because studies with higher power produce higher z-values.

Figure 1 illustrates the distribution of z-values that is expected for Ioanndis’s model for “adequately powered exploratory epidemiological study” (Simulation 6 in Figure 4). Ioannidis assumes that for every true positive, there are 10 false positives (R = 1:10). He also assumed that studies have 80% power to detect a true positive. In addition, he assumed 30% bias. A 30% bias implies that for every 100 false hypotheses, there would be 33 (100*[.30*.95+.05]) rather than 5 false positive results (.95*.30+.05)/.95). The effect on false negatives is much smaller (100*[.30*.20 + .80]). Bias was modeled by increasing the number of attempts to produce a significant result so that proportion of true and false hypothesis matched the predicted proportions. Given an assumed 1:10 ratio of true to false hypothesis, the ratio is 335 false hypotheses to 86 true hypotheses. The simulation assumed that researchers tested 100,000 false hypotheses and observed 35000 false positive results and that they tested 10,000 true hypotheses and observed 8,600 true positive results. Bias was simulated by increasing the number of tests to produce the predicted ratio of true and false positive results.

Figure 1 only shows significant results because only significant results would be reported as positive results. Figure 1 shows that a high proportion of z-values are in the range between 1.95 (p = .05) and 3 (p = .001). Powergraphs use z-curve (Schimmack & Brunner, 2016) to estimate the probability that an exact replication study would replicate a significant result. In this simulation, this probability is a mixture of false positives and studies with 80% power. The true average probability is 20%. The z-curve estimate is 21%. Z-curve can also estimate the replicability for other sets of studies. The figure on the right shows replicability for studies that produced an observed z-score greater than 3 (p < .001). The estimate shows an average replicability of 59%. Thus, researchers can increase the chance of replicating published findings by adjusting the criterion value and ignoring significant results with p-values greater than p = .001, even if they were reported as significant with p < .05.

Figure 2 shows the distribution of z-values for Ioannidis’s example of a research program that produces more true than false positives, PPV = .85 (Simulation 1 in Table 4). Visual inspection of Figure 1 and Figure 2 is sufficient to show that a robust research program produces a dramatically different distribution of z-values. The distribution of z-values in Figure 2 and a replicability estimate of 67% are impossible if most of the published significant results were false.  The maximum value that could be obtained is obtained with a PPV of 50% and 100% power for the true positive results, which yields a replicability estimate of .05*.50 + 1*.50 = 55%. As power is much lower than 100%, the real maximum value is below 50%.

The powergraph on the right shows the replicability estimate for tests that produced a z-value greater than 3 (p < .001). As only a small proportion of false positives are included in this set, z-curve correctly estimates the average power of these studies as 80%. These examples demonstrate that it is possible to test Ioannidis’s claim that most published (significant) results are false empirically. The distribution of test results provides relevant information about the proportion of false positives and power. If actual data are more similar to the distribution in Figure 1, it is possible that most published results are false positives, although it is impossible to distinguish false positives from false negatives with extremely low power. In contrast, if data look more like those in Figure 2, the evidence would contradict Ioannidis’s bold and unsupported claim that most published results are false.

The maximum replicabiltiy that could be obtained with 50% false-positives would require that the true positive studies have 100% power. In this case, replicability would be .50*.05 + .50*1 = 52.5%.  However, 100% power is unrealistic. Figure 3 shows the distribution for a scenario with 90% power and 100% bias and an equal percentage of true and false hypotheses. The true replicabilty for this scenario is .05*.50 + .90 * .50 = 47.5%. z-curve slightly overestimates replicabilty and produced an estimate of 51%.  Even 90% power is unlikely in a real set of data. Thus, replicability estimates above 50% are inconsistent with Ioannidis’s hypothesis that most published positive results are false.  Moreover, the distribution of z-values greater than 3 is also informative. If positive results are a mixture of many false positive results and true positive results with high power, the replicabilty estimate for z-values greater than 3 should be high. In contrast, if this estimate is not much higher than the estimate for all z-values, it suggest that there is a high proportion of studies that produced true positive results with low power. Empirical Evidence

I have produced powergraphs and replicability estimates for over 100 psychology journals (2015 Replicabilty Rankings). Not a single journal produced a replicability estimate below 50%. Below are a few selected examples.

The Journal of Experimental Psychology: Learning, Memory and Cognition publishes results from cognitive psychology. In 2015, a replication project (OSC, 2015) demonstrated that 50% of significant results produced a significant result in a replication study. It is unlikely that all non-significant results were false positives. Thus, the results show that Ioannidis’s claim that most published results are false does not apply to results published in this journal. The powergraphs further support this conclusion. The graphs look a lot more like Figure 2 than Figure 1 and the replicability estimate is even higher than the one expected from Ioannidis’s simulation with a PPV of 85%.

Another journal that was subjected to replication attempts was Psychological Science. The success rate for Psychological Science was below 50%. However, it is important to keep in mind that a non-significant result in a replication study does not prove that the original result was a false positive. Thus, the PPV could still be greater than 50%. The powergraph for Psychological Science shows more z-values in the range between 2 and 3 (p > .001). Nevertheless, the replicability estimate is comparable to the one in Figure 2 which simulated a high PPV of 85%. Closer inspection of the results published in this journal would be required to determine whether a PPV below .50 is plausible.

The third journal that was subjected to a replication attempt was the Journal of Personality and Social Psychology. The journal has three sections, but I focus on the Attitude and Social Cognition section because many replication studies were from this section. The success rate of replication studies was only 25%. However, there is controversy about the reason for this high number of failed replications and once more it is not clear what percentage of failed replications were due to false positive results in the original studies. One problem with the journal rankings is that they are based on automated extraction of all test results. Ioannidis might argue that his claim focused only on test results that tested an original, novel, or an important finding, whereas articles also often report significance tests for other effects. For example, an intervention study may show a strong decrease in depression, when only the interaction with treatment is theoretically relevant.

I am currently working on powergraphs that are limited to theoretically important statistical tests. These results may show lower replicability estimates. Thus, it remains to be seen how consistent Ioannidis’s predictions are for tests of novel and original hypotheses. Powergraphs provide a valuable tool to address this important question.

Moreover, powergraphs can be used to examine whether science is improving. So far, powergraphs of psychology journals have shown no systematic improvement in response to concerns about high false positive rates in published journals. The powergraphs for 2016 will be published soon. Stay tuned.

# The Association for Psychological Science Improves Success Rate from 95% to 100% by Dropping Hypothesis Testing: The Sample Mean is the Sample Mean, Type-I Error 0%

The editor of Psychological Science published an Editorial with the title “Business Not as Usual.” (see also Observer interview and new Submission Guidelines) The new submission guidelines recommend the following statistical approach.

Effective January 2014, Psychological Science recommends the use of the “new statistics”—effect sizes, confidence intervals, and meta-analysis—to avoid problems associated with null-hypothesis significance testing (NHST). Authors are encouraged to consult this Psychological Science tutorial by Geoff Cumming, which shows why estimation and meta-analysis are more informative than NHST and how they foster development of a cumulative, quantitative discipline. Cumming has also prepared a video workshop on the new statistics that can be found here.

The editorial is a response to the current crisis in psychology that many findings cannot be replicated and the discovery that numerous articles in Psychological Science show clear evidence of reporting biases that lead to inflated false-positive rates and effect sizes (Francis, 2013).

The editorial is titled “Business not as usual.”  So what is the radical response that will ensure increased replicability of results published in Psychological Science? One solution is to increase transparency and openness to discourage the use of deceptive research practices (e.g., not publishing undesirable results or selective reporting of dependent variables that showed desirable results). The other solution is to abandon null-hypothesis significance testing.

Problem of the Old Statistics: Researchers had to demonstrate that their empirical results could have occurred only with a 5% probability if there is no effect in the population.

Null-hypothesis testing has been the main method to relate theories to empirical data. An article typically first states a theory and then derives a theoretical prediction from the theory. The theoretical prediction is then used to design a study that can be used to test the theoretical prediction. The prediction is tested by computing the ratio of the effect size and sampling error (signal-to-noise) ratio. The next step is to determine the probability of obtaining the observed signal-to-noise ratio or an even more extreme one under the assumption that the true effect size is zero. If this probability is smaller than a criterion value, typically p < .05, the results are interpreted as evidence that the theoretical prediction is true. If the probability does not meet the criterion, the data are considered inconclusive.

However, non-significant results are irrelevant because Psychological Science is only interested in publishing research that supports innovative novel findings. Nobody wants to know that drinking fennel tea does not cure cancer, but everybody wants to know about a treatment that actually cures cancer. So, the main objective of statistical analyses was to provide empirical evidence for a predicted effect by demonstrating that an obtained result would occur only with a 5% probability if the hypothesis were false.

Solution to the problem of Significance Testing: Drop the Significance Criterion. Just report your sample mean and the 95% confidence interval around it. Eich claims that “researchers have recognized,…, essential problems with NHST in general, and with dichotomous thinking (“significant” vs. “non-significant” ) thinking it engenders in particular. It is true that statisticians have been arguing about the best way to test theoretical predictions with empirical data. In fact, they are still arguing. Thus, it is interesting to examine how Psychological Science found a solution to the elusive problem of statistical inference. The answer is to avoid statistical inferences altogether and to avoid dichotomous thinking. Does fennel tea cure cancer? Maybe, 95%CI d = -.4 to d = +4. No need to test for statistical significance. No need to worry about inadequate sample sizes. Just do a study and report your sample means with a confidence interval. It is that easy to fix the problems of psychological science.

The problem is that every study produces a sample mean and a confidence interval. So, how do the editors of Psychological Science pick the 5% of submitted manuscripts that will be accepted for publication? Eich lists three criteria.

1. What will the reader of this article learn about psychology that he or she did not know (or could not have known) before?

The effect of manipulation X on dependent variable Y is d = .2, 95%CI = -.2 to .6. We can conclude from this result that it is unlikely that the manipulation leads to a moderate decrease or a strong increase in the dependent variable Y.

1. Why is that knowledge important for the field?

The finding that the experimental manipulation of Y in the laboratory is somewhat more likely to produce an increase than a decrease, but could also have no effect at all has important implications for public policy.

1. How are the claims made in the article justified by the methods used?

The claims made in this article are supported by the use of Cumming’s New Statistics. Based on a precision analysis, the sample size was N = 100 (n = 50 per condition) to achieve a precision of .4 standard deviations. The study was preregistered and the data are publicly available with the code to analyze the data (SPPS t-test groups x (1,2) / var y.).

If this sounds wrong to you and you are a member of APS, you may want to write to Erich Eich and ask for some better guidelines that can be used to evaluate whether a sample mean or two or three or four sample means should be published in your top journal.