Statistics is a mess. Statistics education is a mess. Not surprisingly, the understanding of statistics among applied research workers is a mess. This was less of a problem when there was only one way to conduct statistical analyses. Nobody knew what they were doing, but at least everybody was doing the same thing. Now we have a multiverse of statistical approaches, and applied research workers are happy to mix and match statistics to fit their needs. This makes the reporting of results worse and leads to logical contradictions.

For example, the authors of an article in a recent issue of Psychological Science that shall remain anonymous claimed (a) that a Bayesian hypothesis test provided evidence for the nil-hypothesis (an effect size of zero) and (b) that their preregistered replication study had high statistical power. This makes no sense, because power is defined as the probability of correctly rejecting the null-hypothesis, which assumes an a priori effect size greater than zero. Power is simply not defined when the hypothesis is that the population effect size is zero.

Errors like this are avoidable if we realize that Neyman introduced confidence intervals to make hypothesis testing easier and more informative. Here is a brief introduction to thinking clearly about hypothesis testing that should help applied research workers understand what they are doing.

**Effect Size and Sampling Error**

The most important pieces of information that applied research workers should report are (a) an estimate of the effect size and (b) an estimate of sampling error. Every statistics course should start by introducing these two concepts, because all other statistics, such as p-values, Bayes factors, and confidence intervals, are based on effect size and sampling error. They are also the most useful information for meta-analysis.

Information about effect sizes and sampling error can be reported in unstandardized form (e.g., a 5 cm difference in height with SE = 2 cm) or in standardized form (d = .5, SE = .2). The choice is not relevant for hypothesis testing, and I will use standardized effect sizes in my example.
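As a minimal sketch of how the standardized values relate to raw summary statistics (the function name and the large-sample SE approximation are my own illustration, not from the post), Cohen's d for two independent groups and an approximate standard error can be computed like this:

```python
import math

def cohens_d_and_se(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference (Cohen's d) for two independent
    groups, with a common large-sample approximation of its SE."""
    # Pooled standard deviation
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    # Large-sample approximation for the standard error of d
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, se

# A 5 cm height difference with SD = 10 cm and 50 people per group
# yields d = .5 with an SE close to .2, the values used in this post.
d, se = cohens_d_and_se(175, 10, 50, 170, 10, 50)
```

The exact SE formula depends on the design; this approximation is only meant to show that both numbers come directly from the sample summary statistics.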

**Specifying the Null-Hypothesis**

The null-hypothesis is the hypothesis that a researcher believes to be untrue. It is the hypothesis that they want to reject or NULLify. The biggest mistake in statistics is the assumption that this hypothesis is always that there is no effect (effect size of zero). Cohen (1994) called this hypothesis the nil-hypothesis to distinguish it from other null-hypotheses.

For example, in a directional test of the prediction that studying harder leads to higher grades, the null-hypothesis specifies all non-positive values (zero and all negative values). When this null-hypothesis is rejected, it automatically implies that the alternative hypothesis is true (given a specific error criterion and a number of assumptions, not in a mathematically proven sense). Normally, we go through various steps to reject the null-hypothesis and then affirm the alternative. However, with confidence intervals we can directly affirm the alternative.

**Calculating Confidence Intervals**

A confidence interval requires three pieces of information.

1. An ESTIMATE of the effect size. This estimate is provided by the mean difference in a sample. For example, the height difference of 5 cm or d = .5 are estimates of the population mean difference in height.

2. An ESTIMATE of sampling error. In simple designs, sampling error is a function of sample size, but even then we are making assumptions that can be violated and that are difficult or impossible to test in small samples. In more complex designs, sampling error depends on other statistics that are sample dependent. Thus, sampling error is also just an estimate. The main job of statisticians is to find plausible estimates of sampling error for applied research workers. Applied researchers simply use the information that is provided by statistics programs. In our example, sampling error was estimated to be SE = .2.

3. A decision about how confident we want to be in our inferences. All data-based inferences are inductions that can be wrong, but we can specify the probability of being wrong. This quantity is known as the type-I error rate and is denoted by the Greek letter alpha. A common value is alpha = .05, which implies a long-run error rate of no more than 5%. If we compute 100 confidence intervals, we expect no more than 5 of them to produce a false inference in favor of the alternative hypothesis. With alpha = .05, sampling error has to be multiplied by approximately 2 (more precisely, 1.96) to compute a 95% confidence interval.

To use our example, with d = .5 and SE = .2, we can create a confidence interval that ranges from d = .5 - .2*2 = .1 to d = .5 + .2*2 = .9. WITHOUT ANY OTHER INFORMATION that may be relevant (e.g., we already know the alternative is true based on a much larger trustworthy prior study and our study is only a classroom demonstration), we can now state that the data support our hypothesis that there is a positive effect, because the confidence interval fits into the predicted interval; that is, the values from .1 to .9 fit into the set of values from 0 to infinity.
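The arithmetic above can be sketched in a few lines. Note that the post rounds the 95% multiplier to 2; the exact normal quantile is about 1.96 (the function name here is illustrative):

```python
from statistics import NormalDist

def confidence_interval(d, se, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for an effect size
    estimate d with standard error se (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    return d - z * se, d + z * se

lo, hi = confidence_interval(0.5, 0.2)
# With the rounded multiplier of 2 the interval is [.1, .9];
# the exact quantile gives approximately [.108, .892].
```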

A more common way to express this finding is to state that the confidence interval does not include the largest value of the null-hypothesis, which is zero. However, this gives the impression that we tested the nil-hypothesis and rejected it. That is not true. We also rejected all the values less than 0. Thus, we did not test or reject the nil-hypothesis; we tested and rejected the null-hypothesis of effect sizes ranging from minus infinity to 0. But it is also not necessary to state that we rejected this null-hypothesis, because this statement is redundant with the statement we actually want to make: we found evidence for our hypothesis that the effect size is positive (i.e., in the range from 0 to infinity, excluding 0).

I hope this example makes it clear how hypothesis testing with confidence intervals works. We first specify a range of values that we think are plausible (e.g., all positive values). We then compute a confidence interval of values that are consistent with our data. We then examine whether the confidence interval falls into the hypothesized range of values. When this is the case, we infer that the data support our hypothesis.

**Different Outcomes**

When we divide the range of possible effect sizes into two mutually exclusive regions, we can distinguish three possible outcomes.

One possible outcome is that the confidence interval falls into a predicted region. In this case, the data provide support for the prediction.

A second possible outcome is that the confidence interval overlaps with the predicted range of values but also falls outside it. For example, the data could have produced an effect size estimate of d = .1 and a confidence interval ranging from -.3 to .5. In this case, the data are inconclusive. The population effect size may, as predicted, be positive, but it could also be negative.

The third possible outcome is that the confidence interval falls entirely outside the predicted range of values (e.g., d = -.5 with a confidence interval from -.9 to -.1). In this case, the data disconfirm the prediction of a positive effect. It follows that it is not even necessary to make a prediction one way or the other. We can simply see which region contains the confidence interval and infer that the population effect size is in that region.

**Do We Need A Priori Hypotheses?**

Let’s assume that we predicted a positive effect, so our hypothesis covers all effect sizes greater than zero, and the confidence interval includes values from d = .1 to .9. We said that this finding allows us to accept our hypothesis that the effect size is positive; that is, it lies within an interval ranging from 0 to infinity, excluding zero. However, the confidence interval specifies a much smaller range of values. A confidence interval ranging from .1 to .9 not only excludes negative values and zero, it also excludes values of 1 or 2. Thus, we are not using all of the information in our data when we simply infer that the effect size is positive, a claim that includes trivial values like 0.0000001 and implausible values like 999.9. The advantage of reporting results with confidence intervals is that we can specify a narrow range of values that are consistent with the data. This is particularly helpful when the confidence interval is close to zero. For example, a confidence interval that ranges from d = .001 to d = .801 can be used to claim that the effect size is positive, but it cannot be used to claim that the effect size is theoretically meaningful, unless d = .001 is theoretically meaningful.

**Specifying A Minimum Effect Size**

To make progress, psychology has to start taking effect sizes more seriously, and this is best achieved by reporting confidence intervals. Confidence intervals ranging from d = .8 to d = 1.2 and from d = .01 to d = .41 are both consistent with the prediction that there is a positive effect, p < .05. However, the two confidence intervals also specify very different ranges of possible effect sizes. Whereas the first confidence interval rejects small and moderate effect sizes, the second confidence interval rejects large effect sizes. Traditional hypothesis testing with p-values hides this distinction and makes it look as if the two studies produced identical results. However, the lowest value of the first interval (d = .8) is higher than the highest value of the second interval (d = .41), which implies that the results are significantly different from each other. Thus, the two studies produced conflicting results when we consider effect sizes, while giving the same answer about the direction of the effect.
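The claim that these two results differ significantly can be checked with a z test of the difference between two independent estimates (a sketch; the point estimates and SEs are recovered here as the midpoints and approximate quarter-widths of the two 95% intervals):

```python
import math
from statistics import NormalDist

def compare_effects(d1, se1, d2, se2):
    """z test for the difference between two independent effect size
    estimates; returns z and the two-sided p value."""
    z = (d1 - d2) / math.sqrt(se1**2 + se2**2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# CI [.8, 1.2] -> d ~ 1.0, SE ~ .1; CI [.01, .41] -> d ~ .21, SE ~ .1
z, p = compare_effects(1.0, 0.1, 0.21, 0.1)
# z is well above the usual critical value of 1.96, so the two
# effect size estimates differ significantly from each other.
```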

If predictions were made in terms of a minimum effect size that is theoretically or practically relevant, the distinction between the two results would also be visible. For example, a standard criterion for a minimum effect size could be a small effect size of d = .2. Using this criterion, the first study confirms the prediction (the confidence interval from .8 to 1.2 falls into the region from .2 to infinity), but the second study does not (the interval from .01 to .41 falls partially outside the region from .2 to infinity). In the second case, the data are inconclusive.

If the population effect size is zero (e.g., the effect of random future events on behavior), confidence intervals will cluster around zero. Even so, it is hard to fit a confidence interval inside a narrow region around zero (e.g., d = -.2 to d = .2), because the interval must be narrower than the region. This is the reason why it is empirically difficult to provide evidence for the absence of an effect; reducing the minimum effect size makes it even harder and eventually impossible. However, logically there is nothing special about providing evidence for the absence of an effect. We are again dividing the range of plausible effects into two regions: (a) values smaller than the minimum effect size in absolute terms and (b) values larger than the minimum effect size. We then decide in favor of the region that fully contains the confidence interval. Of course, we can also do this without an a priori range of effect sizes. For example, if we find a confidence interval ranging from -.15 to +.18, we can infer from this finding that the population effect size is small (less than .2 in absolute value).
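Evidence for the absence of a meaningful effect follows the same containment logic, here with the trivial region from -.2 to .2 (an illustrative sketch, akin to an equivalence test; the function name is my own):

```python
def supports_absence(ci_low, ci_high, min_effect=0.2):
    """True when the confidence interval falls entirely inside the
    trivial region (-min_effect, +min_effect), supporting the
    absence of a theoretically meaningful effect."""
    return -min_effect < ci_low and ci_high < min_effect

print(supports_absence(-0.15, 0.18))  # effect is at most trivial
print(supports_absence(-0.15, 0.25))  # a small positive effect remains possible
```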

**But What about Bayesian Statistics?**

Bayesian statistics also uses information about effect sizes and sampling error. The main difference is that Bayesians assume we have prior knowledge that can inform our interpretation of the results. For example, if one hundred studies already tested the same hypothesis, we can use the information from these studies. In this case, it would also be possible to conduct a meta-analysis and to draw inferences from the evidence of all 101 studies, rather than just a single study. Bayesians also sometimes incorporate information that is harder to quantify. However, the main logic of hypothesis testing does not change whether we use confidence intervals or Bayesian credible intervals. Ironically, Bayesians also tend to use alpha = .05 when they report 95% credible intervals. The only difference is that information external to the data (prior distributions) is used, whereas confidence intervals rely exclusively on information from the data.

**Conclusion**

I hope that this blog post helps researchers to better understand what they are doing. Empirical studies provide two important estimates: an effect size estimate and a sampling error estimate. This information can be used to create intervals that specify a range of values that are likely to contain the population effect size. Hypothesis testing divides the range of possible values into regions and decides in favor of the hypothesis whose region fully contains the confidence interval. However, hypothesis testing is redundant and less informative, because we can simply decide in favor of the values inside the confidence interval, which is a smaller range than the range of values specified by a theoretical prediction. The use of confidence intervals makes it possible to identify weak evidence (the confidence interval excludes zero, but not very small values that are not theoretically interesting) and also makes it possible to provide evidence for the absence of an effect (the confidence interval only includes trivial values).

A common criticism of hypothesis testing is that it is difficult to understand and not intuitive. The use of confidence intervals solves this problem. Seeing whether a small object fits into a larger object is probably mastered at an early developmental stage in Piaget’s model, and most applied research workers should be able to carry out these comparisons. Standardized effect sizes also help with evaluating the size of these objects. Thus, confidence intervals provide all of the information that applied research workers need to carry out empirical studies and to draw inferences from them. The main statistical challenge is to obtain estimates of sampling error in complex designs; that is the job of statisticians. The main job of empirical research workers is to collect theoretically or practically important data with small sampling error.

I’m not a researcher, but I appreciate your site.

Great post. I plan to share this with other research friends. The next step is improving statistical training and literacy in the young generation of researchers like myself.

Thank you for the feedback.

“It is the hypothesis that they want to reject or NULLify”. Perhaps just a poor choice of words, but “want to reject” is at the heart of so many of social psych’s problems.

I see what you are saying, but the only thing you can do with the nil-hypothesis is to reject it. So the nature of the null-hypothesis is the reason everybody wants to reject it.

Hi Ulrich,

“Power is simply not defined when the hypothesis is that the population effect size is zero.” Maybe I do not understand this sentence correctly, but I think this is not correct. Power is simply defined as the probability to identify a particular effect different from the effect under study. However, it is not a feature of an underlying distribution, but of a decision-theoretic comparison between two distributions, in the most simple case of two sampling distributions of t (we call it a t-test). Mathematically, the form of the central t distribution (aka Student’s t) is not much different from non-central t distributions, except for the non-centrality parameter (which is 0, reflecting the absence of any effect, for the central t distribution). Assume now that we are testing our research hypothesis of, e.g., d >= 0.5 as a meaningful non-nil null hypothesis (d_0) (since we are hard-boned falsificationists!). The alternative is H1: d < 0.5. Then, the probability to identify a possibly correct alternative d_1 (which actually is power, right?) will be higher the larger the difference between d_0 and d_1 is, right? Which means that the power to identify an alternative correctly is higher for d_1 = 0 than for larger values of d, e.g., d_1 = 0.3. Which also means that power for the correct identification of a true population effect of d = 0 does not only exist, but is also computable.

Hi Uwe,

I am following the common definition of power as a conditional probability.

“The power of a binary hypothesis test is the probability that the test rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true.”

https://en.wikipedia.org/wiki/Power_(statistics)

However, I have no problem using the term also for cases where the null is true and “power” equals alpha.

As long as it is clear what we are talking about, both are important and useful concepts.

But doesn’t my example exactly fit the definition from Wikipedia, provided my null hypothesis (d=0.5, assumed effect) is false and a specific alternative (d=0, no effect) is true?

Maybe I am not understanding your point. If I am testing d > .5, power would not be defined if the true population parameter is d = .5. No matter how many subjects I have, I will only get significant results at the rate of alpha.