Failure to Accept the Null-Hypothesis: A Case Study

The past decade has revealed many flaws in the way psychologists conduct empirical tests of theories. The key problem is that psychologists lacked an accepted strategy to conclude that a prediction was not supported. This fundamental flaw can be traced back to Fisher’s introduction of significance testing. In Fisher’s framework, the null-hypothesis is typically specified as the absence of an effect in either direction. That is, the effect size is exactly zero. Significance testing examines how much empirical results deviate from this prediction. If the probability of the result or even more extreme deviations is less than 5%, the null-hypothesis is rejected. However, if the p-value is greater than .05, no inferences can be drawn from the finding because there are two explanations for it. Either the null-hypothesis is true, or it is false and the result is a false negative. The probability of such a false negative result is unspecified in Fisher’s framework. This asymmetrical approach to significance testing continues to dominate psychological science.

Criticism of this one-sided approach to significance testing is nearly as old as nil-hypothesis significance testing itself (Greenwald, 1975; Sterling, 1959). Greenwald’s (1975) article is notable because it provided a careful analysis of the problem and it pointed towards a solution to this problem that is rooted in Neyman-Pearson’s alternative to Fisher’s significance testing. Greenwald (1975) showed how it is possible to “Accept the Null-Hypothesis Gracefully” (p. 16).

“Use a range, rather than a point, null hypothesis. The procedural recommendations to follow are much easier to apply if the researcher has decided, in advance of data collection, just what magnitude of effect on a dependent measure or measure of association is large enough not to be considered trivial. This decision may have to be made somewhat arbitrarily but seems better to be made somewhat arbitrarily before data collection than to be made after examination of the data.” (p. 16).

The reason for using a range is simply that it is impossible to provide evidence for the nil-hypothesis that an effect size is exactly zero, just like it is impossible to show that an effect size equals any other precise value (e.g., r = .1). Although Greenwald made this sensible suggestion over 40 years ago, it is nearly impossible to find articles that specify a range of effect sizes a priori (e.g., we expected the effect size to be in the range between r = .3 and r = .5 or we expected the correlation to be larger than r = .1).

Bad training continues to be a main reason for the lack of progress in psychological science. However, other factors also play a role. First, specifying effect sizes a priori has implications for the specification of sample sizes. A researcher who declares that effect sizes as small as r = .1 are meaningful and expected needs large samples to obtain precise effect size estimates. For example, assuming the population correlation is r = .2 and a researcher wants to show that it is at least r = .1, a one-sided test with alpha = .05 and 95% power (i.e., the probability of a successful outcome) requires N = 1,035 participants. As most sample sizes in psychology are below N = 200, most studies simply lack the precision to test hypotheses that predict small effects. A solution to this might be to focus on hypotheses that predict large effect sizes. However, to show that a population correlation of r = .4 is greater than r = .3 still requires N = 833 participants. In fact, most studies in psychology barely have enough power to demonstrate that moderate correlations, r = .3, are greater than zero, N = 138. In short, most studies are too small to provide evidence for the null-hypothesis that effect sizes are smaller than a minimum effect size. Not surprisingly, psychological theories are rarely abandoned because empirical results seemed to support the null-hypothesis.
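The sample sizes above can be approximated with the Fisher z transformation. The following is only a sketch under stated assumptions: the function name `n_for_correlation` is mine, the normal approximation is not necessarily the method used to obtain the numbers in the text, and I assume that the test against zero is two-sided, because that is the setting under which this approximation lands near N = 138. The results come out within a participant or two of the figures quoted above.

```python
from math import atanh, ceil
from statistics import NormalDist


def n_for_correlation(rho, rho0, alpha=0.05, power=0.95, two_sided=False):
    """Approximate N needed to show that a population correlation rho
    differs from a comparison value rho0 (Fisher z approximation).

    Hypothetical helper for illustration; rho must differ from rho0.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_power = z(power)
    # Distance between the two correlations on the Fisher z scale.
    delta = atanh(rho) - atanh(rho0)
    # The standard error of Fisher's z is 1 / sqrt(N - 3), hence the "+ 3".
    return ceil(((z_alpha + z_power) / delta) ** 2 + 3)


print(n_for_correlation(0.2, 0.1))                   # r = .2 vs. minimum r = .1
print(n_for_correlation(0.4, 0.3))                   # r = .4 vs. minimum r = .3
print(n_for_correlation(0.3, 0.0, two_sided=True))   # r = .3 vs. zero (assumed two-sided)
```

The useful thing to notice is that the required N depends on the distance between the true effect and the minimum effect, not on the size of the effect itself, which is why predicting large effects does not make range hypotheses cheap to test.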

However, occasionally studies do have large samples and it would be possible to follow Greenwald’s (1975) recommendation to specify a minimum effect size a priori. For example, Greenwald and colleagues conducted a study with N = 1,411 participants who reported their intentions to vote for Obama or McCain in the 2008 US elections. The main hypothesis was that implicit measures of racial attitudes like the race IAT would add to the prediction because some White Democrats might not vote for a Black Democratic candidate. It would have been possible to specify a minimum effect size based on a meta-analysis that was published in the same year. This meta-analysis of smaller studies suggested that the average race IAT – criterion correlation was r = .236. The explicit – criterion correlation was r = .186, and the explicit-implicit correlation was only r = .117. Given the lower estimates for the explicit measures and the low explicit-implicit correlation, a regression analysis would only slightly reduce the effect size for the incremental predictive validity of the race IAT, b = .225. Thus, it would have been possible to test the hypothesis that the effect size is at least b = .1, which would imply that adding the race IAT as a predictor explains at least 1% additional variance in voting behaviors.
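For readers who want to see how an incremental effect size follows from three correlations, here is a minimal sketch of the standard two-predictor formula for a standardized regression coefficient. The function names are hypothetical and this is not necessarily how the value in the text was computed. With the correlations exactly as quoted above, the formula gives a coefficient of about .22, slightly below the b = .225 in the text; the difference may reflect rounding or slightly different inputs.

```python
def incremental_beta(r_y1, r_y2, r_12):
    """Standardized coefficient of predictor 1 when predictor 2 is
    also in the model (two-predictor case only)."""
    return (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)


def added_variance(beta_1, r_12):
    """Increase in R-squared from adding predictor 1 to a model that
    already contains predictor 2 (squared semipartial correlation)."""
    return beta_1 ** 2 * (1 - r_12 ** 2)


# Correlations as quoted in the text: implicit-criterion,
# explicit-criterion, and implicit-explicit.
b_iat = incremental_beta(0.236, 0.186, 0.117)
print(round(b_iat, 3))

# A minimum effect size of b = .1 corresponds to roughly 1% added variance.
print(round(added_variance(0.1, 0.117), 4))
```

The second function shows why b = .1 is a natural minimum effect size here: with predictors that are nearly uncorrelated, it translates into about one percent of additional explained variance.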

In reality, the statistical analyses were conducted with prejudice against the null-hypothesis. First, Greenwald et al. (2009) noted that “conservatism and symbolic racism were the two strongest predictors of voting intention (see Table 1)” (p. 247).

A straightforward way to test the hypothesis that the race IAT contributes to the prediction of voting would be to add the standardized race IAT as an additional predictor and use the regression coefficient to test the prediction that implicit bias as measured with the race IAT contributes to voting against Obama. A more stringent test of incremental predictive validity would also include the other explicit prejudice measures because measurement error alone can produce incremental predictive validity for measures of the same construct. However, this is not what the authors did. Instead, they examined whether the four racial attitude measures jointly predicted variance in addition to political orientation. This was the case, with 2% additional explained variance (p < .0010). However, this result does not tell us anything about the unique contribution of the race IAT. The unique contributions of the four measures were not reported. Instead, another regression model tested whether the race IAT and a second implicit measure (the Affect Misattribution Procedure, AMP) explained incremental variance in addition to political orientation. In this model “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (p. 247). This model also does not tell us anything about the importance of the race IAT because it was not reported how much of the joint contribution was explained by the race IAT alone. The inclusion of the AMP also makes it impossible to test the statistical significance for the race IAT because most of the prediction may come from the shared variance between the two implicit measures, r = .218. Most important, the model does not test whether the race IAT predicts voting above and beyond explicit measures, including symbolic racism.

Another multiple regression analysis entered symbolic racism and the two implicit measures. In this analysis, the two implicit measures combined explained an additional 0.7% of the variance, but this was not statistically significant, p = .07.

The authors then fitted the model with all predictor variables. In this model, the four attitude measures explained an additional 1.3% of the variance, p = .01, but no information is provided about the unique contribution of the race IAT or the joint contribution of the two implicit measures. The authors merely comment that “among the four race attitude measures, the thermometer difference measure was the strongest incremental predictor and was also the only one of the four that was individually statistically significant in their simultaneous entry after both symbolic racism and conservatism” (p. 247).

To put it mildly, the presented results carefully avoid reporting the crucial result about the incremental predictive validity of the race IAT after explicit measures of prejudice are entered into the equation. Adding the AMP only creates confusion because the empirical question is how much the race IAT adds to the prediction of voting behavior. Whether this variance is shared with another implicit measure or not is not relevant.

Table 1 can be used to obtain the results that were not reported in the article. A regression analysis shows a standardized effect size estimate of 0.000 with a 95%CI that ranges from -.047 to .046. The upper limit of this confidence interval is below the minimum effect size of .1 that was used to specify a reasonable null-hypothesis. Thus, the only study that had sufficient precision to test the incremental predictive validity of the race IAT shows that the IAT does not make a meaningful, notable, practically significant contribution to the prediction of racial bias in voting. In contrast, several self-report measures did show that racial bias influenced voting behavior above and beyond the influence of political orientation.

Greenwald et al.’s (2009) article illustrates Greenwald’s (1975) prejudice against the null-hypothesis. Rather than reporting a straightforward result, the authors present several analyses that disguise the fact that the race IAT did not predict voting behavior. Based on these questionable analyses, the authors misrepresent the findings. For example, they claim that “both the implicit and explicit (i.e., self-report) race attitude measures successfully predicted voting.” They omit that this statement is only correct when political orientation and symbolic racism are not used as predictors.

They then argue that their results “supplement the substantial existing evidence that race attitude IAT measures predict individual behavior (reviewed by Greenwald et al., 2009)” (p. 248). This statement is false. The meta-analysis suggested that incremental predictive validity of the race IAT is r ~ .2, whereas this study shows an effect size of r ~ 0 when political orientation is taken into account.

The abstract, often the only information that is available or read, further misleads readers. “The implicit race attitude measures (Implicit Association Test and Affect Misattribution Procedure) predicted vote choice independently of the self-report race attitude measures, and also independently of political conservatism and symbolic racism. These findings support construct validity of the implicit measures” (p. 242). Careful reading of the results section shows that the statement refers to separate analyses in which implicit measures are tested controlling for explicit attitude ratings OR political orientation OR symbolic racism. The new results presented here show that the race IAT does not predict voting controlling for explicit attitudes AND political orientation AND symbolic racism.

The deceptive analysis of these data has led to many citations claiming that the race IAT is an important predictor of actual behavior. For example, in their popular book “Blindspot” Banaji and Greenwald list this study as an example that “the Race IAT predicted racially discriminatory behavior. A continuing stream of additional studies that have been completed since publication of the meta-analysis likewise supports that conclusion. Here are a few examples of race-relevant behaviors that were predicted by automatic White preference in these more recent studies: voting for John McCain rather than Barack Obama in the 2008 U.S. presidential election” (p. 49).

Kurdi and Banaji (2017) use the study to claim that “investigators have used implicit race attitudes to predict widely divergent outcome measures” (p. 282), without noting that even the reported results showed less than 1% incremental predictive validity. A review of prejudice measures features this study as an example of predictive validity (Fiske & North, 2014).

Of course, a single study with a single criterion is insufficient to accept the null-hypothesis that the race IAT lacks incremental predictive validity. A new meta-analysis by Kurdi et al. (2019), with Greenwald as co-author, provides new evidence about the typical amount of incremental predictive validity of the race IAT. The only problem is that this information is not provided. I therefore analyzed the open data to get this information. The meta-analytic results suggest an implicit-criterion correlation of r = .100, se = .01, an explicit-criterion correlation of r = .127, se = .02, and an implicit-explicit correlation of r = .139, se = .022. A regression analysis yields an estimate of the incremental predictive validity for the race IAT of .084, 95%CI = .040 to .121. While this effect size is statistically significant in a test against the nil-hypothesis, it is also statistically different from Greenwald et al.’s (2009) estimate of b = .225. Moreover, the point estimate is below .1, which could be used to affirm the null-hypothesis, but the confidence interval includes a value of .1. Thus, there is a 20% chance (an 80%CI would not include .1) that the effect size is greater than .1, but it is unlikely (p < .05) that it is greater than .12.
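The two probabilities at the end of the preceding paragraph can be roughly recovered from the reported confidence interval. The sketch below rests on two assumptions of mine that the text does not state: that the estimate is approximately normally distributed, and that the standard error can be taken from the upper half of the slightly asymmetric 95%CI. The names `estimate`, `ci_upper`, and `prob_greater` are hypothetical.

```python
from statistics import NormalDist

# Values reported in the text for the incremental validity of the race IAT.
estimate = 0.084
ci_upper = 0.121

# Assumption: derive a standard error from the upper half-width of the 95%CI.
se_upper = (ci_upper - estimate) / NormalDist().inv_cdf(0.975)


def prob_greater(threshold):
    """Approximate probability that the effect size exceeds `threshold`,
    treating the estimate as normal with standard error `se_upper`."""
    return 1 - NormalDist(mu=estimate, sigma=se_upper).cdf(threshold)


print(round(prob_greater(0.10), 2))  # chance that the effect exceeds .1
print(round(prob_greater(0.12), 3))  # chance that the effect exceeds .12
```

Under these assumptions the first probability comes out near 20% and the second below 5%, in line with the figures in the text.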

Greenwald and Lai (2020) wrote an Annual Review article about implicit measures. It mentions that estimates of the predictive validity of IATs have decreased from r = .274 (Greenwald et al., 2009) to r = .097 (Kurdi et al., 2019). No mention is made of a range of effect sizes that would support the null-hypothesis that implicit measures do not add to the prediction of prejudice because they do not measure an implicit cause of behavior that is distinct from causes of prejudice that are reflected in self-report measures. Thus, Greenwald fails to follow the advice of his younger self to provide a strong test of a theory by specifying effect sizes that would provide support for the null-hypothesis and against his theory of implicit cognitions.

It is not only ironic to illustrate the prejudice against falsification with Greenwald’s own research. It also shows that the one-sided testing of theories that avoids failures is not only due to a lack of proper training in statistics or philosophy of science. After all, Greenwald demonstrated that he is well aware of the problems with nil-hypothesis testing. Thus, only motivated biases can explain the one-sided examination of the evidence. Once researchers have made a name for themselves, they are no longer neutral observers like judges or juries. They are more like prosecutors who will try as hard as possible to get a conviction and ignore evidence that may support a not-guilty verdict. To make matters worse, science does not really have an adversarial system in which a defense lawyer stands up for the defendant (i.e., the null-hypothesis), so no evidence can be presented to support the defendant.

Once we realize the power of motivated reasoning, it is clear that we need to separate the work of theory development and theory evaluation. We cannot let researchers who developed a theory conduct meta-analyses and write review articles, just like we cannot ask film directors to write their own movie reviews. We should leave meta-analyses and reviews to a group of theoretical psychologists who do not conduct original research. As grant money for original research is extremely limited and a lot of time and energy is wasted on grant proposals, there is ample capacity for psychologists to become meta-psychologists. Their work also needs to be evaluated differently. The aim of meta-psychology is not to make novel discoveries, but to confirm that claims by original researchers about their discoveries are actually robust, replicable, and credible. Given the well-documented bias in the published literature, a lot of work remains to be done.

Replicability-Index

Improving the replicability of empirical research
