For decades, psychologists have misused the scientific method and statistical significance testing. Instead of using significance tests to confirm or falsify theoretical predictions, they only published statistically significant results that confirmed predictions. This selection for significance undermines the ability of statistical tests to distinguish between true and false hypotheses (Sterling, 1959).
Another problem is that psychologists ignore effect sizes. A significant result in a test of the nil-hypothesis (no effect) only rejects the hypothesis that the effect size is zero. It is still possible that the population effect size is so small that it has no practical significance. In the 1990s, psychologists addressed this problem by publishing standardized effect sizes. The problem is that selection for significance also inflates these effect size estimates. Thus, journals may publish effect size estimates that seem important when the actual effect sizes are trivial.
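To illustrate how selection for significance inflates effect size estimates, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration rather than taken from any cited study: a true effect of d = .2, two groups of n = 20, normally distributed scores, and a publication rule that keeps only results that are significant in the predicted direction (t > 2.024, the two-sided .05 critical value for 38 degrees of freedom). The function names are hypothetical.

```python
import math
import random
import statistics


def simulate_study(true_d, n, rng):
    """Simulate one two-group study; return (observed Cohen's d, t-value)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treatment = [rng.gauss(true_d, 1.0) for _ in range(n)]
    # Pooled standard deviation for two groups of equal size.
    pooled_sd = math.sqrt(
        (statistics.variance(control) + statistics.variance(treatment)) / 2
    )
    d = (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd
    # For equal group sizes, t = d * sqrt(n / 2).
    t = d * math.sqrt(n / 2)
    return d, t


def simulate_literature(true_d=0.2, n=20, studies=5000, t_crit=2.024, seed=1):
    """Compare the mean effect size of all studies with the 'published' ones.

    Assumption: a study is published only if it is significant in the
    predicted direction (t > t_crit).
    """
    rng = random.Random(seed)
    results = [simulate_study(true_d, n, rng) for _ in range(studies)]
    all_d = [d for d, _ in results]
    published_d = [d for d, t in results if t > t_crit]
    return {
        "mean_d_all": statistics.mean(all_d),
        "mean_d_published": statistics.mean(published_d),
        "share_published": len(published_d) / studies,
    }


if __name__ == "__main__":
    for name, value in simulate_literature().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, a study can only reach significance if its observed d exceeds roughly .64, so the average published effect size has to be several times larger than the true effect of .2, even though the average across all simulated studies stays close to .2.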
The impressive reproducibility project (OSC, 2015) found that original effect sizes were cut in half in replication studies that did not select for significance. In other words, published effect size estimates are, on average, inflated by 100%. Importantly, this average inflation applied equally to cognitive and social psychology. However, social psychology has more replication failures, which also implies larger inflation of effect sizes. Thus, most published effect sizes in social psychology are likely to provide misleading information about the actual effect sizes.
There have been some dramatic examples of effect-size inflation. Most prominently, a large literature with the ego-depletion paradigm (Baumeister et al., 1998) produced a meta-analytic mean effect size of d = .6. However, a recent replication study, organized by researchers who had published many studies with results selected for significance, produced an effect size of only d = .06 without selection for significance (Schmeichel & Vohs, 2020). It is not important whether this effect size is different from zero or not. The real null-hypothesis here is an effect size of d = .6, and d = .06 is both statistically and practically significantly different from .6. In other words, the effect sizes in studies selected for significance were dramatically inflated, by a factor of 10. This means that none of the published results on ego-depletion are credible.
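The arithmetic behind these inflation figures can be sketched as follows, assuming that inflation is expressed relative to the replication estimate. The function names are illustrative and do not come from any of the cited sources. Under this definition, an estimate that is cut in half in replication was inflated by 100%, and the drop from d = .6 to d = .06 is a tenfold ratio, that is, an increase of 900% over the replication estimate.

```python
def inflation_ratio(published, replication):
    """How many times larger the published estimate is than the replication."""
    if replication <= 0:
        raise ValueError("replication estimate must be positive")
    return published / replication


def inflation_percent(published, replication):
    """Inflation as a percentage of the replication estimate."""
    if replication <= 0:
        raise ValueError("replication estimate must be positive")
    return (published - replication) / replication * 100.0


if __name__ == "__main__":
    # Ego-depletion figures quoted in the text: d = .6 versus d = .06.
    print(f"ratio: {inflation_ratio(0.6, 0.06):.1f}")
    print(f"inflation: {inflation_percent(0.6, 0.06):.0f}%")
    # An effect size that is cut in half in replication.
    print(f"inflation when halved: {inflation_percent(1.0, 0.5):.0f}%")
```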
As I pointed out in my criticism of research practices in social psychology (Schimmack, 2012), other paradigms in social psychology have produced equally shocking inflation of effect sizes.
One possible explanation is that researchers do not care about effect sizes. Researchers may not consider it unethical to use questionable research methods that inflate effect sizes as long as they are convinced that the sign of the reported effect is consistent with the sign of the true effect. For example, the theory that implicit attitudes are malleable is supported by a positive effect of experimental manipulations on the implicit association test, no matter whether the effect size is d = .8 (Dasgupta & Greenwald, 2001) or d = .08 (Joy-Gaba & Nosek, 2010), and the influence of blood glucose levels on self-control is supported by a strong correlation of r = .6 (Gailliot et al., 2007) and a weak correlation of r = .1 (Dvorak & Simons, 2009).
How have IAT researchers responded to the realization that original effect sizes may have been dramatically inflated? Not much has changed. Citations show that the original article, with its tenfold inflated effect size, is still cited much more frequently than the replication study with a trivial effect size.
Closer inspection of these citations shows that implicit bias researchers continue to cite the old study as if it provided credible evidence.
Axt, Casola, and Nosek (2019) mention the new study, but do not mention the results.
“The closest are studies investigating malleability of implicit attitudes (Joy-Gaba & Nosek, 2010; Lai et al., 2014). For example, in Lai et al. (2014), priming the concept of multiculturalism was moderately effective at reducing implicit preferences for White versus Black people, but did not alter implicit preferences for White versus Hispanic people or White versus Asian people.”
Devine, Forscher, Austin, and Cox (2012) wrote
“The reality of lingering racial disparities, combined with the empirically established links between implicit bias and pernicious discriminatory outcomes, has led to a clarion call for strategies to reduce these biases (Fiske, 1998; Smedley, Stith, & Nelson, 2003). In response, the field has witnessed an explosion of empirical efforts to reduce implicit biases (Blair, 2002). These efforts have yielded a number of easy-to-implement strategies, such as taking the perspective of stigmatized others (Galinsky & Moskowitz, 2000) and imagining counter-stereotypic examples (Blair, Ma, & Lenton, 2001; Dasgupta & Greenwald, 2001), that lead to substantial reductions in implicit bias, at least for a short time (i.e., up to 24 hours)” (p. 1268).
Lai et al. (2014) write:
“How can the expression of implicit racial preferences be reduced to mitigate subsequent discriminatory behavior? Indeed, significant progress has been made in the goal of identifying the processes underlying malleability and change in implicit evaluations (Dasgupta & Greenwald, 2001; Mitchell, Nosek, & Banaji, 2003; Olson & Fazio, 2006; Rudman, Ashmore, & Gary, 2001; for reviews, see Blair, 2002; Dasgupta, 2009; Gawronski & Bodenhausen, 2006; Gawronski & Sritharan, 2010; Lai, Hoffman, & Nosek, 2013; Sritharan & Gawronski, 2010).”
Even more problematic is the statement “Prior research demonstrates that exposure to positive Black and negative White exemplars can shift implicit racial preferences (Dasgupta & Greenwald, 2001; Joy-Gaba & Nosek, 2010),” as if d = .8 were equivalent to d = .08 (Lai et al., 2014, p. 1771).
Payne and colleagues write
“Numerous studies have documented that performance on implicit bias tests is malleable in response to various manipulations of the context. For example, implicit racial bias scores can be shifted by interacting with an African American experimenter, listening to rap music, or looking at a photo of Denzel Washington (Dasgupta & Greenwald, 2001; Lowery, Hardin, & Sinclair, 2001; Rudman & Lee, 2002).” (p. 235).
A positive example that cites Joy-Gaba and Nosek (2010) correctly comes from an outsider.
Natalie Salmanowitz writes in the Journal of Law and the Biosciences that “a short, impersonal exposure to counterstereotypical exemplars cannot be expected to counteract a lifetime of ingrained mental associations” (p. 180).
In conclusion, science is self-correcting and IAT researchers are not self-correcting. Therefore, IAT research is not science until IAT researchers are honest about the research practices that produced dramatically inflated effect sizes and irreproducible results. Open practices alone are not enough. Honesty and a commitment to pursuing the truth (rather than fame or happiness) are essential for scientific progress.