
Implicit Racism, Starbucks, and the Failure of Experimental Social Psychology

Implicit racism is in the news again (CNN). A manager of a Starbucks in Philadelphia called 911 to ask police to remove two Black men from the coffee shop because they had not purchased anything. The problem is that many White customers frequent Starbucks without purchasing anything and the police are not called. The incident caused widespread protests, and Starbucks announced that it would close all of its stores for “implicit bias training.”

NAACP president and CEO Derrick Johnson explains the need for store-wide training in this quote.

“The Starbucks situation provides dangerous insight regarding the failure of our nation to take implicit bias seriously,” said the group’s president and CEO Derrick Johnson in a statement. “We refuse to believe that our unconscious bias –the racism we are often unaware of—can and does make its way into our actions and policies.”

But was it implicit bias? It does not matter. CEO Derrick Johnson could have talked about racism without changing what happened or the need for training.

“The Starbucks situation provides dangerous insight regarding the failure of our nation to take racism seriously,” said the group’s president and CEO Derrick Johnson in a statement. “We refuse to believe that we are racists and that racism can and does make its way into our actions and policies.”

We have not heard from the store manager why she called the police. This post is not about a single incident at Starbucks because psychological science can rarely provide satisfactory answers about single events. However, the call for training of thousands of Starbucks’ employees is not a single event. It implies that social psychologists have developed scientific ways to measure “implicit bias” and ways to change it. This is the topic of this post.

What is implicit bias and what can be done to reduce it?

The term “implicit” has a long history in psychology, but it rose to prominence in the early 1990s when computers became more widely used in psychological research.  Computers made it possible to present stimuli on screens rather than on paper and to measure reaction times rather than self-ratings.  Computerized tasks were first used in cognitive psychology to demonstrate that people have associations that can influence their behaviors.  For example, participants are faster to determine that “doctor” is a word if the word is presented after a related word like “hospital” or “nurse.”

The term implicit is used for effects like this because the effect occurs without participants’ intention, conscious reflection, or deliberation. They do not want to respond this way, but they do, whether they want to or not.  Implicit effects can occur with or without awareness, but they are generally uncontrollable.

After a while, social psychologists started to use computerized tasks that were developed by cognitive psychologists to study social topics like prejudice.  Most studies used White participants to demonstrate prejudice with implicit tasks. For example, the association task described above can be easily modified by showing traditionally White or Black names (in the beginning computers could not present pictures) or faces.

Given the widespread prevalence of stereotypes about African Americans, many of these studies demonstrated that White participants respond differently to Black or White stimuli.  Nobody doubts these effects.  However, there remain two unanswered questions about these effects.

What (the fuck) is Implicit Racial Bias?

First, do responses in this implicit task with racial stimuli measure a specific form of prejudice?  That is, do implicit tasks measure plain old prejudice with a new measure or do they actually measure a new form of prejudice?  The main problem is that psychologists are not very good at distinguishing constructs and measures.  This goes back to the days when psychologists equated measures and constructs.  For example, to answer the difficult question whether IQ tests measure intelligence, it was simply postulated that intelligence is what IQ tests measure.  Similarly, there is no clear definition of implicit racial bias.  In social psychology implicit racism is essentially whatever leads to different responses to Black and White stimuli in an implicit task.

The main problem with this definition is that different implicit tasks show low convergent validity. Somebody can take two different “implicit tests” (e.g., the popular Implicit Association Test, IAT, and the Affect Misattribution Procedure) and get different results. The correlations between two different tests range from 0 to .3, which means that the tests disagree with each other more than they agree.

Twenty years after the first implicit tasks were used to study prejudice, we still do not know whether implicit bias even exists or how it could be measured, despite the fact that these tests are made available to the public to “test their racial bias.” These tests do not meet the standards of real psychological tests and nobody should take their test scores too seriously. A brief moment of self-reflection is likely to provide better evidence about your own feelings towards different social groups. How would you feel if somebody from this group moved in next door? How would you feel if somebody from this group married your son or daughter? Responses to questions like this have been used for over 100 years and they still show that most people have a preference for their own group over most other groups. The main concern is that respondents may not answer these survey questions honestly. But if you do so in private for yourself and you are honest with yourself, you will know better how prejudiced you are towards different groups than by taking an implicit test.

What was the Starbucks’ manager thinking or feeling when she called 911? The answer to this question would be more informative than giving her an implicit bias test.

Is it possible to Reduce Implicit Bias?

Any scientific answer to this question requires measuring implicit bias. The ideal study to examine the effectiveness of any intervention is a randomized controlled trial. In this case it is easy to do so because many White Americans who are prejudiced do not want to be prejudiced. They learned to be prejudiced through parents, friends, school, or media. Racism has been part of American culture for a long time and even individuals who do not want to be prejudiced respond differently to White and African Americans. So, there is no ethical problem in subjecting participants to an anti-racism training program. It is like asking smokers who want to quit smoking to participate in a test of a new treatment for nicotine addiction.

Unfortunately, social psychologists are not trained in running well-controlled intervention studies. They are mainly trained to do experiments that examine the immediate effects of an experimental manipulation on some measure of interest. Another problem is that published articles typically report only successful experiments. This publication bias leads to the wrong impression that it may be easy to change implicit bias.

For example, one of the leading social psychologists on implicit bias published an article with the title “On the Malleability of Automatic Attitudes: Combating Automatic Prejudice With Images of Admired and Disliked Individuals” (Dasgupta & Greenwald, 2001). The title makes two (implicit) claims: implicit attitudes can change (they are malleable), and the article introduces a method that successfully reduces prejudice (combating it). This article was published 17 years ago and it has been cited 537 times so far.


Study 1

The first experiment relied on a small sample of university students (N = 48).  The study had three experimental conditions with n = 18, 15, and 15 for each condition.  It is now recognized that studies with fewer than n = 20 participants per condition are questionable (Simmons et al., 2011).

The key finding in this study was that scores on the Implicit Association Test (IAT) were lower when participants were exposed to positive examples of African Americans (e.g., Denzel Washington) and negative examples of European Americans (e.g., Jeffrey Dahmer, a serial killer) than in the control condition, F(1, 31) = 5.23, p = .023.

The observed mean difference is d = .80. This is considered a large effect. For an intervention to increase IQ it would imply an increase by 80% of a standard deviation or 12 IQ points. However, in small samples, these estimates of effect size vary a lot. To get an impression of the range of variability it is useful to compute the 95%CI around the observed effect size. It ranges from d = .10 to 1.49. This means that the actual effect size could be just 10% of a standard deviation, which in the IQ analogy would imply an increase by just 1.5 points. Essentially, the results merely suggest that there is a positive effect, but they do not provide any information about the size of the effect. It could be very small or it could be very large.
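The arithmetic behind these numbers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two compared groups are taken to have n = 18 and n = 15 (inferred from the 31 error degrees of freedom), the IQ scale is assumed to have a standard deviation of 15, and the interval uses a simple large-sample normal approximation, so its bounds come out close to, but not identical with, the interval reported above. The function names are made up for this example.

```python
import math


def d_from_f(f_value, n1, n2):
    """Cohen's d for a two-group comparison from F(1, df) and the group sizes."""
    t_value = math.sqrt(f_value)  # with one numerator df, F equals t squared
    return t_value * math.sqrt(1 / n1 + 1 / n2)


def approx_ci(d, n1, n2, z_crit=1.96):
    """Approximate 95% CI for d (normal approximation, not the exact interval)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z_crit * se, d + z_crit * se


def to_iq_points(d, sd_iq=15):
    """Translate a standardized effect into IQ points (assumes SD = 15)."""
    return d * sd_iq


# Assumed group sizes: 18 and 15 (they add up to df + 2 = 33).
d = d_from_f(5.23, 18, 15)
low, high = approx_ci(d, 18, 15)

print("d =", round(d, 2))
print("approximate 95% CI:", round(low, 2), "to", round(high, 2))
print("IQ points at d = .80:", round(to_iq_points(0.80), 1))
print("IQ points at d = .10:", round(to_iq_points(0.10), 1))
```

The point of the sketch is the width of the interval: with about 16 participants per group, the standard error of d is more than a third of a standard deviation.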

Unusual for social psychology experiments, the authors brought participants back 24 hours after the manipulation to see whether the brief exposure to positive examples had a lasting effect on IAT scores.  As the results were published, we already know that it did. The only question is how strong the evidence was.

The result remained just significant, F(1, 31) = 4.16, p = .04999. A p-value greater than .05 would be non-significant, meaning the study provided insufficient evidence for a lasting change.  More troublesome is that the 95%CI around the observed mean difference of d = .73 ranged from d = .01 to 1.45.  This means it is possible that the actual effect size is just 1% of a standard deviation or 0.15 IQ points.  The small sample size simply makes it impossible to say how large the effect really is.

Study 2

Study 1 provided encouraging results in a small sample. A logical extension for Study 2 would be to replicate the results of Study 1 with a larger sample in order to get a better sense of the size of the effect. Another possible extension could be to see whether repeated presentations of positive examples over a longer time period can have effects that last longer than 24 hours. However, multiple-study articles in social psychology are rarely programmatic in this way (Schimmack, 2012). Instead, they are more a colorful mosaic of studies that were selected to support a good story like “it is possible to combat implicit bias.”

The sample size in Study 2 was reduced from 48 to 26 participants.  This is a terrible decision because the results in Study 1 were barely significant and reducing sample sizes increases the risk of a false negative result (the intervention actually works, but the study fails to show it).

The purpose of Study 2 was to generalize the results of racial bias to aging bias.  Instead of African and European Americans, participants were exposed to positive and negative examples of young and old people and performed an age-IAT (old vs. young).

The statistical analysis showed again a significant mean difference, F(1, 24) = 5.13, p = .033.  However, the 95%CI again showed a wide range of possible effect sizes from d = .11 to 1.74.  Thus, the study provides no reliable information about the size of the effect.

Moreover, it has to be noted that Study 2 did not report whether a 24-hour follow-up was conducted or not. Thus, there is no replication of the finding in Study 1 that a small intervention can have an effect that lasts 24 hours.

Publication Bias: Another Form of Implicit Bias [the bias researchers do not want to talk about in public]

Significance tests are only valid if the data are based on a representative sample of possible observations. However, it is well-known that most journals, including social psychology journals, publish only successful studies (p < .05) and that researchers use questionable research practices to meet this requirement. Even two studies are sufficient to examine whether the results are representative or not.

The Test of Insufficient Variance examines whether reported p-values are more similar than we would expect based on a representative sample of data. Selection for significance reduces variability in p-values because p-values greater than .05 are missing.

This article reported a p-value of .023 in Study 1 and .033 in Study 2. These p-values were converted into z-values: 2.27 and 2.13, respectively. The variance for these two z-scores is 0.01. Given the small sample sizes, it was necessary to run simulations to estimate the expected variance for two independent p-values in studies with 24 and 31 degrees of freedom. The expected variance is 0.875. The probability of observing a variance of 0.01 or less with an expected variance of 0.875 is p = .085. This finding raises concerns about the assumption that the reported results were based on a representative sample of observations.
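For readers who want to retrace these steps, here is a minimal Python sketch. It makes two assumptions that should be read as such: the expected variance of 0.875 is simply taken from the simulation result quoted above (it is not recomputed here), and the ratio of observed to expected variance, multiplied by the number of studies minus one, is treated as a chi-square variable with one degree of freedom. The helper names are made up for this example.

```python
import math
from statistics import NormalDist, variance


def p_to_z(p_two_tailed):
    """Convert a two-tailed p-value into the corresponding absolute z-score."""
    return NormalDist().inv_cdf(1 - p_two_tailed / 2)


def chi2_cdf_1df(x):
    """CDF of a chi-square distribution with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2))


# p-values reported for Study 1 and Study 2 of the original article
z_scores = [p_to_z(0.023), p_to_z(0.033)]

observed_var = variance(z_scores)  # sample variance (divides by k - 1)
expected_var = 0.875               # assumption: value quoted in the text

k = len(z_scores)
ratio = (k - 1) * observed_var / expected_var

# Left-tail probability: how often would the variance be this small or smaller?
tiva_p = chi2_cdf_1df(ratio)

print("z-scores:", [round(z, 2) for z in z_scores])
print("observed variance:", round(observed_var, 3))
print("probability of a variance this small:", round(tiva_p, 3))
```

With only two studies the test has one degree of freedom, so the chi-square probability can be computed with the error function from the standard library and no extra packages are needed.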

In conclusion, the widely cited article with the promising title that scores on implicit bias measures are malleable and that it is possible to combat implicit bias provided very preliminary results that by no means provide conclusive evidence that merely presenting a few positive examples of African Americans reduces prejudice.

A Large-Scale Replication Study 

Nine years later, Joy-Gaba and Nosek (2010) examined whether the results reported by Dasgupta and Greenwald could be replicated.  The title of the article “The Surprisingly Limited Malleability of Implicit Racial Evaluations” foreshadows the results.

“Implicit preferences for Whites compared to Blacks can be reduced via exposure to admired Black and disliked White individuals (Dasgupta & Greenwald, 2001). In four studies (total N = 4,628), while attempting to clarify the mechanism, we found that implicit preferences for Whites were weaker in the “positive Blacks” exposure condition compared to a control condition (weighted average d = .08). This effect was substantially smaller than the original demonstration (Dasgupta & Greenwald, 2001; d = .82).”

On the one hand, the results can be interpreted as a successful replication because the study with 4,628 participants again rejected the null-hypothesis that the intervention has absolutely no effect. However, the mean difference in the replication study is only d = .08, which corresponds to an effect size estimate of 1.2 IQ points if the study had tried to raise IQ. Moreover, it is clear that the original study was only able to report a significant result because the observed mean difference in this study was inflated roughly tenfold.
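A rough calculation illustrates why a study of the original size could only reach significance with an inflated estimate. The sketch below compares the two effect sizes and approximates the power of a two-group design with n = 18 and n = 15 (assumed group sizes, inferred from the degrees of freedom of the original Study 1) to detect a true effect of d = .08. It uses a normal approximation to the t-test, so the power value is approximate; the function name is made up for this example.

```python
import math
from statistics import NormalDist


def approx_power(d, n1, n2, alpha=0.05):
    """Approximate two-tailed power of a two-group comparison (normal approximation)."""
    norm = NormalDist()
    se = math.sqrt(1 / n1 + 1 / n2)   # standard error of d when the true d is small
    ncp = d / se                      # expected z-score of the test statistic
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability of landing beyond either critical value
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)


original_d = 0.82      # original estimate as quoted by Joy-Gaba and Nosek
replication_d = 0.08   # weighted average in the replication studies

inflation_ratio = original_d / replication_d
power_original_design = approx_power(replication_d, 18, 15)

print("original estimate relative to replication:", round(inflation_ratio, 1))
print("approximate power of the original design:", round(power_original_design, 3))
```

Under these assumptions the power is barely above the 5% rate at which a true null effect would produce a significant result, so a significant result from such a design says little about the true effect.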

Study 1

Participants in Study 1 were Canadian students (N = 1,403). The study differed in that it separated exposure to positive Black examples and negative White examples.  Ideally, real-world training programs would aim to increase liking of African Americans rather than make people think about White people as serial killers.  So, the use of only positive examples of African Americans makes an additional contribution by examining a positive intervention without negative examples of Whites.  The study also included age to replicate Study 2.

Like US Americans, Canadian students also showed a preference for Whites over Blacks on the Implicit Association Test. So failures to replicate the intervention effect are not due to a lack of racism in Canada.

A focused analysis of the race condition showed no effect of exposure to positive Black examples, t(670) = .09, p = .93.  The 95%CI of the mean difference in this study ranged from -.15 to .16.  This means that with a maximum error probability of 5%, it is possible to rule out effect sizes greater than .16.  This finding is not entirely inconsistent with the original article because the original study was inconclusive about effect sizes.

The replication study is able to provide a more precise estimate of the effect size and the results show that the effect size could be 0, but it could not be d = .2, which is typically used as a reference point for a small effect.

Study 2a

Study 2a reintroduced the original manipulation that exposed participants to positive examples of African Americans and negative examples of European Americans.  This study showed a significant difference between the intervention condition and a control condition that exposed participants to flowers and insects, t(589) = 2.08, p = .038.  The 95%CI for the effect size estimate ranged from d = .02 to .35.

It is difficult to interpret this result in combination with the result from Study 1.  First, the results of the two studies are not significantly different from each other.  It is therefore not possible to conclude that manipulations with negative examples of Whites are more effective than those that just show positive examples of Blacks.  In combination, the results of Study 1 and 2a are not significant, meaning it is not clear whether the intervention has any effect at all.  Nevertheless, the significant result in Study 2a suggests that presenting negative examples of Whites may influence responses on the race IAT.

Study 2b

Study 2b is an exact replication of Study 2a.  It also replicated a significant mean difference between participants exposed to positive Black and negative White examples and the control condition, t(788) = 1.99, p = .047 (reported as p = .05). The 95%CI ranges  from d = .002 to d = .28.

The problem is that now three studies produced significant results with exposure to positive Black and negative White examples (Original Study 1; replication Study 2a & 2b) and all three studies had just significant p-values (p = .023, p = .038, p = .047). This is unlikely without selection of data to attain significance.

Study 3

The main purpose of Study 3 was to compare an online sample, an online student sample, and a lab student sample. None of the three samples showed a significant mean difference.

Online sample: t(999) = .96, p = .34

Online student sample: t(93) = 0.51, p = .61

Lab student sample: t(75) = 0.70, p = .48

The non-significant results for the student samples are not surprising because sample sizes are too small to detect small effects. The non-significant result for the large online sample is more interesting. It confirms that the two p-values in Studies 2a and 2b were too similar. Study 3 produces the greater variability in p-values that is expected, and, given the small effect size, variability was increased by a non-significant result rather than a highly significant one.


In conclusion, there is no reliable evidence that merely presenting a few positive Black examples alters responses on the Implicit Association Test.   There is some suggestive evidence that presenting negative White examples may reduce prejudice presumably by decreasing favorable responses to Whites, but even this effect is very weak and may not last more than a few minutes or hours.

The large replication study shows that the highly cited original article provided misleading evidence that responses on implicit bias measures can be easily and dramatically changed by presenting positive examples of African Americans. If it were this easy to reduce prejudice, racism wouldn’t be the problem that it still is.

Newest Evidence

In a major effort, Lai et al. (2016) examined several interventions that might be used to combat racism. The first problem with the article is that the literature review fails to mention Joy-Gaba and Nosek’s finding that interventions were rather ineffective or evidence that implicit racism measures show little natural variation over time (Cunningham et al., 2001). Instead they suggest that the “dominant view has changed over the past 15 years to one of implicit malleability” [by which they mean malleability of responses on implicit tasks with racial stimuli]. While this may accurately reflect changes in social psychologists’ opinions, it ignores that there is no credible evidence to suggest that implicit attitude measures are malleable.

More important, the study also failed to find evidence that a brief manipulation could change performance on the IAT a day or more later, despite a large sample size to detect even small lasting effects.  However, some manipulations produced immediate effects on IAT scores.  The strongest effect was observed for a manipulation that required vivid imagination.

Vivid counterstereotypic scenario.

Participants in this intervention read a vivid second-person story in which they are the protagonist. The participant imagines walking down a street late at night after drinking at a bar. Suddenly, a White man in his forties assaults the participant, throws him/her into the trunk of his car, and drives away. After some time, the White man opens the trunk and assaults the participant again. A young Black man notices the second assault and knocks out the White assailant, saving the day.  After reading the story, participants are told the next task (i.e., the race IAT) was supposed to affirm the associations: White = Bad, Black = Good. Participants were instructed to keep the story in mind during the IAT.

When given this instruction, the pro-White bias in the IAT was reduced.  However, one day later (Study 2) or two or three days later (Study 1) IAT performance was not significantly different from a control condition.

In conclusion, social psychologists have found out something that most people already know. Changing attitudes, including prejudice, is hard because attitudes are stable, even when participants want to change them. A simple, 5-minute manipulation is not an intervention and it will not produce lasting changes in attitudes.

General Discussion

Social psychology has failed Black people who would like to be treated with the same respect as White people and White people who do not want to be racist.

Since Martin Luther King gave his “I Have a Dream” speech, America has made progress towards the goal of racial equality without the help of social psychologists. Nevertheless, racial bias remains a problem, and social psychologists are too busy with sterile experiments that have no application to the real world. (No! Starbucks’ employees should not imagine being abducted by White sociopaths to avoid calling 911 on Black patrons of their stores.) Performance on an implicit bias test is only relevant if it predicts behavior, and it doesn’t do that very well.

The whole notion of implicit bias is a creation by social psychologists without scientific foundations, but 911 calls that kill Black people are real. Maybe Starbucks could fund some real racism research at Howard University because the mostly White professors at elite universities seem to be unable to develop and test real interventions that can influence real behavior.

And last but not least, don’t listen to self-proclaimed White experts.



Social psychologists who have failed to validate measures and failed to conduct real intervention studies that might actually work are not experts.  It doesn’t take a Ph.D. to figure out some simple things that can be taught in a one-day workshop for Starbucks’ employees.  After all, the goal is just to get employees to treat all customers equally, which doesn’t even require a change in attitudes.

Here is one simple rule.  If you are ready to call 911 to remove somebody from your coffee shop and the person is Black, ask yourself before you dial whether you would do the same if the person were White and looked like you or your brother or sister. If so, go ahead. If not, don’t touch that dial.  Let them sit at a table like you let dozens of other people sit at their table because you make most of your money from people on the go anyways. Or buy them a coffee, or do something, but think twice or three times before you call the police.

And so what if it is just a PR campaign.  It is a good one. I am sure there are a few people who would celebrate a nation-wide racism training day for police (maybe without shutting down all police stations).

Real change comes from real people who protest. Don’t wait for the academics to figure out how to combat automatic prejudice. They are more interested in citations and further research than in providing real solutions to real problems. Trust me, I know. I am (was?) a White social psychologist myself.