Category Archives: Thinking Fast and Slow

A Meta-Scientific Perspective on “Thinking: Fast and Slow

December 30, 2020Implicit Priming, Kahneman, r-index, Thinking Fast and Slow, Z-CurveUlrich Schimmack

2011 was an important year in the history of psychology, especially social psychology. First, it became apparent that one social psychologist had faked results for dozens of publications (https://en.wikipedia.org/wiki/Diederik_Stapel). Second, a highly respected journal published an article with the incredible claim that humans can foresee random events in the future, if they are presented without awareness (https://replicationindex.com/2018/01/05/bem-retraction/). Third, Nobel Laureate Daniel Kahneman published a popular book that reviewed his own work, but also many findings from social psychology (https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow).

It is likely that Kahneman’s book, or at least some of his chapters, would be very different from the actual book, if it had been written just a few years later. However, in 2011 most psychologists believed that most published results in their journals can be trusted. This changed when Bem (2011) was able to provide seemingly credible scientific evidence for paranormal phenomena nobody was willing to believe. It became apparent that even articles with several significant statistical results could not be trusted.

Kahneman also started to wonder whether some of the results that he used in his book were real. A major concern was that implicit priming results might not be replicable. Implicit priming assumes that stimuli that are presented outside of awareness can still influence behavior (e.g., you may have heard the fake story that a movie theater owner flashed a picture of a Coke bottle on the screen and that everybody rushed to the concession stand to buy a Coke without knowing why they suddenly wanted one). In 2012, Kahneman wrote a letter to the leading researcher of implicit priming studies, expressing his doubts about priming results, that attracted a lot of attention (Young, 2012).

Several years later, it has become clear that the implicit priming literature is not trustworthy and that many of the claims in Kahneman’s Chapter 4 are not based on solid empirical foundations (Schimmack, Heene, & Kesavan, 2017). Kahneman acknowledged this in a comment on our work (Kahneman, 2017).

We initially planned to present our findings for all chapters in more detail, but we got busy with other things. However, once in a while I am getting inquires about the other chapters (Engber). So, I am using some free time over the holidays to give a brief overview of the results for all chapters.

The Replicability Index (R-Index) is based on two statistics (Schimmack, 2016). One statistic is simply the percentage of significant results. In a popular book that discusses discoveries, this value is essentially 100%. The problem with selecting significant results from a broader literature is that significance alone, p < .05, does not provide sufficient information about true versus false discoveries. It also does not tell us how replicable a result is. Information about replicability can be obtained by converting the exact p-value into an estimate of statistical power. For example, p = .05 implies 50% power and p = .005 implies 80% power with alpha = .05. This is a simple mathematical transformation. As power determines the probability of a significant result, it also predicts the probability of a successful replication. A study with p = .005 is more likely to replicate than a study with p = .05.

There are two problems with point-estimates of power. One problem is that p-values are highly variable, which also produces high variability / uncertainty in power estimates. With a single p-value, the actual power could range pretty much from the minimum of .05 to the maximum of 1 for most power estimates. This problem is reduced in a meta-analysis of p-values. As more values become available, the average power estimate is closer to the actual average power.

The second problem is that selection of significant results (e.g., to write a book about discoveries) inflates power estimates. This problem can be addressed by comparing the success rate or discovery rate (i.e., the percentage of significant results) with the average power. Without publication bias, the discovery rate should match average power (Brunner & Schimmack, 2020). When publication bias is present, the discovery rate exceeds average power (Schimmack, 2012). Thus, the difference between the discovery rate (in this case 100%) and the average power estimates provides information about the extend of publication bias. The R-Index is a simple correction for the inflation that is introduced by selecting significant results. To correct for inflation the difference between the discovery rate and the average power estimate is subtracted from the mean power estimate. For example, if all studies are significant and the mean power estimate is 80%, the discrepancy is 20%, and the R-Index is 60%. If all studies are significant and the mean power estimate is only 60%, the R-Index is 20%.

When I first developed the R-Index, I assumed that it would be better to use the median (e.g.., power estimates of .50, .80, .90 would produce a median value of .80 and an R-Index of 60. However, the long-run success rate is determined by the mean. For example, .50, .80, .90 would produce a mean of .73, and an R-Index of 47. However, the median overestimates success rates in this scenario and it is more appropriate to use the mean. As a result, the R-Index results presented here differ somewhat from those shared publically in an article by Engber.

Table 1 shows the number of results that were available and the R-Index for chapters that mentioned empirical results. The chapters vary dramatically in terms of the number of studies that are presented (Table 1). The number of results ranges from 2 for chapters 14 and 16 to 55 for Chapter 5. For small sets of studies, the R-Index may not be very reliable, but it is all we have unless we do a careful analysis of each effect and replication studies.

Chapter 4 is the priming chapter that we carefully analyzed (Schimmack, Heene, & Kesavan, 2017).Table 1 shows that Chapter 4 is the worst chapter with an R-Index of 19. An R-Index below 50 implies that there is a less than 50% chance that a result will replicate. Tversky and Kahneman (1971) themselves warned against studies that provide so little evidence for a hypothesis. A 50% probability of answering multiple choice questions correctly is also used to fail students. So, we decided to give chapters with an R-Index below 50 a failing grade. Other chapters with failing grades are Chapter 3, 6, 711, 14, 16. Chapter 24 has the highest highest score (80, wich is an A- in the Canadian grading scheme), but there are only 8 results.

Chapter 24 is called “The Engine of Capitalism”

A main theme of this chapter is that optimism is a blessing and that individuals who are more optimistic are fortunate. It also makes the claim that optimism is “largely inherited” (typical estimates of heritability are about 40-50%), and that optimism contributes to higher well-being (a claim that has been controversial since it has been made, Taylor & Brown, 1988; Block & Colvin, 1994). Most of the research is based on self-ratings, which may inflate positive correlations between measures of optimism and well-being (cf. Schimmack & Kim, 2020). Of course, depressed individuals have lower well-being and tend to be pessimistic, but whether optimism is really preferable over realism remains an open question. Many other claims about optimists are made without citing actual studies.

Even some of the studies with a high R-Index seem questionable with the hindsight of 2020. For example, Fox et al.’s (2009) study of attentional biases and variation in the serotonin transporter gene is questionable because single-genetic variant research is largely considered unreliable today. Moreover, attentional-bias paradigms also have low reliability. Taken together, this implies that correlations between genetic markers and attentional bias measures are dramatically inflated by chance and unlikely to replicate.

Another problem with narrative reviews of single studies is that effect sizes are often omitted. For example, Puri and Robinson’s finding that optimism (estimates of how long you are going to live) and economic risk-taking are correlated is based on a large sample. This makes it possible to infer that there is a relationship with high confidence. A large sample also allows fairly precise estimates of the size of the relationship, which is a correlation of r = .09. A simple way to understand what this correlation means is to think about the increase in predicting in risk taking. Without any predictor, we have a 50% chance for somebody to be above or below the average (median) in risk-taking. With a predictor that is correlated r = .09, our ability to predict risk taking increases from 50% to 55%.

Even more problematic, the next article that is cited for a different claim shows a correlation of r = -.04 between a measure of over-confidence and risk-taking (Busenitz & Barney, 1997). In this study with a small sample (N = 124 entrepreneurs, N = 95 managers), over-confidence was a predictor of being an entrepreneur, z = 2.89, R-Index = .64.

The study by Cassar and Craig (2009) provides strong evidence for hindsight bias, R-Index = 1. Entrepreneurs who were unable to turn a start-up into an operating business underestimated how optimistic they were about their venture (actual: 80%, retrospective: 60%).

Sometimes claims are only loosely related to a cited article (Hmieleski & Baron, 2009). The statement “this reasoning leads to a hypothesis: the people who have the greatest influence on the lives of others are likely to be optimistic and overconfident, and to take more risks than they realize” is linked to a study that used optimism to predict revenue growth and employment growth. Optimism was a negative predictor, although the main claim was that the effect of optimism also depends on experience and dynamism.

A very robust effect was used for the claim that most people see themselves as above average on positive traits (e.g., overestimate their intelligence) (Williams & Gilovich, 2008), R-Index = 1. However, the meaning of this finding is still controversial. For example, the above average effect disappears when individuals are asked to rate themselves and familiar others (e.g., friends). In this case, ratings of others are more favorable than ratings of self (Kim et al., 2019).

Kahneman then does mention the alternative explanation for better-than-average effects (Windschitl et al., 2008). Namely rather than actually thinking that they are better than average, respondents simply respond positively to questions about qualities that they think they have without considering others or the average person. For example, most drivers have not had a major accident and that may be sufficient to say that they are a good driver. They then also rate themselves as better than the average driver without considering that most other drivers also did not have a major accident. R-Index = .92.

So, are most people really overconfident and does optimism really have benefits and increase happiness? We don’t really know, even 10 years after Kahneman wrote his book.

Meanwhile, the statistical analysis of published results has also made some progress. I analyzed all test statistics with the latest version of z-curve (Bartos & Schimmack, 2020). All test-statistics are converted into absolute z-scores that reflect the strength of evidence against the null-hypothesis that there is no effect.

The figure shows the distribution of z-scores. As the book focussed on discoveries most test-statistics are significant with p < .05 (two-tailed, which corresponds to z = 1.96. The distribution of z-scores shows that these significant results were selected from a larger set of tests that produced non-significant results. The z-curve estimate is that the significant results are only 12% of all tests that were conducted. This is a problem.

Evidently, these results are selected from a larger set of studies that produced non-significant results. These results may not even have been published (publication bias). To estimate how replicable the significant results are, z-curve estimates the mean power of the significant results. This is similar to the R-Index, but the R-Index is only an approximate correction for information. Z-curve does properly correct for the selection for significance. The mean power is 46%, which implies that only half of the results would be replicated in exact replication studies. The success rate in actual replication studies is often lower and may be as low as the estimated discovery rate (Bartos & Schimmack, 2020). So, replicability is somewhere between 12% and 46%. Even if half of the results are replicable, we do not know which results are replicable and which one’s are not. The Chapter-based analyses provide some clues which findings may be less trustworthy (implicit priming) and which ones may be more trustworthy (overconfidence), but the main conclusion is that the empirical basis for claims in “Thinking: Fast and Slow” is shaky.

Conclusion

In conclusion, Daniel Kahneman is a distinguished psychologist who has made valuable contributions to the study of human decision making. His work with Amos Tversky was recognized with a Nobel Memorial Prize in Economics (APA). It is surely interesting to read what he has to say about psychological topics that range from cognition to well-being. However, his thoughts are based on a scientific literature with shaky foundations. Like everybody else in 2011, Kahneman trusted individual studies to be robust and replicable because they presented a statistically significant result. In hindsight it is clear that this is not the case. Narrative literature reviews of individual studies reflect scientists’ intuitions (Fast Thinking, System 1) as much or more than empirical findings. Readers of “Thinking: Fast and Slow” should read the book as a subjective account by an eminent psychologists, rather than an objective summary of scientific evidence. Moreover, ten years have passed and if Kahneman wrote a second edition, it would be very different from the first one. Chapters 3 and 4 would probably just be scrubbed from the book. But that is science. It does make progress, even if progress is often painfully slow in the softer sciences.

Estimating Reproducibility of Psychology (No. 140): An Open Post-Publication Peer-Review

April 17, 2018Experimental Social Psychology, John A. Bargh, Kahneman, Priming, Publication Bias, Social Psychology, Thinking Fast and Slow, TIVAUlrich Schimmack

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology. One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors. Each study may have a different explanation. This means it is important to take an ideographic (study-centered) perspective.

The conclusions of these ideographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

The senior author of Article 140 was John A. Bargh. Other studies by John Bargh have failed to replicate and questions about his research were raised in an open letter by Noble Laureate Daniel Kahneman, who prominently features Bargh’s work in his popular book “Thinking Fast and Slow.” John Bargh’s response to this criticisms can be described as stoic defiance. In 2017, he published a book on his work that did not mention replication failures or concerns about replicability of social priming research in general. A quantitative review of the book showed that much of the cited evidence was weak.

Summary of Original Article

The article “Keeping One’s Distance: The Influence of Spatial Distance Cues on Affect and
Evaluation” by Lawrence E. Williams and John A. Bargh was published in Psychological Science. The article has been cited 155 times overall and 17 times in 2017.

The main hypothesis is that physical distance influences social judgments without reference to the self. To provide evidence for this hypothesis the authors reported four priming studies. Physical distance was primed by plotting points in a way that suggested physical closeness or distance. Mean differences indicated greater enjoyment of media depicting embarrassment, less emotional distress from violent media, lower estimates of the number of calories in unhealthy food, and weaker reports of emotional attachments to family members and hometowns in the distant prime condition than in the close prime condition.

Study 1

73 undergraduate students were assigned to three priming conditions (n = 24 per cell; 41 female, 32 male).

After the priming manipulation, participants were given an excerpt from the book “Good in Bed.” and rated their enjoyment of an article titled “Loving a larger woman.”

An ANOVA showed a significant difference between the three groups, F(2, 67) = 3.14, p = .0497 (reported as p-rep = .88). The authors wisely did not conduct post-hoc comparisons of the close or distant condition with the control condition. Instead, they reported a significant linear trend, t(67) = 2.41, p = .019.

Although men and women may respond differently to reading an article about making love to a larger woman, gender differences were not examined or reported.

Study 2

Study 2 reduced the sample size form 73 participants to 42 participants (n = 14 per cell).

The dependent variable was liking of a violent and shocking story.

Despite the loss in statistical power, the ANOVA result was significant, F(2, 39) = 4.37, p = .019. The linear contrast was also significant, t(39) = 2.62, p = .012.

Study 3

59 community members participated in Study 3.

This time, participants rated caloric content (but not liking?) of healthy and unhealthy foods .

The mixed-model ANOVA showed a significant interaction result, F(2,56) = 3.36, p = .042. The interaction was due to significant differences between the priming conditions for unhealthy foods, F(2,56) = 5.62, p = .006. There were no significant differences for healthy foods. No linear contrasts were reported. The Figure shows that the significant effect was mainly due to a lower estimate of calories in the distant priming condition.

Study 4

84 students participated in Study 4.

The dependent variable was the average of closeness ratings to siblings, parents, and hometown.

The ANOVA result was significant, F(2,81) = 4.97, p = .009. The linear contrast also showed a significant trend for lower closeness ratings after distance priming, t(81) = 2.86, p = .005.

Replicability Assessment

All four studies showed significant results with p-values (.019, .012, .006, and .005). These p-values correspond to z-scores of z = 2.35, 2.51, 2.75, and 2.81. Median observed power is 75%, while the success rate is 100%. The Replicability Index corrects for inflation in median observed power when the success rate is higher than median observed power. An R-Index of 50% (75% – 25% Inflation) is not terrible, but also not very reassuring. In the grading scheme of the R-Index it is a D- (in comparison, many chapters in Bargh’s book have an Index below 50% and an F).

More concerning is that the variance of the four z-scores is only var(z) = 0.046, when sampling error of independent studies should be 1. The Test of Insufficient Variance (TIVA) shows that such a small variance or even less variance would occur in four studies with a probability of p = .013. This suggests that some non-random factors contributed to the observed results.

These results suggest that it is difficult to replicate the reported results because the reported effect sizes may be inflated by the use of questionable research practices.

Replication Study

Consistent with the OSC project guidelines, the authors replicated the last study in the article.

The sample size was moderately larger than in the original study (N = 133 vs. 84).

The simple procedure was reproduced exactly.

The ANOVA result was not significant, F(2,122) = 0.24, p = .79. The linear contrast was also not significant, t(122) = -0.59, p = .56.

The pattern of means showed a slightly higher closeness (to family) rating after priming closeness (M = 5.44, SD = 0.83) than in the control condition (M = 5.31, SD = 1.07). The mean for the distance priming condition was identical to the control condition (M = 5.31, SD = 1.15).

In conclusion, the replication study failed to replicate the original finding. Given the simplicity of the study, there are no obvious differences between the studies that could explain the replication failure. The most plausible explanation is that the original article reported inflated effect sizes.

Other Replication Failures

A previous replication attempt failed to replicate the results of Studies 3 and 4 (Harold Pashler, Noriko Coburn, Christine R. Harris, 2012).

The replication of Study 3 had 92 participants (vs. 59 in the original study). The study failed to replicate the interaction between distance priming and healthiness of food, F(2,85) = 0.45, p = .64. The replication of Study 4 also had 92 participants, although some were excluded because they did not have siblings or parents or could not perform the plotting task. The final sample size was N = 71 (vs. 84 in the original study). It also failed to replicate the original result, F(2,68) = 0.31.

Pashler et al. (2012) found no plausible explanation for their replication failure, but they did not consider QRPs. In contrast, the bias analyses presented here suggest that QRPs were used to report only supportive evidence for distance priming. If QRPs were used, it is not surprising that unbiased replication attempts fail to produce biased results.

Conclusion

Williams and Bargh (2008) proposed that a simple geometric task could prime distance or closeness and alter judgments of liking (close = like). Although they presented four studies with significant results, the evidence is not conclusive because bias tests suggest that the results are too good to be true. This impression was confirmed by a replication failure in a replication study that had slightly more power to detect the predicted effect due to a larger sample.

Although the replication failure was reported in 2015, the original article continues to be cited as if no replication failure occurred or the replication failure can be dismissed. A careful bias analysis suggests that the original results do not provide credible evidence for distance priming and the article should not be cited as evidence for it. Unless future studies with larger samples provide credible evidence for the effect, it remains doubtful that a simple geometric drawing task can alter evaluations of closeness to family members.

This replication failure has to be considered in the context of other replication failures of studies by John Bargh and social priming studies in general. As noted by Daniel Kahneman in his 2012 letter to John Bargh.

Bargh’s response to this letter may be described as willful attentional blindness. Not addressing concerns about replicability may be one way to maintain confidence in oneself in the face of criticism and doubt by others, but it is not good science. Experimental social psychologists who still believe in social priming effects like distance priming need to demonstrate replicability with high powered studies. So far, the results have not been encouraging (see failure to replicate professor priming).

P.S If you liked this blog post, the reason is that I primed you with a closeness prime (featured image). In reality, this blog post is terrible. If you didn’t like the blog post, the priming manipulation didn’t work (ironic isn’t it).

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

February 2, 2017Kahneman, Priming, r-index, Statistical Power, Thinking Fast and SlowUlrich Schimmack

This blog post focusses on Chapter 4 about Implicit Priming in Kahneman’s book “Thinking” Fast and Slow.” A review of the book and other chapters can be found here: https://replicationindex.com/2020/12/30/a-meta-scientific-perspective-on-thinking-fast-and-slow/

Daniel Kahneman’s response to this blog post:
https://replicationindex.com/2017/02/02/reconstruction-of-a-train-wreck-how-priming-research-went-of-the-rails/comment-page-1/#comment-1454

Authors: Ulrich Schimmack, Moritz Heene, and Kamini Kesavan

Abstract:
We computed the R-Index for studies cited in Chapter 4 of Kahneman’s book “Thinking Fast and Slow.” This chapter focuses on priming studies, starting with John Bargh’s study that led to Kahneman’s open email. The results are eye-opening and jaw-dropping. The chapter cites 12 articles and 11 of the 12 articles have an R-Index below 50. The combined analysis of 31 studies reported in the 12 articles shows 100% significant results with average (median) observed power of 57% and an inflation rate of 43%. The R-Index is 14. This result confirms Kahneman’s prediction that priming research is a train wreck and readers of his book “Thinking Fast and Slow” should not consider the presented studies as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.

Introduction

In 2011, Nobel Laureate Daniel Kahneman published a popular book, “Thinking Fast and Slow”, about important finding in social psychology.

In the same year, questions about the trustworthiness of social psychology were raised. A Dutch social psychologist had fabricated data. Eventually over 50 of his articles would be retracted. Another social psychologist published results that appeared to demonstrate the ability to foresee random future events (Bem, 2011). Few researchers believed these results and statistical analysis suggested that the results were not trustworthy (Francis, 2012; Schimmack, 2012). Psychologists started to openly question the credibility of published results.

In the beginning of 2012, Doyen and colleagues published a failure to replicate a prominent study by John Bargh that was featured in Daniel Kahneman’s book. A few month later, Daniel Kahneman distanced himself from Bargh’s research in an open email addressed to John Bargh (Young, 2012):

“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research… people have now attached a question mark to the field, and it is your responsibility to remove it… all I have personally at stake is that I recently wrote a book that emphasizes priming research as a new approach to the study of associative memory…Count me as a general believer… My reason for writing this letter is that I see a train wreck looming.”

Five years later, Kahneman’s concerns have been largely confirmed. Major studies in social priming research have failed to replicate and the replicability of results in social psychology is estimated to be only 25% (OSC, 2015).

Looking back, it is difficult to understand the uncritical acceptance of social priming as a fact. In “Thinking Fast and Slow” Kahneman wrote “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Yet, Kahneman could have seen the train wreck coming. In 1971, he co-authored an article about scientists’ “exaggerated confidence in the validity of conclusions based on small samples” (Tversky & Kahneman, 1971, p. 105). Yet, many of the studies described in Kahneman’s book had small samples. For example, Bargh’s priming study used only 30 undergraduate students to demonstrate the effect.

Replicability Index

Small samples can be sufficient to detect large effects. However, small effects require large samples. The probability of replicating a published finding is a function of sample size and effect size. The Replicability Index (R-Index) makes it possible to use information from published results to predict how replicable published results are.

Every reported test-statistic can be converted into an estimate of power, called observed power. For a single study, this estimate is useless because it is not very precise. However, for sets of studies, the estimate becomes more precise. If we have 10 studies and the average power is 55%, we would expect approximately 5 to 6 studies with significant results and 4 to 5 studies with non-significant results.

If we observe 100% significant results with an average power of 55%, it is likely that studies with non-significant results are missing (Schimmack, 2012). There are too many significant results. This is especially true because average power is also inflated when researchers report only significant results. Consequently, the true power is even lower than average observed power. If we observe 100% significant results with 55% average powered power, power is likely to be less than 50%.

This is unacceptable. Tversky and Kahneman (1971) wrote “we refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis.”

To correct for the inflation in power, the R-Index uses the inflation rate. For example, if all studies are significant and average power is 75%, the inflation rate is 25% points. The R-Index subtracts the inflation rate from average power. So, with 100% significant results and average observed power of 75%, the R-Index is 50% (75% – 25% = 50%). The R-Index is not a direct estimate of true power. It is actually a conservative estimate of true power if the R-Index is below 50%. Thus, an R-Index below 50% suggests that a significant result was obtained only by capitalizing on chance, although it is difficult to quantify by how much.

How Replicable are the Social Priming Studies in “Thinking Fast and Slow”?

Chapter 4: The Associative Machine

4.1. Cognitive priming effect

In the 1980s, psychologists discovered that exposure to a word causes immediate and measurable changes in the ease with which many related words can be evoked.

[no reference provided]

4.2. Priming of behavior without awareness

Another major advance in our understanding of memory was the discovery that priming is not restricted to concepts and words. You cannot know this from conscious experience, of course, but you must accept the alien idea that your actions and your emotions can be primed by events of which you are not even aware.

“In an experiment that became an instant classic, the psychologist John Bargh and his collaborators asked students at New York University—most aged eighteen to twenty-two—to assemble four-word sentences from a set of five words (for example, “finds he it yellow instantly”). For one group of students, half the scrambled sentences contained words associated with the elderly, such as Florida, forgetful, bald, gray, or wrinkle. When they had completed that task, the young participants were sent out to do another experiment in an office down the hall. That short walk was what the experiment was about. The researchers unobtrusively measured the time it took people to get from one end of the corridor to the other.”

“As Bargh had predicted, the young people who had fashioned a sentence from words with an elderly theme walked down the hallway significantly more slowly than the others. walking slowly, which is associated with old age.”

“All this happens without any awareness. When they were questioned afterward, none of the students reported noticing that the words had had a common theme, and they all insisted that nothing they did after the first experiment could have been influenced by the words they had encountered. The idea of old age had not come to their conscious awareness, but their actions had changed nevertheless.“

[John A. Bargh, Mark Chen, and Lara Burrows, “Automaticity of Social Behavior: Direct Effects of Trait Construct and Stereotype Activation on Action,” Journal of Personality and Social Psychology 71 (1996): 230–44.]

t(28)=2.86	0.008	2.66	0.76
t(28)=2.16	0.039	2.06	0.54

MOP = .65, Inflation = .35, R-Index = .30

4.3. Reversed priming: Behavior primes cognitions

“The ideomotor link also works in reverse. A study conducted in a German university was the mirror image of the early experiment that Bargh and his colleagues had carried out in New York.”

“Students were asked to walk around a room for 5 minutes at a rate of 30 steps per minute, which was about one-third their normal pace. After this brief experience, the participants were much quicker to recognize words related to old age, such as forgetful, old, and lonely.”

“Reciprocal priming effects tend to produce a coherent reaction: if you were primed to think of old age, you would tend to act old, and acting old would reinforce the thought of old age.”

t(18)=2.10	0.050	1.96	0.50
t(35)=2.10	0.043	2.02	0.53
t(31)=2.50	0.018	2.37	0.66

MOP = .53, Inflation = .47, R-Index = .06

4.4. Facial-feedback hypothesis (smiling makes you happy)

“Reciprocal links are common in the associative network. For example, being amused tends to make you smile, and smiling tends to make you feel amused….”

“College students were asked to rate the humor of cartoons from Gary Larson’s The Far Side while holding a pencil in their mouth. Those who were “smiling” (without any awareness of doing so) found the cartoons funnier than did those who were “frowning.”

[“Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis,” Journal of Personality and Social Psychology 54 (1988): 768–77.]

The authors used the more liberal and unconventional criterion of p < .05 (one-tailed), z = 1.65, as a criterion for significance. Accordingly, we adjusted the R-Index analysis and used 1.65 as the criterion value.

t(89)=1.85	0.034	1.83	0.57
t(75)=1.78	0.034	1.83	0.57

MOP = .57, Inflation = .43, R-Index = .14

These results could not be replicated in a large replication effort with 17 independent labs. Not a single lab produced a significant result and even a combined analysis failed to show any evidence for the effect.

4.5. Automatic Facial Responses

In another experiment, people whose face was shaped into a frown (by squeezing their eyebrows together) reported an enhanced emotional response to upsetting pictures—starving children, people arguing, maimed accident victims.

[Ulf Dimberg, Monika Thunberg, and Sara Grunedal, “Facial Reactions to

Emotional Stimuli: Automatically Controlled Emotional Responses,” Cognition and Emotion, 16 (2002): 449–71.]

The description in the book does not match any of the three studies reported in this article. The first two studies examined facial muscle movements in response to pictures of facial expressions (smiling or frowning faces). The third study used emotional pictures of snakes and flowers. We might consider the snake pictures as being equivalent to pictures of starving children or maimed accident victims. Participants were also asked to frown or to smile while looking at the pictures. However, the dependent variable was not how they felt in response to pictures of snakes, but rather how their facial muscles changed. Aside from a strong effect of instructions, the study also found that the emotional picture had an automatic effect on facial muscles. Participants frowned more when instructed to frown and looking at a snake picture than when instructed to frown and looking at a picture of a flower. “This response, however, was larger to snakes than to flowers as indicated by both the Stimulus factor, F(1, 47) = 6.66, p < .02, and the Stimulus 6 Interval factor, F(1, 47) = 4.30, p < .05.” (p. 463). The evidence for smiling was stronger. “The zygomatic major muscle response was larger to flowers than to snakes, which was indicated by both the Stimulus factor, F(1, 47) = 18.03, p < .001, and the Stimulus 6 Interval factor, F(1, 47) = 16.78, p < .001.” No measures of subjective experiences were included in this study. Therefore, the results of this study provide no evidence for Kahneman’s claim in the book and the results of this study are not included in our analysis.

4.6. Effects of Head-Movements on Persuasion

“Simple, common gestures can also unconsciously influence our thoughts and feelings.”

“In one demonstration, people were asked to listen to messages through new headphones. They were told that the purpose of the experiment was to test the quality of the audio equipment and were instructed to move their heads repeatedly to check for any distortions of sound. Half the participants were told to nod their head up and down while others were told to shake it side to side. The messages they heard were radio editorials.”

“Those who nodded (a yes gesture) tended to accept the message they heard, but those who shook their head tended to reject it. Again, there was no awareness, just a habitual connection between an attitude of rejection or acceptance and its common physical expression.”

F(2,66)=44.70

0.000

7.22

1.00

MOP = 1.00, Inflation = .00, R-Index = 1.00

[Gary L. Wells and Richard E. Petty, “The Effects of Overt Head Movements on Persuasion: Compatibility and Incompatibility of Responses,” Basic and Applied Social Psychology, 1, (1980): 219–30.]

4.7 Location as Prime

“Our vote should not be affected by the location of the polling station, for example, but it is.”

“A study of voting patterns in precincts of Arizona in 2000 showed that the support for propositions to increase the funding of schools was significantly greater when the polling station was in a school than when it was in a nearby location.”

“A separate experiment showed that exposing people to images of classrooms and school lockers also increased the tendency of participants to support a school initiative. The effect of the images was larger than the difference between parents and other voters!”

[Jonah Berger, Marc Meredith, and S. Christian Wheeler, “Contextual Priming: Where People Vote Affects How They Vote,” PNAS 105 (2008): 8846–49.]

z = 2.10	0.036	2.10	0.56
p = .05	0.050	1.96	0.50

MOP = .53, Inflation = .47, R-Index = .06

4.8 Money Priming

“Reminders of money produce some troubling effects.”

“Participants in one experiment were shown a list of five words from which they were required to construct a four-word phrase that had a money theme (“high a salary desk paying” became “a high-paying salary”).”

“Other primes were much more subtle, including the presence of an irrelevant money-related object in the background, such as a stack of Monopoly money on a table, or a computer with a screen saver of dollar bills floating in water.”

“Money-primed people become more independent than they would be without the associative trigger. They persevered almost twice as long in trying to solve a very difficult problem before they asked the experimenter for help, a crisp demonstration of increased self-reliance.”

“Money-primed people are also more selfish: they were much less willing to spend time helping another student who pretended to be confused about an experimental task. When an experimenter clumsily dropped a bunch of pencils on the floor, the participants with money (unconsciously) on their mind picked up fewer pencils.”

“In another experiment in the series, participants were told that they would shortly have a get-acquainted conversation with another person and were asked to set up two chairs while the experimenter left to retrieve that person. Participants primed by money chose to stay much farther apart than their nonprimed peers (118 vs. 80 centimeters).”

“Money-primed undergraduates also showed a greater preference for being alone.”

[Kathleen D. Vohs, “The Psychological Consequences of Money,” Science 314 (2006): 1154–56.]

F(2,49)=3.73	0.031	2.16	0.58
t(35)=2.03	0.050	1.96	0.50
t(37)=2.06	0.046	1.99	0.51
t(42)=2.13	0.039	2.06	0.54
F(2,32)=4.34	0.021	2.30	0.63
t(38)=2.13	0.040	2.06	0.54
t(33)=2.37	0.024	2.26	0.62
F(2,58)=4.04	0.023	2.28	0.62
chi^2(2)=10.10	0.006	2.73	0.78

MOP = .58, Inflation = .42, R-Index = .16

4.9 Death Priming

“The evidence of priming studies suggests that reminding people of their mortality increases the appeal of authoritarian ideas, which may become reassuring in the context of the terror of death.”

The cited article does not directly examine this question. The abstract states that “three experiments were conducted to test the hypothesis, derived from terror management theory, that reminding people of their mortality increases attraction to those who consensually validate their beliefs and decreases attraction to those who threaten their beliefs” (p. 308). Study 2 found no general effect of death priming. Rather, the effect was qualified by authoritarianism. Mortality salience enhanced the rejection of dissimilar others in Study 2 only among high authoritarian subjects.” (p. 314), based on a three-way interaction with F(1,145) = 4.08, p = .045. We used the three-way interaction for the computation of the R-Index. Study 1 reported opposite effects for ratings of Christian targets, t(44) = 2.18, p = .034 and Jewish targets, t(44)= 2.08, p = .043. As these tests are dependent, only one test could be used, and we chose the slightly stronger result. Similarly, Study 3 reported significantly more liking of a positive interviewee and less liking of a negative interviewee, t(51) = 2.02, p = .049 and t(49) = 2.42, p = .019, respectively. We chose the stronger effect.

[Jeff Greenberg et al., “Evidence for Terror Management Theory II: The Effect of Mortality Salience on Reactions to Those Who Threaten or Bolster the Cultural Worldview,” Journal of Personality and Social Psychology]

t(44)=2.18	0.035	2.11	0.56
F(1,145)=4.08	0.045	2.00	0.52
t(49)=2.42	0.019	2.34	0.65

MOP = .56, Inflation = .44, R-Index = .12

4.10 The “Lacy Macbeth Effect”

“For example, consider the ambiguous word fragments W_ _ H and S_ _ P. People who were recently asked to think of an action of which they are ashamed are more likely to complete those fragments as WASH and SOAP and less likely to see WISH and SOUP.”

“Furthermore, merely thinking about stabbing a coworker in the back leaves people more inclined to buy soap, disinfectant, or detergent than batteries, juice, or candy bars. Feeling that one’s soul is stained appears to trigger a desire to cleanse one’s body, an impulse that has been dubbed the “Lady Macbeth effect.”

[Lady Macbeth effect”: Chen-Bo Zhong and Katie Liljenquist, “Washing Away Your Sins:

Threatened Morality and Physical Cleansing,” Science 313 (2006): 1451–52.]

F(1,58)=4.26	0.044	2.02	0.52
F(1,25)=6.99	0.014	2.46	0.69

MOP = .61, Inflation = .39, R-Index = .22

The article reports two more studies that are not explicitly mentioned, but are used as empirical support for the Lady Macbeth effect. As the results of these studies were similar to those in the mentioned studies, including these tests in our analysis does not alter the conclusions.

chi^2(1)=4.57	0.033	2.14	0.57
chi^2(1)=5.02	0.025	2.24	0.61

MOP = .59, Inflation = .41, R-Index = .18

4.11 Modality Specificity of the “Lacy Macbeth Effect”

“Participants in an experiment were induced to “lie” to an imaginary person, either on the phone or in e-mail. In a subsequent test of the desirability of various products, people who had lied on the phone preferred mouthwash over soap, and those who had lied in e-mail preferred soap to mouthwash.”

[Spike Lee and Norbert Schwarz, “Dirty Hands and Dirty Mouths: Embodiment of the Moral-Purity Metaphor Is Specific to the Motor Modality Involved in Moral Transgression,” Psychological Science 21 (2010): 1423–25.]

The results are presented as significant with a one-sided t-test. “As shown in Figure 1a, participants evaluated mouthwash more positively after lying in a voice mail (M = 0.21, SD = 0.72) than after lying in an e-mail (M = –0.26, SD = 0.94), F(1, 81) = 2.93, p = .03 (one-tailed), d = 0.55 (simple main effect), but evaluated hand sanitizer more positively after lying in an e-mail (M = 0.31, SD = 0.76) than after lying in a voice mail (M = –0.12, SD = 0.86), F(1, 81) = 3.25, p = .04 (one-tailed), d = 0.53 (simple main effect).” We adjusted the significance criterion for the R-Index accordingly.

F(1,81)=2.93	0.045	1.69	0.52
F(1,81)=3.25	0.038	1.78	0.55

MOP = .54, Inflation = .46, R-Index = .08

4.12 Eyes on You

“On the first week of the experiment (which you can see at the bottom of the figure), two wide-open eyes stare at the coffee or tea drinkers, whose average contribution was 70 pence per liter of milk. On week 2, the poster shows flowers and average contributions drop to about 15 pence. The trend continues. On average, the users of the kitchen contributed almost three times as much in ’eye weeks’ as they did in ’flower weeks.’ ”

[Melissa Bateson, Daniel Nettle, and Gilbert Roberts, “Cues of Being Watched Enhance Cooperation in a Real-World Setting,” Biology Letters 2 (2006): 412–14.]

F(1,7)=11.55

0.011

2.53

0.72

MOP = .72, Inflation = .28, R-Index = .44

Combined Analysis

We then combined the results from the 31 studies mentioned above. While the R-Index for small sets of studies may underestimate replicability, the R-Index for a large set of studies is more accurate. Median Obesrved Power for all 31 studies is only 57%. It is incredible that 31 studies with 57% power could produce 100% significant results (Schimmack, 2012). Thus, there is strong evidence that the studies provide an overly optimistic image of the robustness of social priming effects. Moreover, median observed power overestimates true power if studies were selected to be significant. After correcting for inflation, the R-Index is well below 50%. This suggests that the studies have low replicability. Moreover, it is possible that some of the reported results are actually false positive results. Just like the large-scale replication of the facial feedback studies failed to provide any support for the original findings, other studies may fail to show any effects in large replication projects. As a result, readers of “Thinking Fast and Slow” should be skeptical about the reported results and they should disregard Kahneman’s statement that “you have no choice but to accept that the major conclusions of these studies are true.” Our analysis actually leads to the opposite conclusion. “You should not accept any of the conclusions of these studies as true.”

k = 31, MOP = .57, Inflation = .43, R-Index = .14, Grade: F for Fail

Powergraph of Chapter 4

Schimmack and Brunner (2015) developed an alternative method for the estimation of replicability. This method takes into account that power can vary across studies. It also provides 95% confidence intervals for the replicability estimate. The results of this method are presented in the Figure above. The replicability estimate is similar to the R-Index, with 14% replicability. However, due to the small set of studies, the 95% confidence interval is wide and includes values above 50%. This does not mean that we can trust the published results, but it does suggest that some of the published results might be replicable in larger replication studies with more power to detect small effects. At the same time, the graph shows clear evidence for a selection effect. That is, published studies in these articles do not provide a representative picture of all the studies that were conducted. The powergraph shows that there should have been a lot more non-significant results than were reported in the published articles. The selective reporting of studies that worked is at the core of the replicability crisis in social psychology (Sterling, 1959, Sterling et al., 1995; Schimmack, 2012). To clean up their act and to regain trust in published results, social psychologists have to conduct studies with larger samples that have more than 50% power (Tversky & Kahneman, 1971) and they have to stop reporting only significant results. We can only hope that social psychologists will learn from the train wreck of social priming research and improve their research practices.

Replicability-Index

Improving the replicability of empirical research

Category Archives: Thinking Fast and Slow

A Meta-Scientific Perspective on “Thinking: Fast and Slow

Conclusion

Reconstruction of a Train Wreck: How Priming Research Went off the Rails