2011 was an important year in the history of psychology, especially social psychology. First, it became apparent that one social psychologist had faked results for dozens of publications (https://en.wikipedia.org/wiki/Diederik_Stapel). Second, a highly respected journal published an article with the incredible claim that humans can foresee random events in the future, if they are presented without awareness (https://replicationindex.com/2018/01/05/bem-retraction/). Third, Nobel Laureate Daniel Kahneman published a popular book that reviewed his own work, but also many findings from social psychology (https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow).
It is likely that Kahneman’s book, or at least some of its chapters, would look very different if it had been written just a few years later. In 2011, however, most psychologists still believed that most published results in their journals could be trusted. This changed when Bem (2011) provided seemingly credible scientific evidence for paranormal phenomena that nobody was willing to believe. It became apparent that even articles with several statistically significant results could not be trusted.
Kahneman also started to wonder whether some of the results that he used in his book were real. A major concern was that implicit priming results might not be replicable. Implicit priming assumes that stimuli that are presented outside of awareness can still influence behavior (e.g., you may have heard the fake story that a movie theater owner flashed a picture of a Coke bottle on the screen and that everybody rushed to the concession stand to buy a Coke without knowing why they suddenly wanted one). In 2012, Kahneman wrote a letter to the leading researcher of implicit priming studies, expressing his doubts about priming results, which attracted a lot of attention (Young, 2012).
Several years later, it has become clear that the implicit priming literature is not trustworthy and that many of the claims in Kahneman’s Chapter 4 are not based on solid empirical foundations (Schimmack, Heene, & Kesavan, 2017). Kahneman acknowledged this in a comment on our work (Kahneman, 2017).
We initially planned to present our findings for all chapters in more detail, but we got busy with other things. However, once in a while I get inquiries about the other chapters (Engber). So, I am using some free time over the holidays to give a brief overview of the results for all chapters.
The Replicability Index (R-Index) is based on two statistics (Schimmack, 2016). One statistic is simply the percentage of significant results. In a popular book that discusses discoveries, this value is essentially 100%. The problem with selecting significant results from a broader literature is that significance alone, p < .05, does not provide sufficient information about true versus false discoveries. It also does not tell us how replicable a result is. Information about replicability can be obtained by converting the exact p-value into an estimate of statistical power. For example, p = .05 implies 50% power and p = .005 implies 80% power with alpha = .05. This is a simple mathematical transformation. As power determines the probability of a significant result, it also predicts the probability of a successful replication. A study with p = .005 is more likely to replicate than a study with p = .05.
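As an illustration of this transformation, here is a minimal sketch (my own, using the standard normal approximation and alpha = .05, two-tailed):

```python
# Minimal sketch of the p-value -> observed-power transformation described above,
# using the standard normal approximation (alpha = .05, two-tailed).
from scipy.stats import norm

def observed_power(p, alpha=0.05):
    """Convert a two-tailed p-value into an estimate of power."""
    z = norm.isf(p / 2)            # strength of evidence as an absolute z-score
    z_crit = norm.isf(alpha / 2)   # 1.96 for alpha = .05
    return norm.cdf(z - z_crit)    # probability of exceeding the criterion again

print(observed_power(0.05))   # ~0.50
print(observed_power(0.005))  # ~0.80
```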
There are two problems with point-estimates of power. One problem is that p-values are highly variable, which also makes power estimates highly variable and uncertain. Based on a single p-value, the actual power could range from close to the minimum of .05 to the maximum of 1. This problem is reduced in a meta-analysis of p-values: as more values become available, the average power estimate gets closer to the actual average power.
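To illustrate how noisy a single-study estimate is, here is a small simulation (my own illustration, not from the original analyses, assuming a true power of 50% under a simple normal model):

```python
# Illustration: how variable single-study power estimates are when true power is 50%,
# and how the average across many studies recovers the actual value.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
z_crit = norm.isf(0.025)                            # 1.96
z = rng.normal(loc=z_crit, scale=1, size=10_000)    # true power ~50% by construction
power_hat = norm.cdf(np.abs(z) - z_crit)            # single-study power estimates

print(np.quantile(power_hat, [0.025, 0.975]))  # roughly .03 to .98: a single estimate is very uncertain
print(power_hat.mean())                        # close to .50, the true power
```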
The second problem is that selection of significant results (e.g., to write a book about discoveries) inflates power estimates. This problem can be addressed by comparing the success rate or discovery rate (i.e., the percentage of significant results) with the average power. Without publication bias, the discovery rate should match average power (Brunner & Schimmack, 2020). When publication bias is present, the discovery rate exceeds average power (Schimmack, 2012). Thus, the difference between the discovery rate (in this case 100%) and the average power estimate provides information about the extent of publication bias. The R-Index is a simple correction for the inflation that is introduced by selecting significant results. To correct for inflation, the difference between the discovery rate and the average power estimate is subtracted from the mean power estimate. For example, if all studies are significant and the mean power estimate is 80%, the discrepancy is 20%, and the R-Index is 60%. If all studies are significant and the mean power estimate is only 60%, the R-Index is 20%.
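In code, the correction described above amounts to a one-line adjustment (a minimal sketch, assuming a discovery rate of 100% as in a book of discoveries):

```python
# Sketch of the R-Index correction: subtract the inflation
# (discovery rate minus mean observed power) from the mean observed power.
def r_index(mean_power, discovery_rate=1.0):
    inflation = discovery_rate - mean_power
    return mean_power - inflation   # equivalently: 2 * mean_power - discovery_rate

print(r_index(0.80))  # 0.60, as in the first example above
print(r_index(0.60))  # 0.20, as in the second example
```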
When I first developed the R-Index, I assumed that it would be better to use the median (e.g., power estimates of .50, .80, and .90 would produce a median of .80 and an R-Index of 60). However, the long-run success rate is determined by the mean: .50, .80, and .90 produce a mean of .73 and an R-Index of 47. Because the median overestimates success rates in this scenario, it is more appropriate to use the mean. As a result, the R-Index results presented here differ somewhat from those shared publicly in an article by Engber.
Table 1 shows the number of results that were available and the R-Index for chapters that mentioned empirical results. The chapters vary dramatically in terms of the number of studies that are presented (Table 1). The number of results ranges from 2 for chapters 14 and 16 to 55 for Chapter 5. For small sets of studies, the R-Index may not be very reliable, but it is all we have unless we do a careful analysis of each effect and replication studies.
Chapter 4 is the priming chapter that we carefully analyzed (Schimmack, Heene, & Kesavan, 2017). Table 1 shows that Chapter 4 is the worst chapter with an R-Index of 19. An R-Index below 50 implies that there is a less than 50% chance that a result will replicate. Tversky and Kahneman (1971) themselves warned against studies that provide so little evidence for a hypothesis. A 50% probability of answering multiple choice questions correctly is also used to fail students. So, we decided to give chapters with an R-Index below 50 a failing grade. Other chapters with failing grades are Chapters 3, 6, 7, 11, 14, and 16. Chapter 24 has the highest score (80, which is an A- in the Canadian grading scheme), but there are only 8 results.
Chapter 24 is called “The Engine of Capitalism.”
A main theme of this chapter is that optimism is a blessing and that individuals who are more optimistic are fortunate. It also makes the claim that optimism is “largely inherited” (typical estimates of heritability are about 40-50%), and that optimism contributes to higher well-being (a claim that has been controversial ever since it was made, Taylor & Brown, 1988; Block & Colvin, 1994). Most of the research is based on self-ratings, which may inflate positive correlations between measures of optimism and well-being (cf. Schimmack & Kim, 2020). Of course, depressed individuals have lower well-being and tend to be pessimistic, but whether optimism is really preferable to realism remains an open question. Many other claims about optimists are made without citing actual studies.
Even some of the studies with a high R-Index seem questionable with the hindsight of 2020. For example, Fox et al.’s (2009) study of attentional biases and variation in the serotonin transporter gene is questionable because single-genetic variant research is largely considered unreliable today. Moreover, attentional-bias paradigms also have low reliability. Taken together, this implies that correlations between genetic markers and attentional bias measures are dramatically inflated by chance and unlikely to replicate.
Another problem with narrative reviews of single studies is that effect sizes are often omitted. For example, Puri and Robinson’s finding that optimism (estimates of how long you are going to live) and economic risk-taking are correlated is based on a large sample. This makes it possible to infer that there is a relationship with high confidence. A large sample also allows fairly precise estimates of the size of the relationship, which is a correlation of r = .09. A simple way to understand what this correlation means is to think about the gain in accuracy when predicting risk taking. Without any predictor, we have a 50% chance of correctly guessing whether somebody is above or below the average (median) in risk-taking. With a predictor that is correlated r = .09, our ability to predict risk taking increases from 50% to 55%.
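The 50%-to-55% figure can be reproduced with the binomial effect size display (success rate = .50 + r/2); a tiny sketch, assuming this is the intended calculation:

```python
# Binomial effect size display: a correlation r translates into a predicted
# "hit rate" of .50 + r/2 for guessing above/below the median.
def besd_success_rate(r):
    return 0.50 + r / 2

print(besd_success_rate(0.09))  # 0.545, i.e., about 55%
```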
Even more problematic, the next article that is cited for a different claim shows a correlation of r = -.04 between a measure of over-confidence and risk-taking (Busenitz & Barney, 1997). In this study with a small sample (N = 124 entrepreneurs, N = 95 managers), over-confidence was a predictor of being an entrepreneur, z = 2.89, R-Index = .64.
The study by Cassar and Craig (2009) provides strong evidence for hindsight bias, R-Index = 1. Entrepreneurs who were unable to turn a start-up into an operating business underestimated how optimistic they were about their venture (actual: 80%, retrospective: 60%).
Sometimes claims are only loosely related to a cited article (Hmieleski & Baron, 2009). The statement “this reasoning leads to a hypothesis: the people who have the greatest influence on the lives of others are likely to be optimistic and overconfident, and to take more risks than they realize” is linked to a study that used optimism to predict revenue growth and employment growth. Optimism was a negative predictor, although the main claim was that the effect of optimism also depends on experience and dynamism.
A very robust effect was used for the claim that most people see themselves as above average on positive traits (e.g., overestimate their intelligence) (Williams & Gilovich, 2008), R-Index = 1. However, the meaning of this finding is still controversial. For example, the above average effect disappears when individuals are asked to rate themselves and familiar others (e.g., friends). In this case, ratings of others are more favorable than ratings of self (Kim et al., 2019).
Kahneman then does mention the alternative explanation for better-than-average effects (Windschitl et al., 2008). Namely, rather than actually thinking that they are better than average, respondents simply respond positively to questions about qualities they think they have, without considering others or the average person. For example, most drivers have not had a major accident and that may be sufficient to say that they are a good driver. They then also rate themselves as better than the average driver without considering that most other drivers also did not have a major accident. R-Index = .92.
So, are most people really overconfident and does optimism really have benefits and increase happiness? We don’t really know, even 10 years after Kahneman wrote his book.
Meanwhile, the statistical analysis of published results has also made some progress. I analyzed all test statistics with the latest version of z-curve (Bartos & Schimmack, 2020). All test-statistics are converted into absolute z-scores that reflect the strength of evidence against the null-hypothesis that there is no effect.
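Purely as an illustration (my own rough Python sketch, not the z-curve package itself, and assuming two-tailed t-tests and 1-df F-tests), the conversion step from reported test statistics to absolute z-scores looks like this:

```python
# Rough sketch of converting reported test statistics into the absolute
# z-scores that reflect the strength of evidence against the null hypothesis.
from scipy.stats import t as t_dist, f as f_dist, norm

def t_to_z(t_value, df):
    p = 2 * t_dist.sf(abs(t_value), df)   # two-tailed p-value
    return norm.isf(p / 2)                # absolute z-score

def f_to_z(f_value, df1, df2):
    p = f_dist.sf(f_value, df1, df2)      # F-tests with df1 = 1 give the two-tailed p directly
    return norm.isf(p / 2)

print(t_to_z(2.35, 130))     # ~2.3
print(f_to_z(6.35, 1, 142))  # ~2.5 (Schwarz et al.'s Study 2, discussed below)
```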
The figure shows the distribution of z-scores. As the book focused on discoveries, most test statistics are significant with p < .05 (two-tailed), which corresponds to z = 1.96. The distribution of z-scores shows that these significant results were selected from a larger set of tests that produced non-significant results. The z-curve estimate is that the significant results are only 12% of all tests that were conducted. This is a problem.
Evidently, these results were selected from a larger set of studies that produced non-significant results. These results may not even have been published (publication bias). To estimate how replicable the significant results are, z-curve estimates the mean power of the significant results. This is similar to the R-Index, but the R-Index is only an approximate correction for inflation. Z-curve properly corrects for the selection for significance. The mean power is 46%, which implies that only half of the results would be replicated in exact replication studies. The success rate in actual replication studies is often lower and may be as low as the estimated discovery rate (Bartos & Schimmack, 2020). So, replicability is somewhere between 12% and 46%. Even if half of the results are replicable, we do not know which results are replicable and which ones are not. The chapter-based analyses provide some clues about which findings may be less trustworthy (implicit priming) and which ones may be more trustworthy (overconfidence), but the main conclusion is that the empirical basis for claims in “Thinking: Fast and Slow” is shaky.
Conclusion
In conclusion, Daniel Kahneman is a distinguished psychologist who has made valuable contributions to the study of human decision making. His work with Amos Tversky was recognized with a Nobel Memorial Prize in Economics (APA). It is surely interesting to read what he has to say about psychological topics that range from cognition to well-being. However, his thoughts are based on a scientific literature with shaky foundations. Like everybody else in 2011, Kahneman trusted individual studies to be robust and replicable because they presented a statistically significant result. In hindsight it is clear that this is not the case. Narrative literature reviews of individual studies reflect scientists’ intuitions (Fast Thinking, System 1) as much as or more than empirical findings. Readers of “Thinking: Fast and Slow” should read the book as a subjective account by an eminent psychologist, rather than an objective summary of scientific evidence. Moreover, ten years have passed, and if Kahneman wrote a second edition, it would be very different from the first one. Chapters 3 and 4 would probably just be scrubbed from the book. But that is science. It does make progress, even if progress is often painfully slow in the softer sciences.
I found this video on YouTube (Christan G.) with little information about the source of the discussion. I think it is a valuable historic document and I am reposting it here because I am afraid that it may be deleted from YouTube and be lost.
Highlights
Kahneman “We are all Mischelians.”
Kahneman “You [Mischel] showed convincingly that traits do not exist, but you also provided the most convincing evidence for stable traits [when children who delay eating a marshmallow become good students who do not drink and smoke].”
Here is Mischel’s answer to a question I always wanted him to answer. In short, self-control is not a trait. It is a skill.
Norbert Schwarz is an eminent social psychologist with an H-Index of 80 (80 articles cited more than 80 times as of January 3, 2019). Norbert Schwarz’s most cited article examined the influence of mood on life-satisfaction judgments (Schwarz & Clore, 1983). Although this article continues to be cited heavily (110 citations in 2018), numerous articles have demonstrated that the main assumption of the article (people rely on their current mood to judge their overall wellbeing) is inconsistent with the reliability and validity of life-satisfaction judgments (Eid & Diener, 2004; Schimmack & Oishi, 2005). More important, a major replication attempt failed to replicate the key finding of the original article (Yap et al., 2017).
The replication failure of Schwarz and Clore’s mood-as-information study is not surprising, given the low replication rate in social psychology in general, which has been estimated to be around 25% (OSC, 2015). The reason is that social psychologists have used questionable research practices to produce significant results, at the risk that many of these significant results are false positive results. In a ranking of the replicability of eminent social psychologists, Norbert Schwarz ranked in the bottom half (43 out of 71). It is therefore possible that other results published by Norbert Schwarz are also difficult to replicate.
EASE-OF-RETRIEVAL PARADIGM
The original article that introduced the ease of retrieval paradigm is Schwarz’s 5th most cited article.
The aim of the ease-of-retrieval paradigm was to distinguish between two accounts of frequency or probability judgments. One account assumes that people simply count examples that come to mind. Another account assumes that people rely on the ease with which examples come to mind.
The 3rd edition of Gilovich et al.’s textbook introduces the ease-of-retrieval paradigm.
An ingenious experiment by Norbert Schwarz and his colleagues managed to untangle the two interpretations (Schwarz et al., 1991). In the guise of gathering material for a relaxation-training program, students were asked to review their lives for experiences relevant to assertiveness. The experiment involved four conditions. One group was asked to list 6 occasions when they had acted assertively, and another group was asked to list 12 such examples. A third group was asked to list 6 occasions when they had acted unassertively, and the final group was asked to list 12 such examples. The requirement to generate 6 or 12 examples of either behavior was carefully chosen; thinking of 6 examples would be easy for nearly everyone, but thinking of 12 would be extremely difficult. (p. 138).
The textbook shows a table with the mean assertiveness ratings in the four conditions. The means show a picture-perfect cross-over interaction, with no information about standard deviations or statistical significance. The pattern shows higher assertiveness after recalling fewer examples of assertive behaviors and lower assertiveness after recalling fewer unassertive behaviors. This pattern of results suggests that participants relied on the ease of retrieving instances of assertive or unassertive behaviors from memory.
But there are reasons to believe that these textbook results are too good to be true. Sampling error alone would sometimes produce a less perfect pattern of results, even if the ease-of-retrieval hypothesis is true.
Reading the original article provides the valuable information that each of the four means in the textbook is based on only 10 participants (for a total of 40 participants). Results from such small samples are nowadays considered questionable. The results section also contains the valuable information that the perfect results in Study 1 were only marginally significant; that is, the risk of a false positive result was greater than 5%.
More important, weak statistical results such as p-values of .07 often do not replicate because sampling error will produce less than perfect results the next time.
The article reported several successful replication studies. However, this does not increase the chances that the results are credible. As studies with small samples often produce non-significant results, a series of studies should show some failures. If those failures are missing, it suggests that questionable research practices were used (Schimmack, 2012).
Study 2 replicated the finding of Study 1 with about 20 participants in each cell. The pattern of means was again picture perfect, and this time the interaction was significant, F(1, 142) = 6.35, p = .013. However, even this evidence is just significant, and results with p-values of .01 often fail to replicate.
Study 3 again replicated the interaction with fewer than 10 participants in each condition and a just significant result, F(1, 70) = 4.09, p = .030.
Given the small sample sizes, it is unlikely that three studies would produce support for the ease-of-retrieval hypothesis without any replication failures. The median probability of producing a significant result (power) is 59% for p < .05 and 70% for p < .10, and these estimates are based on probably inflated effect sizes. Thus, the chance of obtaining three significant results with p < .10 and 70% power is less than .70 * .70 * .70 = 34%. Maybe Schwarz and colleagues were just lucky, but maybe they also used questionable research practices, which is particularly easy in small samples (Simmons, Nelson, & Simonsohn, 2011).
Using the Replicability Index (Schimmack, 2015), it is reasonable to expect a replication failure rather than a replication success in a replication attempt without QRPs (R-Index = 70 – 30 = 40).
A low R-Index does not mean that a theory is false or that a replication study will definitely fail. However, it does raise concern about the credibility of textbook findings that present the results of Study 1 as solid empirical evidence.
Given the large number of citations, there are many studies that have also reported ease-of-retrieval effects. The problem is that social psychology journals only report successful studies. As a result, these replication studies do not test the ease-of-retrieval hypothesis, and results are inflated by selective publication of significant results. This is confirmed in a recent meta-analysis that found evidence of publication bias in ease-of-retrieval studies.
Although the meta-analysis suggests that there still is an effect after correcting for publication bias, corrections for publication bias are difficult and may still overestimate the effect. What is needed is a trustworthy replication study in a large sample.
In 2012 I learned about such a replication study at a conference about the replication crisis in social psychology. One of the presenters was Jon Krosnick, who reported about a replication project in a large, national representative sample. 11 different studies were replicated and all but one produced a significant result (recalled from memory) . The one replication failure was the ease-of-retrieval paradigm. The data of this study and several follow-up studies with large samples were included in the Weingarten and Hutchinson meta-analysis.
The results show that these replication attempts failed to reproduce the effect despite much larger samples that could detect even smaller effects.
Interestingly, the 5th edition of the textbook (Gilovich et al., 2019) no longer mentions Schwarz et al.’s ingenious ease-of-retrieval paradigm. Although I do not know why this study was removed, the deletion of this study suggests that the authors lost confidence in the effect.
Broader Theoretical Considerations
There are other problems with the ease-of-retrieval paradigm. Most important, it does not examine how respondents answer questions about their personality under naturalistic conditions, without explicit instructions to recall a specified number of concrete examples.
Try to recall 12 examples when you were helpful.
Could you do this in less than 10 seconds? If so, you are a very helpful person, but even very helpful people will typically need more time than that. However, personality judgments or other frequency and probability judgments are often made in under 5 seconds. Thus, even if ease of retrieval is one way to make social judgments, it is not the typical way social judgments are made. Thus, it remains an open question how participants are able to make fast and partially accurate judgments of their own and other people’s personality, the frequency of their emotions, or other judgments.
Ironically, an article published in the same year as Schwarz et al.’s article made this point. However, this article was published in a cognitive journal, which social psychologists rarely cite. Overall, this article has been cited only 15 times. Maybe the loss of confidence in the ease-of-retrieval paradigm will generate renewed interest in models of social judgments that do not require retrieval of actual examples.
In 2002, Daniel Kahneman was awarded the Nobel Prize for Economics. He received the award for his groundbreaking work on human irrationality in collaboration with Amos Tversky in the 1970s.
In 1999, Daniel Kahneman was the lead editor of the book “Well-Being: The foundations of Hedonic Psychology.” Subsequently, Daniel Kahneman conducted several influential studies on well-being.
The aim of the book was to draw attention to hedonic or affective experiences as an important, if not the sole, contributor to human happiness. He called for a return to Bentham’s definition of a good life as a life filled with pleasure and devoid of pain (a.k.a. displeasure).
The book was co-edited by Norbert Schwarz and Ed Diener, who both contributed chapters to the book. These chapters make contradictory claims about the usefulness of life-satisfaction judgments as an alternative measure of a good life.
Ed Diener is famous for his conception of wellbeing in terms of a positive hedonic balance (lots of pleasure, little pain) and high life-satisfaction. In contrast, Schwarz is known as a critic of life-satisfaction judgments. In fact, Schwarz and Strack’s contribution to the book ended with the claim that “most readers have probably concluded that there is little to be learned from self-reports of global well-being” (p. 80).
To a large part, Schwarz and Strack’s pessimistic view is based on their own studies that seemed to show that life-satisfaction judgments are influenced by transient factors such as current mood or priming effects.
“the obtained reports of SWB are subject to pronounced question-order effects because the content of preceding questions influences the temporary accessibility of relevant information” (Schwarz & Strack, p. 79).
There is only one problem with this claim; it is only true for a few studies conducted by Schwarz and Strack. Studies by other researchers have produced much weaker and often not statistically reliable context effects (see Schimmack & Oishi, 2005, for a meta-analysis). In fact, a recent attempt to replicate Schwarz and Strack’s results in a large sample of over 7,000 participants failed to show the effect and even found a small, but statistically significant effect in the opposite direction (ManyLabs2).
Figure 1 summarizes the results of the meta-analysis from Schimmack and Oishi (2005), but it is enhanced by new developments in meta-analysis. The blue line in the graph regresses effect sizes (converted into Fisher-z scores) onto sampling error (1/sqrt(N − 3)). Publication bias and other statistical tricks produce a correlation between effect size and sampling error. The slope of the blue line shows clear evidence of publication bias, z = 3.85, p = .0001. The intercept (the predicted effect size when sampling error is zero) can be interpreted as a bias-corrected estimate of the real effect size. This value is close to zero and not statistically significant, z = 1.70, p = .088. The green line shows the effect size in the replication study, which was also close to zero, but statistically significant in the opposite direction. The vertical orange line shows the average effect size without controlling for publication bias. We see that this naive meta-analysis overestimates the effect size and falsely suggests that item-order effects are a robust phenomenon. Finally, the graph highlights the three results from studies by Strack and Schwarz. These results are clear outliers and lie even above the biased blue regression line. The biggest outlier was obtained by Strack et al. (1991), and this is the finding that is featured in Kahneman’s book, even though it is not reproducible and clearly inflated by sampling error. Interestingly, sampling error is also called noise, and Kahneman wrote a whole new book about the problems of noise in human judgments.
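For readers who want to see the mechanics of such a regression, here is a schematic sketch (with made-up effect sizes and sample sizes, not the actual meta-analytic data; weighting by inverse sampling variance is one common choice, not necessarily the exact specification used for the figure):

```python
# Schematic small-study regression: Fisher-z effect sizes regressed on sampling
# error (1 / sqrt(N - 3)); the intercept is the bias-corrected effect size
# estimate at zero sampling error, and a positive slope signals publication bias.
import numpy as np
import statsmodels.api as sm

fisher_z = np.array([0.25, 0.18, 0.05, 0.02, 0.30, 0.10])  # hypothetical study effects
n = np.array([40, 60, 200, 400, 30, 120])                  # hypothetical sample sizes
se = 1 / np.sqrt(n - 3)

X = sm.add_constant(se)
model = sm.WLS(fisher_z, X, weights=1 / se**2).fit()
print(model.params)   # [intercept, slope]
```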
While the figure is new, the findings were published in 2005, several years before Kahneman wrote his book “Thinking: Fast and Slow.” He was simply too lazy to use the slow process of a thorough literature search to write about life-satisfaction judgments. Instead, he relied on a fast memory search that retrieved a study by his buddy. Thus, while the chapter is a good example of biases that result from fast information processing, it is not a good chapter to tell readers about life-satisfaction judgments.
To be fair, Kahneman did inform his readers that he is biased against life-satisfaction judgments: “Having come to the topic of well-being from the study of the mistaken memories of colonoscopies and painfully cold hands, I was naturally suspicious of global satisfaction with life as a valid measure of well-being” (Kindle Locations 6796-6798). Later on, he even admits to his mistake: “Life satisfaction is not a flawed measure of their experienced well-being, as I thought some years ago. It is something else entirely” (Kindle Locations 6911-6912).
However, insight into his bias was not enough to motivate him to search for evidence that might contradict his bias. This is known as confirmation bias. Even ideal prototypes of scientists like Nobel Laureates are not immune to this fallacy. Thus, this example shows that we cannot rely on simple cues like “professor at an Ivy League university,” “respected scientist,” or “published in prestigious journals” to trust scientific claims. Scientific claims need to be backed up by credible evidence. Unfortunately, social psychology has produced a literature that is not trustworthy because studies were only published if they confirmed theories. It will take time to correct these mistakes of the past by carefully controlling for publication bias in meta-analyses and by conducting pre-registered studies that are published even if they falsify theoretical predictions. Until then, readers should be skeptical about claims based on psychological ‘science,’ even if they are made by a Nobel Laureate.
Richard Nisbett has been an influential experimental social psychologist. His co-authored book on faulty human information processing (Nisbett & Ross, 1980) provided the foundation of experimental studies of social cognition (Fiske & Taylor, 1984). Experiments became the dominant paradigm in social psychology, with success stories like Daniel Kahneman’s Nobel Prize in Economics and embarrassments like Diederik Stapel’s numerous retractions because he fabricated data for articles published in experimental social psychology (ESP) journals.
The Stapel Debacle raised questions about the scientific standards of experimental social psychology. The reputation of Experimental Social Psychology (ESP) also took a hit when the top journal of ESP research published an article by Daryl Bem that claimed to provide evidence for extra-sensory perception. For example, in one study extraverts seemed to be able to foresee the location of pornographic images before a computer program determined the location. Subsequent analyses of his results and data revealed that Daryl Bem did not use scientific methods properly and that the results provide no credible empirical evidence for his claims (Francis, 2012; Schimmack, 2012; Schimmack, 2018).
More detrimental for the field of experimental social psychology was that Bem’s carefree use of scientific methods is common in experimental social psychology; in part because Bem wrote a chapter that instructed generations of experimental social psychologists how they could produce seemingly perfect results. The use of these questionable research practices explains why over 90% of published results in social psychology journals support authors’ hypotheses (Sterling, 1959; Sterling et al., 1995).
Since 2011, some psychologists have started to put the practices and results of experimental social psychologists under the microscope. The most impressive evidence comes from a project that tried to replicate a representative sample of psychological studies (Open Science Collaboration, 2015). Only a quarter of social psychology experiments could be replicated successfully.
The response by eminent social psychologists to these findings has been a classic case of motivated reasoning and denial. For example, in an interview for the Chronicle of Higher Education, Nisbett dismissed these results by attributing them to problems of the replication studies.
Nisbett has been calculating effect sizes since before most of those in the replication movement were born. And he’s a skeptic of this new generation of skeptics. For starters, Nisbett doesn’t think direct replications are efficient or sensible; instead he favors so-called conceptual replication, which is more or less taking someone else’s interesting result and putting your own spin on it. Too much navel-gazing, according to Nisbett, hampers professional development. “I’m alarmed at younger people wasting time and their careers,” he says. He thinks that Nosek’s ballyhooed finding that most psychology experiments didn’t replicate did enormous damage to the reputation of the field, and that its leaders were themselves guilty of methodological problems. And he’s annoyed that it’s led to the belief that social psychology is riddled with errors. “How do they know that?”, Nisbett asks, dropping in an expletive for emphasis.
In contrast to Nisbett’s defensive response, Nobel Laureate Daniel Kahneman has expressed concerns about the replicability of BS-ESP results that he reported in his popular book “Thinking: Fast and Slow.” He also wrote a letter to experimental social psychologists suggesting that they should replicate their findings. It is telling that, several years later, eminent experimental social psychologists have not published self-replications of their classic findings.
Nisbett also ignores that Nosek’s findings are consistent with statistical analyses that show clear evidence of questionable research practices and evidence that published results are too good to be true (Francis, 2014). Below I present new evidence about the credibility of experimental social psychology based on a representative sample of published studies in social psychology.
How Replicable are Between-Subject Social Psychology Experiments (BS-ESP)?
Motyl and colleagues (2017) coded hundreds of articles and over one-thousand published studies in social psychology journals. They recorded information about the type of study (experimental or correlational), the design of the study (within subject vs. between-subject) and about the strength of an effect (as reflected in test statistics or p-values). After crunching the numbers, their results showed clear evidence of publication bias, but also evidence that social psychologists published some robust and replicable results.
In a separate blog post, I agreed with this general conclusion. However, Motyl et al.’s assessment was based on a broad range of studies, including correlational studies with large samples. Few people doubt that these results would replicate, but experimental social psychologists tend to dismiss these findings because they are correlational (Nisbett, 2016).
The replicability of BS-ESP results is more doubtful because these studies often used between-subject designs with small samples, which makes it difficult to obtain statistically significant results. For example, John Bargh used only 30 participants (15 per condition) for his famous elderly priming study that failed to replicate.
I conducted a replicability analysis of BS-ESP results based on a subset of studies in Motyl’s dataset. I selected only studies with between-subject experiments in which participants were randomly assigned to different conditions and the focal test had one degree of freedom. The studies could be comparisons of two groups or a 2 x 2 design, which is also often used by experimental social psychologists to demonstrate interaction (moderator) effects. I also excluded studies with fewer than 20 participants per condition; such studies should not have been published in the first place because parametric tests require a minimum of 20 participants to be robust (Cohen, 1994).
There were k = 314 studies that fulfilled these criteria. Two-hundred-seventy-eight of these studies (89%) were statistically significant at the standard criterion of p < .05. Including marginally significant and one-sided tests, 95% were statistically significant. This success rate is consistent with Sterling’s results in the 1950s and 1990s and the results of the OSC project. For the replicability analysis, I focused on the 278 results that met the standard criterion of statistical significance.
First, I compared the mean effect size without correcting for publication bias with the bias-corrected effect size estimate using the latest version of puniform (van Aert, 2018). The mean effect size of the k = 278 studies was d = .64, while the bias-corrected puniform estimate was more than 50% lower, d = .30. The finding that actual effect sizes are approximately half of the published effect sizes is consistent with the results based on actual replication studies (OSC, 2015).
Next, I used z-curve (Brunner & Schimmack, 2018) to estimate mean power of the published studies based on the test statistics reported in the original articles. Mean power predicts the success rate if the 278 studies were exactly replicated. The advantage of using a statistical approach is that it avoids problems of carrying out exact replication studies, which is often difficult and sometimes impossible.
Figure 1. Histogram of the strength of evidence against the null-hypothesis in a representative sample of (k = 314) between-subject experiments in social psychology.
The Nominal Test of Insufficient Variance (Schimmack, 2015) showed that 61% of results fell within the range from z = 1.96 to z = 2.80, when a maximum of only 32% is expected from a representative sample of independent tests. The z-statistic of 10.34 makes it clear that this is not a chance finding. Visual inspection shows a sharp drop at a z-score of 1.96, which corresponds to the significance criterion, p < .05, two-tailed. Taken together, these results provide clear evidence that published results are not representative of all studies that are conducted by experimental psychologists. Rather, published results are selected to provide evidence in favor of authors’ hypotheses.
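A back-of-the-envelope reconstruction of that test statistic, using only the rounded percentages reported above, is a simple one-proportion z-test (my own approximation, not the exact computation):

```python
# Rough reconstruction: compare the observed share of z-values between 1.96 and
# 2.80 (61% of 278 significant results) against the maximum of 32% expected
# without selection, using a one-proportion z-test.
import math

k, n = round(0.61 * 278), 278        # observed count in the 1.96-2.80 window
p_obs, p_exp = k / n, 0.32
se = math.sqrt(p_exp * (1 - p_exp) / n)
z = (p_obs - p_exp) / se
print(z)   # ~10, close to the reported z = 10.34
```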
The mean power of statistically significant results is 32% with a 95%CI ranging from 23% to 39%. This means that many of the studies that were published with a significant result would not reproduce a significant result in an actual replication attempt. With an estimate of 32%, the success rate is not reliably higher than the success rate for actual replication studies in the Open Science Reproducibility Project (OSC, 2015). Thus, it is clear that the replication failures are the result of shoddy research practices in the original studies rather than problems of exact replication studies.
The estimate of 32% is also consistent with my analysis of social psychology experiments in Bargh’s book “Before you know it” that draws heavily on BS-ESP results. Thus, the present results replicate previous analyses based on a set of studies that were selected by an eminent experimental social psychologist. Thus, replication failures in experimental social psychology are highly replicable.
A new statistic under development is the maximum false discovery rate; that is, the percentage of significant results that could be false positives. It is based on the fit of z-curves with different proportions of false positives (z = 0). The maximum false discovery rate is 70% with a 95%CI ranging from 50% to 85%. This means, the data are so weak that it is impossible to rule out the possibility that most BS-ESP results are false.
Conclusion
Nisbett questioned how critics know that ESP is riddled with errors. I answered his call for evidence by presenting a z-curve analysis of a representative set of BS-ESP results. The results are consistent with findings from actual replication studies. There is clear evidence of selection bias and consistent evidence that the majority of published BS-ESP results cannot be replicated in exact replication studies. Nisbett dismisses this evidence and attributes replication failures to problems with the replication studies. This attribution is a classic example of a self-serving attribution error; that is, the tendency to blame others for negative outcomes.
The low replicability of BS-ESP results is not surprising, given several statements by experimental social psychologists about their research practices. For example, Bem (2001) advised students that it is better to “err on the side of discovery” (translation: a fun false finding is better than no finding). He also shrugged off replication failures of his ESP studies with a comment that he doesn’t care whether his results replicate or not.
“I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?’” (Daryl J. Bem, in Engber, 2017)
A similar attitude is revealed in Baumeister’s belief that personality psychology lost its appeal when it became more rigorous, because a “gain in rigor was accomplished by a loss in interest.” I agree that fiction can be interesting, but science without rigor is science fiction.
Another social psychologist (I forgot the name) once bragged openly that he was able to produce significant results in 30% of his studies and compared this to a high batting average in baseball. In baseball it is indeed impressive to hit a fast, small ball with a bat 1 out of 3 times. However, I prefer to compare the success rates of BS-ESP researchers to the performance of my students on an exam, where a 30% success rate earns them a straight F. And why would anybody watch a movie that earned a 32% average rating on rottentomatoes.com, unless they are watching it because watching bad movies can be fun (e.g., “The Room”)?
The problems of BS-ESP research are by no means new. Tversky and Kahneman (1971) tried to tell psychologists decades ago that studies with low power should not be conducted. Despite decades of warnings by methodologists (Cohen, 1962, 1994), social psychologists have blissfully ignored these warnings and continue to publish meaningless statistically significant results while hiding non-significant ones. In doing so, they committed the ultimate attribution error. They attributed the results of their studies to the behavior of their participants, while the results actually depended on their own biases that determined which studies they selected for publication.
Many experimental social psychologists prefer to ignore evidence that their research practices are flawed and published results are not credible. For example, Bargh did not mention actual replication failures of his work in his book, nor did he mention that Nobel Laureate Daniel Kahneman wrote him a letter in which he described Bargh’s work as “the poster child for doubts about the integrity of psychological research.” Several years later, it is fair to say that evidence is accumulating that experimental social psychology lacks scientific integrity. It is often said that science is self-correcting. Given the lack of self-correction by experimental social psychologists, it logically follows that it is not a science; at least it doesn’t behave like one.
I doubt that members of the Society for Experimental Social Psychology (SESP) will respond to this new information any differently from the way they responded to criticism of the field in the past seven years; that is, with denial, name calling (“Shameless Little Bullies”, “Method Terrorists”, “Human Scum”), or threats of legal actions. In my opinion, the biggest failure of SESP is not the way its members conducted research in the past, but their response to valid scientific criticism of their work. As Karl Popper pointed out “True ignorance is not the absence of knowledge, but the refusal to acquire it.” Ironically, the unwillingness of experimental social psychologists to acquire inconvenient self-knowledge provides some of the strongest evidence for biases and motivated reasoning in human information processing. If only these biases could be studied in BS experiments with experimental social psychologists as participants.
Caution
The abysmal results for experimental social psychology should not be generalized to all areas of psychology. The OSC (2015) report examined the replicability of psychology, and found that cognitive studies replicated much better than experimental social psychology results. Motyl et al. (2017) found evidence that correlational results in social and personality psychology are more replicable than BS-ESP results.
It is also not fair to treat all experimental social psychologists alike. Some experimental social psychologists may have used the scientific method correctly and published credible results. The problem is to know which results are credible and which results are not. Fortunately, studies with stronger evidence (lower p-values or higher z-scores) are more likely to be true. In actual replication attempts, studies with z-scores greater than 4 had an 80% chance to be successfully replicated (OSC, 2015). I provide a brief description of results in Motyl et al.’s dataset that met this criterion in the Appendix. However, it is impossible to distinguish honest results with weak evidence from results that were manipulated to show significance. Thus, over 50 years of experimental social psychology have produced many interesting ideas without empirical evidence for most of them. Sadly, even today articles are published that are no more credible than those published 10 years ago. If there can be failed sciences, experimental social psychology is one of them. Maybe it is time to create a new society for social psychologists who respect the scientific method. I suggest calling it the Society of Ethical Social Psychologists (SESP), and that it adopts the ethics code of the American Physical Society (APS).
Fabrication of data or selective reporting of data with the intent to mislead or deceive is an egregious departure from the expected norms of scientific conduct, as is the theft of data or research results from others.
APPENDIX
Journal of Experimental Social Psychology
Klein, W. M. (2003). Effects of objective feedback and “single other” or “average other” social comparison feedback on performance judgments and helping behavior. Personality and Social Psychology Bulletin, 29(3), 418-429.
Df1 = 1, Df2 = 44, N = 48, F = 21.71
In this study, participants have a choice to give easy or difficult hints to a confederate after performing on a different task. The strong result shows an interaction effect between performance feedback and the way participants are rewarded for their performance. When their reward is contingent on the performance of the other students, participants gave easier hints after they received positive feedback and harder hints after they received negative feedback.
Phillips, K. W. (2003). The effects of categorically based expectations on minority influence: The importance of congruence. Personality and Social Psychology Bulletin, 29(1), 3-13.
Df1 = 1, Df2 = 155, N = 158, F = 97.02
This strong effect shows that participants were surprised when an in-group member disagreed with their opinion in a hypothetical scenario in which they made decisions with an in-group and an out-group member, z = 8.67.
Seta, J. J., Seta, C. E., & McElroy, T. (2003). Attributional biases in the service of stereotype maintenance: A schema-maintenance through compensation analysis. Personality and Social Psychology Bulletin, 29(2), 151-163.
Df1 = 1, Df2 = 101, N = 112, F = 39.80
The strong effect in Study 1 reflects different attributions of a minister’s willingness to volunteer for a charitable event. Participants assumed that the motives were more selfish and different from motives of other ministers if they were told that the minister molested a young boy and sold heroin to a teenager. These effects were qualified by a Target Identity × Inconsistency interaction, F(1, 101) = 39.80, p < .001. This interaction was interpreted via planned comparisons. As expected, participants who read about the aberrant behaviors of the minister attributed his generosity in volunteering to the dimension that was more inconsistent with the dispositional attribution of ministers—impressing others (M = 2.26)—in contrast to the same target control participants (M = 4.62), F(1, 101) = 34.06, p < .01.
Trope, Y., Gervey, B., & Bolger, N. (2003). The role of perceived control in overcoming defensive self-evaluation. Journal of Experimental Social Psychology, 39(5), 407-419.
Df1 = 1, Df2 = 176, N = 190, F = 25.61
Study 2 manipulated perceptions of changeability of attributes and valence of feedback. A third factor was self-reported abilities. The two-way interaction showed that participants were more interested in feedback about weaknesses when attributes were perceived as changeable, z = 4.74. However, the critical test was the three-way interaction with self-perceived abilities, which was weaker and not part of a fully experimental design, F(1, 176) = 6.34, z = 2.24.
Brambilla, M., Sacchi, S., Pagliaro, S., & Ellemers, N. (2013). Morality and intergroup relations: Threats to safety and group image predict the desire to interact with outgroup and ingroup members. Journal of Experimental Social Psychology, 49(5), 811-821.
Df1 = 1, Df2 = 78, N = 83, F = 89.02
Df1 = 1, Df2 = 99, N = 108, F = 42.76
Df1 = 1, Df2 = 156, N = 165, F = 134.67
Three strong results come from this study of morality (zs > 5). In hypothetical scenarios, participants were presented with moral and immoral targets and asked how they intended to interact with them. All studies showed that participants were less willing to engage with immoral targets. Other characteristics that were manipulated had no effect.
Mason, M. F., Lee, A. J., Wiley, E. A., & Ames, D. R. (2013). Precise offers are potent anchors: Conciliatory counteroffers and attributions of knowledge in negotiations. Journal of Experimental Social Psychology, 49(4), 759-763.
Df1 = 1, Df2 = 244, N = 247, F = 19.29
This study showed that recipients of a rounded offer make larger adjustments to the offer than recipients of more precise offers, z = 4.15. This effect was demonstrated in several studies. This is the strongest evidence, in part, because the sample size was the largest. So, if you put your house up for sale, you may suggest a sales price of $491,307 rather than $500,000 to get a higher counteroffer.
Pica, G., Pierro, A., Bélanger, J. J., & Kruglanski, A. W. (2013). The Motivational Dynamics of Retrieval-Induced Forgetting A Test of Cognitive Energetics Theory. Personality and Social Psychology Bulletin, 39(11), 1530-1541.
Df1 = 1, Df2 = 93, N = 94, F = 208.16
The strong effect for this analysis is a within-subject main effect. The critical effect was a mixed-design three-way interaction, which was weaker: “Of greatest importance, the three-way interaction between retrieval-practice repetition, need for closure, and OSPAN was significant, β = −.24, t = −2.25, p < .05.”
Preston, J. L., & Ritter, R. S. (2013). Different effects of religion and God on prosociality with the ingroup and outgroup. Personality and Social Psychology Bulletin, ###.
Df1 = 1, Df2 = 113, N = 127, F = 23.22
This strong effect, z = 4.59, showed that participants thought a religious leader would want them to help a family that belongs to their religious group, whereas God would want them to help a family that does not belong to the religious group: “These values were analyzed by one-way ANOVA on Condition (God/Leader), F(1, 113) = 23.22, p < .001, partial η2 = .17. People expected the religious leader would want them to help the religious ingroup family (M = 6.71, SD = 2.67), whereas they expected God would want them to help the outgroup family (M = 4.39, SD = 2.48).” I find the dissociation between God and religious leaders interesting. The strength of the effect makes me believe that this is a replicable finding.
Sinaceur, M., Adam, H., Van Kleef, G. A., & Galinsky, A. D. (2013). The advantages of being unpredictable: How emotional inconsistency extracts concessions in negotiation. Journal of Experimental Social Psychology, 49(3), 498-508.
Df1 = 1, Df2 = 151, N = 152, F = 25.93
Study 2 produced a notable effect of manipulating emotional inconsistency on self-ratings of “sense of unpredictability” (z = 4.88). However, the key dependent variable was concession making. The effect on concession making was not as strong, F(1, 151) = 7.29, z = 2.66.
Newheiser, A. K., & Barreto, M. (2014). Hidden costs of hiding stigma: Ironic interpersonal consequences of concealing a stigmatized identity in social interactions. Journal of Experimental Social Psychology, 52, 58-70.
Df1 = 1, Df2 = 54, N = 57, F = 26.9361
Participants in this study were either told to reveal their major of study or to falsely report that they are medical students. The strong effect shows that participants who were told to lie reported feeling less authentic, z = 4.51. The effect on a second dependent variable, “belonging” (I feel accepted) was weaker, t(54) = 2.54, z = 2.20.
PERSONALITY AND SOCIAL PSYCHOLOGY BULLETIN
Simon, B., & Stürmer, S. (2003). Respect for group members: Intragroup determinants of collective identification and group-serving behavior. Personality and Social Psychology Bulletin, 29(2), 183-193.
Df1 = 1, Df2 = 159, N = 163, F = 48.75
The strong effect in this study shows a main effect of respectful vs. disrespectful feedback from a group member on collective self-esteem; that is, feeling good about being part of the group. As predicted, a 2 x 2 ANOVA revealed that collective identification (averaged over all 12 items; Cronbach’s α = .84) was stronger in the respectful-treatment condition than in the disrespectful-treatment condition, M(RESP) = 3.54, M(DISRESP) = 2.59, F(1, 159) = 48.75, p < .001.
Craig, M. A., & Richeson, J. A. (2014). More diverse yet less tolerant? How the increasingly diverse racial landscape affects white Americans’ racial attitudes. Personality and Social Psychology Bulletin, 40(6) 750–761.
Df1 = 1, Df2 = 13, N = 30, F = 41.60
Df1 = 1, Df2 = 13, N = 15, F = 36.00
Two strong effects are based on studies that aimed to manipulate responses to the race IAT with stories about shifting demographics in the United States. However, the test statistics are based on the comparison of IAT scores against a value of zero and not the comparison of the experimental group and the control group. The relevant results are, t(26) = 2.07, p = .048, d = 0.84 in Study 2a and t(23) = 2.80, p = .01, d = 1.13 in Study 2b. These results are highly questionable because it is unlikely to obtain just significant results in a pair of studies. In addition, the key finding in Study 1 is also just significant, t(84) = 2.29, p = .025, as is the finding in Study 3, F(1,366) = 5.94, p = .015.
Hung, I. W., & Wyer, R. S. (2014). Effects of self-relevant perspective-taking on the impact of persuasive appeals. Personality and Social Psychology Bulletin, 40(3), 402-414.
Df1 = 1, Df2 = 288, N = 300, F = 17.42
Participants viewed a donation appeal from a charity called Pangaea. The one-page appeal described the problem of child trafficking and was either self-referential or impersonal. The strong effect was that participants in the self-referential condition were more likely to imagine themselves in the situation of the child. Participants were more likely to imagine themselves being trafficked when the appeal was self-referential than when it was impersonal (M = 4.78, SD =2.95 vs. M = 3.26, SD = 2.96 respectively), F(1, 288) = 17.42, p < .01, ω2 = .041, and this difference did not depend on the victims’ ethnicity (F < 1). Thus, taking the victims’ perspective influenced participants’ tendency to imagine themselves being trafficked without thinking about their actual similarity to the victims that were portrayed. The effect on self-reported likelihood of helping was weaker. Participants reported greater urge to help when the appeal encouraged them to take the protagonists’ perspective than when it did not (M = 5.83, SD = 2.04 vs. M = 5.18, SD = 2.31), F(1, 288) = 5.68, p < .02, ω2 = .013
Lick, D. J., & Johnson, K. L. (2014). “You Can’t Tell Just by Looking!” Beliefs in the Diagnosticity of Visual Cues Explain Response Biases in Social Categorization. Personality and Social Psychology Bulletin.
Df1 = 1, Df2 = 164, N = 166, F = 47.30
The main effect of social category dimension was significant, F(1, 164) = 47.30, p < .001, indicating that participants made more stigmatizing categorizations in the sex condition (M = 15.94, SD = 0.45) relative to the religion condition (M = 12.42, SD = 4.64). This result merely shows that participants were more likely to indicate that a woman is a woman than that an atheist is an atheist based on a photograph of a person. This finding would be expected based on the greater visibility of gender than religion.
Bastian, B., Jetten, J., Chen, H., Radke, H. R., Harding, J. F., & Fasoli, F. (2013). Losing our humanity the self-dehumanizing consequences of social ostracism. Personality and Social Psychology Bulletin, 39(2), 156-169.
df1 = 1, df2 = 51, N = 53, F = 39.77
The strong effect in this study reveals that participants rated ostracizing somebody as more immoral than a typical everyday interaction. An ANOVA, with condition as the between-subjects variable, revealed that condition had an effect on perceived immorality, F(1, 51) = 39.77, p < .001, η² = .44, indicating that participants felt the act of ostracizing another person was more immoral (M = 3.83, SD = 1.80) compared with having an everyday interaction (M = 1.37, SD = 0.81).
PSYCHOLOGICAL SCIENCE
Kifer, Y., Heller, D., Perunovic, W. Q. E., & Galinsky, A. D. (2013). The good life of the powerful: The experience of power and authenticity enhances subjective well-being. Psychological Science, 24(3), 280-288.
df1 = 1, df2 = 130, N = 132, F = 245.5489
This strong effect is a manipulation check. The focal test provides much weaker evidence for the claim that authenticity increases wellbeing. The manipulation was successful. Participants in the high-authenticity condition (M = 4.57, SD = 0.62) reported feeling more authentic than those in the low-authenticity condition (M = 2.70, SD = 0.74), t(130) = 15.67, p < .01, d = 2.73. As predicted, participants in the high-authenticity condition (M = 0.38, SD = 1.99) reported higher levels of state SWB than those in the low-authenticity condition (M = −0.46, SD = 2.12), t(130) = 2.35, p < .05, d = 0.40.
Lerner, J. S., Li, Y., & Weber, E. U. (2012). The financial costs of sadness. Psychological Science, 24(1), 72–79.
df1 = 1, df2 = 78, N = 202, F = 45.1584
Again, the strong effect is a manipulation check. The emotion-induction procedure was effective in both magnitude and specificity. Participants in the sad-state condition reported feeling more sadness (M = 3.72) than neutrality (M = 1.66), t(78) = 6.72, p < .0001. The critical test that sadness leads to financial losses produced a just-significant result. Sad participants were more impatient (mean = .21, median = .04) than neutral participants (mean = .28, median = .19; Mann-Whitney z = 2.04, p = .04).
Tang, S., Shepherd, S., & Kay, A. C. (2014). Do difficult decisions motivate belief in fate? A test in the context of the 2012 U.S. presidential election. Psychological Science, 25(4), 1046-1048.
df1 = 1, df2 = 180, N = 200, F = 102.8196
A manipulation check confirmed that participants in the similar-candidates condition saw the candidates as more similar (M = 4.41, SD = 0.80) than did participants in the different-candidates condition (M = 3.24, SD = 0.76), t(180) = 10.14, p < .001. The critical test was not statistically significant. As predicted, participants in the similar-candidates condition reported greater belief in fate (M = 3.45, SD = 1.46) than did those in the different-candidates condition (M = 3.04, SD = 1.44), but the difference fell short of significance, t(180) = 1.92, p = .057.
Caruso, E. M., Van Boven, L., Chin, M., & Ward, A. (2013). The temporal Doppler effect: When the future feels closer than the past. Psychological Science, 24(4), 530–536.
The strong effect revealed that participants perceived an event in the future (Valentine’s Day) as closer to the present than the same event in the past. Valentine’s Day was perceived to be closer 1 week before it happened than 1 week after it happened, t(321) = 4.56, p < .0001, d = 0.51 (Table 1). The effect met the criterion of z > 4 because the sample size was large, N = 323, indicating that experimental social psychology could benefit from larger samples to produce more credible results.
Galinsky, A. D., Wang, C. S., Whitson, J. A., Anicich, E. M., Hugenberg, K., & Bodenhausen, G. V. (2013). The reappropriation of stigmatizing labels: The reciprocal relationship between power and self-labeling. Psychological Science, 24(10).
The strong effect showed that participants rated a stigmatized group as having more power over a stigmatizing label when group members used the label themselves than when others used it. The stigmatized out-group was seen as possessing greater power over the label in the self-label condition (M = 5.14, SD = 1.52) than in the other-label condition (M = 3.42, SD = 1.76), t(233) = 8.04, p < .001, d = 1.05. The effect on evaluations of the label was weaker. The label was also seen as less negative in the self-label condition (M = 5.61, SD = 1.37) than in the other-label condition (M = 6.03, SD = 1.19), t(233) = 2.46, p = .01, d = 0.33. The weakest evidence was provided for a mediation effect. The authors tested whether perceptions of the stigmatized group’s power mediated the link between self-labeling and stigma attenuation. The bootstrap analysis was significant, 95% bias-corrected CI = [−0.41, −0.01]. A boundary of 0 rather than −0.01 would have rendered this finding non-significant. The t-value for this analysis can be approximated by dividing the midpoint of the confidence interval ((−0.41 + −0.01) / 2 = −.21) by a rough estimate of sampling error (half the distance from the midpoint to a boundary: .20 / 2 = .10). The resulting signal-to-noise ratio, −.21 / .10 = −2.1, is roughly equivalent to a z-score with N = 235, so the effect can be considered just significant. This result is consistent with the weak evidence in the other 7 studies in this article.
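This back-of-the-envelope calculation can be written out in a few lines of R. It is only a sketch of the rough approximation described above (the half-width of the interval is divided by 2 rather than by 1.96); the boundaries are the values reported in the article.

    # Approximate signal-to-noise ratio from the reported 95% bootstrap CI [-0.41, -0.01].
    ci.lower <- -0.41
    ci.upper <- -0.01
    estimate <- (ci.lower + ci.upper) / 2   # midpoint of the interval: -0.21
    se       <- (ci.upper - ci.lower) / 4   # rough standard error: half-width / 2 = 0.10
    z        <- estimate / se               # approximate z-score: -2.1
    z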
JOURNAL OF PERSONALITY AND SOCIAL PSYCHOLOGY-Attitudes and Social Cognition
[This journal published Bem’s (2011) alleged evidence in support of extrasensory perceptions]
Ruder, M., & Bless, H. (2003). Mood and the reliance on the ease of retrieval heuristic. Journal of Personality and Social Psychology, 85(1), 20.
df1 = 1, df2 = 59, N = 63, F = 23.46
The strong effect in this study is based on a contrast analysis for the happy condition. Supporting the first central hypothesis, happy participants responded faster than sad participants (M = 9.81 s vs. M = 14.17 s), F(1, 59) = 23.46, p < .01. The results of the 2 × 2 ANOVA are reported only subsequently: the differential impact of the number of arguments on happy versus sad participants is reflected in a marginally significant interaction, F(1, 59) = 3.59, p = .06.
Fazio, R. H., Eiser, J. R., & Shook, N. J. (2004). Attitude formation through exploration: Valence asymmetries. Journal of Personality and Social Psychology, 87(3), 293.
df1 = 1, df2 = 71, N = 76, F = 18.65
The strong effect in this study reveals that participants approached stimuli that they were told were rewarding. When they learned from experience that this was not the case, they stopped approaching them. However, when they were told that stimuli were bad, they avoided them and therefore could not learn that the initial information was wrong. This resulted in an interaction effect of prior (true or false) and actual information. More important, the predicted interaction was highly significant as well, F(1, 71) = 18.65, p < .001.
Pierro, A., Mannetti, L., Kruglanski, A. W., & Sleeth-Keppler, D. (2004). Relevance override: On the reduced impact of “cues” under high-motivation conditions of persuasion studies. Journal of Personality and Social Psychology, 86(2), 251.
df1 = 1, df2 = 42, N = 180, F = 19.85
The strong effect in this study is based on a contrast effect following an interaction effect. Consistent with the authors’ hypothesis, the first informational set exerted a stronger attitudinal impact in the low-accountability condition, simple F(1, 42) = 19.85, p < .001, M(positive) = 1.79 versus M(negative) = -.15. The pertinent two-way interaction effect was not as strong: the interaction between accountability and valence of the first informational set was significant, F(1, 84) = 5.79, p = .018. For Study 2, an effect of F(1, 43) = 18.66 was used, but the authors emphasized the theoretically more interesting four-way interaction between the independent variables, F(1, 180) = 3.922, p = .049. For Study 3, an effect of F(1, 48) = 18.55 was recorded, but the authors again emphasized the four-way interaction, which was not statistically significant, F(1, 176) = 3.261, p = .073.
Clark, C. J., Luguri, J. B., Ditto, P. H., Knobe, J., Shariff, A. F., & Baumeister, R. F. (2014). Free to punish: A motivated account of free will belief. Journal of personality and social psychology, 106(4), 501.
df1 = 1, df2 = 93, N = 95, F = 192.3769
The strong effect in this study by my prolific friend Roy Baumeister is due to the finding that participants want to punish a robber more than an aluminum-can forager. Participants also wanted to punish the robber (M = 4.98, SD = 1.07) more than the aluminum-can forager (M = 1.96, SD = 1.05), t(93) = 13.87, p = .001. Less impressive is the evidence that beliefs about free will are influenced by reading about a robber or an aluminum-can forager. Participants believed significantly more in free will after reading about the robber (M = 3.68, SD = 0.70) than the aluminum-can forager (M = 3.38, SD = 0.62), t(90) = 2.23, p = .029, d = 0.47.
Yan, D. (2014). Future events are far away: Exploring the distance-on-distance effect. Journal of Personality and Social Psychology, 106(4), 514.
df1 = 1, df2 = 118, N = 122, F = 23.70
This strong effect reflects an effect of a construal level manipulation (thinking in abstract or concrete terms) on temporal distance judgments. The results of a 2 x 2 ANOVA on this index revealed a significant main effect of the construal level manipulation only, F(1, 118) = 23.70, p < .001. Consistent with the present prediction, participants in the superordinate condition (M = 1.81) indicated a higher temporal distance than those in the subordinate condition (M = 1.58).
In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology. One key finding was that only 36% of 97 attempts to reproduce a significant result succeeded.
This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to understand why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can have several causes, and each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.
The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original articles. These predictions will only be accurate if the replication studies were close replications of the original studies. Otherwise, differences between the original study and the replication study may explain why replication studies failed.
Special Introduction
The senior author of Article 140 was John A. Bargh. Other studies by John Bargh have failed to replicate, and questions about his research were raised in an open letter by Nobel Laureate Daniel Kahneman, who prominently features Bargh’s work in his popular book “Thinking Fast and Slow.” John Bargh’s response to this criticism can be described as stoic defiance. In 2017, he published a book on his work that did not mention replication failures or concerns about the replicability of social priming research in general. A quantitative review of the book showed that much of the cited evidence was weak.
Summary of Original Article
The article “Keeping One’s Distance: The Influence of Spatial Distance Cues on Affect and Evaluation” by Lawrence E. Williams and John A. Bargh was published in Psychological Science. The article has been cited 155 times overall and 17 times in 2017.
The main hypothesis is that physical distance influences social judgments without reference to the self. To provide evidence for this hypothesis the authors reported four priming studies. Physical distance was primed by plotting points in a way that suggested physical closeness or distance. Mean differences indicated greater enjoyment of media depicting embarrassment, less emotional distress from violent media, lower estimates of the number of calories in unhealthy food, and weaker reports of emotional attachments to family members and hometowns in the distant prime condition than in the close prime condition.
Study 1
73 undergraduate students were assigned to three priming conditions (n = 24 per cell; 41 female, 32 male).
After the priming manipulation, participants were given an excerpt from the book “Good in Bed” and rated their enjoyment of an article titled “Loving a larger woman.”
An ANOVA showed a significant difference between the three groups, F(2, 67) = 3.14, p = .0497 (reported as p-rep = .88). The authors wisely did not conduct post-hoc comparisons of the close or distant condition with the control condition. Instead, they reported a significant linear trend, t(67) = 2.41, p = .019.
Although men and women may respond differently to reading an article about making love to a larger woman, gender differences were not examined or reported.
Study 2
Study 2 reduced the sample size from 73 participants to 42 participants (n = 14 per cell).
The dependent variable was liking of a violent and shocking story.
Despite the loss in statistical power, the ANOVA result was significant, F(2, 39) = 4.37, p = .019. The linear contrast was also significant, t(39) = 2.62, p = .012.
Study 3
59 community members participated in Study 3.
This time, participants rated the caloric content (but not liking?) of healthy and unhealthy foods.
The mixed-model ANOVA showed a significant interaction result, F(2,56) = 3.36, p = .042. The interaction was due to significant differences between the priming conditions for unhealthy foods, F(2,56) = 5.62, p = .006. There were no significant differences for healthy foods. No linear contrasts were reported. The Figure shows that the significant effect was mainly due to a lower estimate of calories in the distant priming condition.
Study 4
84 students participated in Study 4.
The dependent variable was the average of closeness ratings to siblings, parents, and hometown.
The ANOVA result was significant, F(2,81) = 4.97, p = .009. The linear contrast also showed a significant trend for lower closeness ratings after distance priming, t(81) = 2.86, p = .005.
Replicability Assessment
All four studies showed significant results with p-values (.019, .012, .006, and .005). These p-values correspond to z-scores of z = 2.35, 2.51, 2.75, and 2.81. Median observed power is 75%, while the success rate is 100%. The Replicability Index corrects for inflation in median observed power when the success rate is higher than median observed power. An R-Index of 50% (75% – 25% Inflation) is not terrible, but also not very reassuring. In the grading scheme of the R-Index it is a D- (in comparison, many chapters in Bargh’s book have an Index below 50% and an F).
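For readers who want to check these numbers, the calculation can be reproduced in a few lines of R. This is only a minimal sketch using the four p-values above; pnorm with the observed z-score as the mean gives observed power.

    # Convert the four reported p-values into z-scores and observed power.
    p  <- c(.019, .012, .006, .005)
    z  <- qnorm(1 - p / 2)            # two-tailed p-values to z-scores
    op <- 1 - pnorm(1.96, z)          # observed power with alpha = .05
    mop <- median(op)                 # median observed power, about .75
    inflation <- 1 - mop              # success rate (100%) minus median observed power
    r.index <- mop - inflation        # about .50
    round(c(MOP = mop, Inflation = inflation, R.Index = r.index), 2)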
More concerning is that the variance of the four z-scores is only var(z) = 0.046, when sampling error alone should produce a variance of approximately 1 across independent studies. The Test of Insufficient Variance (TIVA) shows that a variance this small or smaller would occur in four independent studies with a probability of only p = .013. This suggests that some non-random factors contributed to the observed results.
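TIVA can be computed the same way. The sketch below compares the observed variance of the z-scores against the sampling variance of 1 that would be expected for independent studies.

    # Test of Insufficient Variance for the four z-scores.
    p <- c(.019, .012, .006, .005)
    z <- qnorm(1 - p / 2)
    k <- length(z)
    v <- var(z)                                # about 0.046
    p.tiva <- pchisq(v * (k - 1), df = k - 1)  # probability of a variance this small or smaller
    round(c(var.z = v, p.tiva = p.tiva), 3)    # p = .013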
These results suggest that it is difficult to replicate the reported results because the reported effect sizes may be inflated by the use of questionable research practices.
Replication Study
Consistent with the OSC project guidelines, the authors replicated the last study in the article.
The sample size was moderately larger than in the original study (N = 133 vs. 84).
The simple procedure was reproduced exactly.
The ANOVA result was not significant, F(2,122) = 0.24, p = .79. The linear contrast was also not significant, t(122) = -0.59, p = .56.
The pattern of means showed a slightly higher closeness (to family) rating after priming closeness (M = 5.44, SD = 0.83) than in the control condition (M = 5.31, SD = 1.07). The mean for the distance priming condition was identical to the control condition (M = 5.31, SD = 1.15).
In conclusion, the replication study failed to replicate the original finding. Given the simplicity of the study, there are no obvious differences between the studies that could explain the replication failure. The most plausible explanation is that the original article reported inflated effect sizes.
The replication of Study 3 had 92 participants (vs. 59 in the original study). The study failed to replicate the interaction between distance priming and healthiness of food, F(2,85) = 0.45, p = .64. The replication of Study 4 also had 92 participants, although some were excluded because they did not have siblings or parents or could not perform the plotting task. The final sample size was N = 71 (vs. 84 in the original study). It also failed to replicate the original result, F(2,68) = 0.31.
Pashler et al. (2012) found no plausible explanation for their replication failure, but they did not consider QRPs. In contrast, the bias analyses presented here suggest that QRPs were used to report only supportive evidence for distance priming. If QRPs were used, it is not surprising that unbiased replication attempts failed to reproduce the original, inflated results.
Conclusion
Williams and Bargh (2008) proposed that a simple geometric task could prime distance or closeness and alter judgments of liking (close = like). Although they presented four studies with significant results, the evidence is not conclusive because bias tests suggest that the results are too good to be true. This impression was confirmed by a replication failure in a replication study that had slightly more power to detect the predicted effect due to a larger sample.
Although the replication failure was reported in 2015, the original article continues to be cited as if no replication failure occurred or the replication failure can be dismissed. A careful bias analysis suggests that the original results do not provide credible evidence for distance priming and the article should not be cited as evidence for it. Unless future studies with larger samples provide credible evidence for the effect, it remains doubtful that a simple geometric drawing task can alter evaluations of closeness to family members.
This replication failure has to be considered in the context of other replication failures of studies by John Bargh and of social priming studies in general. As Daniel Kahneman noted in his 2012 letter to John Bargh:
“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research… people have now attached a question mark to the field, and it is your responsibility to remove it.”
Bargh’s response to this letter may be described as willful attentional blindness. Not addressing concerns about replicability may be one way to maintain confidence in oneself in the face of criticism and doubt by others, but it is not good science. Experimental social psychologists who still believe in social priming effects like distance priming need to demonstrate replicability with high powered studies. So far, the results have not been encouraging (see failure to replicate professor priming).
P.S. If you liked this blog post, the reason is that I primed you with a closeness prime (featured image). In reality, this blog post is terrible. If you didn’t like the blog post, the priming manipulation didn’t work (ironic, isn’t it?).
“Any man whose errors take ten years to correct is quite a man.” (J. Robert Oppenheimer)
More than a century ago, Charles Darwin proposed that facial expressions of emotions not only communicate emotional experiences to others, but play an integral role in the experience of emotions themselves (Darwin, 1872). This hypothesis later became known as the facial feedback hypothesis (FFH).
Nearly a century later, a review article concluded that the empirical evidence for the facial feedback hypothesis was inconclusive and suffered from methodological problems (Buck, 1980). Most important, positive results may have been due to demand effects. That is, participants may have been aware that the manipulation of their facial muscles was intended to induce a specific emotion and responded accordingly.
Strack, Martin, and Stepper (1988) invented the pen-in–mouth-paradigm to overcome these limitations of prior studies. In this paradigm, participants are instructed to hold a pen in their mouth either with their lips or with their teeth. Holding the pen with the teeth is supposed to activate the muscles involved in smiling (zygomaticus major). Holding the pen with the lips prevents smiling. To ensure that participants are not aware of the purpose of the manipulation, the study is conducted as a between-subject study with participants being randomly assigned to either the teeth or the lips condition. Furthermore, they are given a cover story for holding the pen in the mouth.
“The study you are participating in has to do with psychomotoric coordination. More specifically, we are interested in people’s ability to perform various tasks with parts of their body that they would normally not use for such tasks…The tasks we would like you to perform are actually part of a pilot study for a more complicated experiment we are planning to do next semester to better understand this substitution process.” (p. 770).
In Study 1, participants were shown several cartoons and asked to rate how funny each cartoon was. According to FFH, inducing smiling by holding a pen with teeth should induce amusement and amplify the funniness of cartoons. The average rating of funniness was consistent with this prediction (teeth M = 5.14 vs. lips M = 4.33 on a 0 to 9 scale). A second study replicated the pen-in-mouth paradigm with amusement ratings as the dependent variable (M = 6.43 vs. 5.40).
Strack et al.’s article has been widely cited as conclusive evidence for FFH (cf. Wagenmakers et al., 2016) and the article has been featured prominently in textbooks (cf. Coles & Larsen, 2017) and popular psychology books (cf. Schimmack, 2017). However, in 2011 psychologists encountered a crisis of confidence after some classic findings could not be replicated and Nobel Laureate Daniel Kahneman asked for replications of classic studies (Kahneman, 2012).
Wagenmakers et al. (2016) answered this call using the newly established format of a Registered Replication Report (Simons & Holcombe, 2014). In this format, original authors, replication authors, and editors work together to design the replication study and the original study is replicated across several labs. Wagenmakers et al. (2016) reported the results of 17 preregistered replications of Strack et al.’s Study 1. The minimum sample size for each study was N = 50. Actual sample sizes ranged from N = 87 to 139. These sample sizes do not provide sufficient statistical power to replicate the effect in each study. However, a meta-analysis of all 17 studies ensures a high probability of replicating the original finding even with a statistically small effect size. Nevertheless, the replication study failed to provide evidence for FFH.
Some psychologists interpreted these results as challenging Darwin’s century old hypothesis that facial expressions play an important role in emotional experiences. After all, results based on the best test of the theory that were widely used to support FFH could not be replicated. However, some psychologists raised concerns about the replication study. Reber (2016) compared psychology to chemistry. For an experiment in chemistry to work as predicted, chemists need to use pure chemicals. Even small impurities may cause failures to demonstrate chemical processes. Reber suggested that the replication failure of the FFH could have been caused by “impurities” in the replication study. This line of argumentation is dangerous because it can lead to circular reasoning. That is, if a study provides evidence for a theoretically predicted effect, the study was pure, but if a study fails to provide evidence for the effect, the study was impure. Accordingly, a theoretical prediction can never be falsified.
It is also possible to question the results of the original study. Schimmack (2017) pointed out that both studies failed to reach the standard criterion of statistical significance in a two-tailed test and were only significant in a one-tailed test. These results are often called marginally significant. Two marginally significant results are suggestive, but do not provide conclusive evidence for an effect. Thus, these results were prematurely accepted as evidence for FFH, when additional evidence was needed.
It would be surprising if nobody had ever tried to replicate the pen-in-mouth paradigm given its prominence and theoretical importance. In fact, numerous published articles have used the paradigm to replicate and extend the original findings (see Appendix). We conducted a replicability analysis of studies that used the pen-in-mouth paradigm prior to the controversial registered replication report. If previous studies consistently found evidence for FFH, it suggests that the replication report studies were impure. However, if previous studies also had difficulties demonstrating the effect, it suggests that the pen-in-mouth paradigm does not reliably produce facial feedback effects.
Replicability Analysis
A replicability analysis differs from conventional meta-analyses in two ways. First, the goal of a replicability analysis is not to estimate an effect size. Instead, the goal is to estimate the average replicability of a set of studies, where replicability is defined as the probability of obtaining a statistically significant result (Schimmack, 2014). Second, a replicability analysis examines whether a set of studies shows signs of publication bias by comparing the percentage of significant results to the average statistical power of the studies. In an unbiased set of studies, the success rate should match median observed power. However, if publication bias is present, the success rate is higher than median observed power can justify (Schimmack, 2012, 2014).
Median observed power is only an estimate of average power, and the estimate is imprecise with small sets of studies. However, precision increases as the number of studies increases. For this replicability analysis, we added studies in chronological order (a cumulative analysis). The cumulative analysis shows how strong the evidence for FFH was in the beginning and how it changed over time.
We used three search strategies to retrieve original articles that used the pen-in-mouth paradigm. First, we conducted full-text searches of social psychology journals looking for the word pen. Second, we searched for articles that cited Strack et al.’s seminal study that introduced the pen-in-mouth paradigm. Third, articles that were found using the first two strategies were searched for references to additional studies. We found 12 published articles with 19 independent studies that used the pen-in-mouth paradigm, including the original pair of studies.
For each independent study, we converted reported test statistics into a z-score as a standardized measure of the strength of evidence for FFH. If the means were not in the predicted direction, the z-scores were negative. We then computed observed power with z = 1.96 (p < .05, two-tailed) as criterion, unless the authors interpreted a marginally significant result as evidence for FFH; in this case, we used z = 1.65 as criterion for significance. The formula to convert z-scores into observed power is simply
observed power = 1 - pnorm(criterion.z, obs.z), where obs.z is the observed z-score and criterion.z is 1.96 (two-tailed p < .05) or 1.65 (one-tailed p < .05)
The outcome of each study is dichotomous with 0 = not significant and 1 = significant. Averaging this outcome across studies yields the success rate.
We then compute an inflation index as the difference between the success rate and median observed power. In the long run, these two values should be equivalent if there is no publication bias. If there is publication bias, the inflation index is positive and reflects the amount of publication bias.
Finally, we computed the Replicability Index (R-Index; Schimmack, 2014). As publication bias also inflates median observed power, we subtracted the inflation index from median observed power. The result is the R-Index. An R-Index of 50% or less suggests that it would be difficult to replicate a finding with the typical sample sizes in the set of studies.
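A minimal sketch of this cumulative analysis in R is shown below. The vectors z and sig stand for the z-scores and significance indicators listed in Table 1, the function name cumulative.r.index is just illustrative, and criterion would be set to 1.65 for studies that relied on one-tailed tests.

    # Cumulative replicability analysis: add studies in chronological order and
    # update median observed power, success rate, inflation, and R-Index.
    cumulative.r.index <- function(z, sig, criterion = 1.96) {
      op <- 1 - pnorm(criterion, z)          # observed power of each study
      t(sapply(seq_along(z), function(i) {
        mop <- median(op[1:i])               # median observed power up to study i
        sr  <- mean(sig[1:i])                # success rate up to study i
        c(MOP = mop, SR = sr, Inflation = sr - mop, R.Index = mop - (sr - mop))
      }))
    }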
Results
Table 1 shows the results. The original studies were both successful with the weaker criterion value of z = 1.65 that was used by the authors. However, both studies barely met this criterion which leads to a high inflation index and a low replicability index. As predicted by the low R-Index, the next study produced a non-significant result which brought the success rate more in line with median observed power. However, the next three studies also failed to demonstrate the effect and median observed power dropped to .07. From study 9 till study 19, median observed power stays at this level, while the success rate remains above 30%, indicating the influence of publication bias.
For the total set of 19 studies, median observed power of 7% implies a 93% probability of a non-significant result for each study. With this probability, more than the observed 53% (10 / 19) non-significant results would be expected with a probability greater than 99.99% (Schimmack, 2012). Thus, there is strong evidence of publication bias, even though the estimated median power is only 7%. Combining the very low estimate of median power with a positive inflation index yields a negative R-Index. Thus, it is not surprising that a set of studies without publication bias failed to replicate the original effect. This finding is entirely consistent with the cumulative evidence from previous studies, once publication bias is taken into account. In fact, the cumulative analysis shows that there was never convincing evidence for the effect (R-Index < 50).
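This binomial logic can be checked directly in R with the values reported above.

    # With only 7% power per study, 9 or more significant results in 19 studies
    # would be extremely unlikely without selection for significance.
    pbinom(8, size = 19, prob = .07, lower.tail = FALSE)    # far below .0001
    # Equivalently, more than 10 non-significant results would be expected with
    # probability greater than .9999.
    pbinom(10, size = 19, prob = .93, lower.tail = FALSE)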
No.  Year  A#  S#     z    OP  Sig.   MOP    SR  Inf.  R-Index
 1   1988   1   1   1.83  0.57    1  0.57  1.00  0.43    0.14
 2   1988   1   2   1.76  0.54    1  0.56  1.00  0.44    0.12
 3   2002   2   1   0.52  0.07    0  0.54  0.67  0.12    0.42
 4   2006   3   1    < 1  0.07    0  0.31  0.50  0.19    0.12
 5   2006   3   2  -2.38  0.00    0  0.07  0.40  0.33   -0.25
 6   2008   4   1  -1.17  0.00    0  0.07  0.33  0.26   -0.19
 7   2009   5   1   2.18  0.59    1  0.07  0.43  0.35   -0.28
 8   2012   6   1   2.29  0.63    1  0.31  0.50  0.19    0.12
 9   2013   7   1   0.17  0.04    0  0.07  0.44  0.37   -0.29
10   2013   8   1    < 1  0.07    0  0.07  0.40  0.33   -0.25
11   2013   8   2    < 1  0.07    0  0.07  0.36  0.29   -0.22
12   2013   8   3    < 1  0.07    0  0.07  0.33  0.26   -0.19
13   2013   8   4    < 1  0.07    0  0.07  0.31  0.24   -0.16
14   2013   8   5   2.18  0.59    1  0.07  0.36  0.28   -0.21
15   2013   8   6   2.96  0.84    1  0.07  0.40  0.33   -0.26
16   2014   9   1   1.90  0.60    1  0.07  0.44  0.36   -0.29
17   2014  10   1   2.43  0.68    1  0.07  0.47  0.40   -0.32
18   2015  11   1   0.25  0.04    0  0.07  0.44  0.37   -0.30
19   2016  12   1   2.34  0.65    1  0.07  0.47  0.40   -0.32
Note. No. = number of study in chronological order; Year = year of publication; A# = article number (see Appendix); S# = study number within article; z = strength of evidence for or against FFH; OP = observed power; Sig. = significant (0 = no, 1 = yes); MOP = median observed power; SR = success rate; Inf. = inflation; R-Index = Replicability Index (MOP – Inf.).
Conclusion
Darwin was a great scientist. Since he published his influential theory of evolution in 1859, biology has made tremendous progress in understanding the process of evolution. The same cannot be said about Darwin’s theory of emotion. More than a hundred years later, psychologists are still debating the influence of facial feedback on emotional experiences. One reason for the slow progress in some areas of psychology is that original studies were often accepted as conclusive evidence without rigorous replication efforts. In addition, meta-analyses provided misleading results because they failed to take publication bias into account. The present replicability analysis showed that the pen-in-mouth paradigm never provided convincing evidence for facial feedback effects. Nevertheless, the original study was often cited as evidence for facial feedback effects. To make progress like other sciences, psychology needs to take empirical studies more seriously and ensure that important findings can be replicated before they become cornerstones of theories and textbook findings.
This replicability analysis is limited to the pen-in-mouth paradigm. Other paradigms may produce replicable results. However, the pen-in-mouth paradigm has been used because it addressed limitations of these paradigms such as demand effects. Thus, even if these paradigms were more successful, the underlying mechanism would be less clear. At present, the replicability analysis simply shows a lack of evidence for FFH, but it would be premature to conclude that facial feedback effects do not exist.
References
Buck, R. (1980). Nonverbal behavior and the theory of emotion: The facial feedback hypothesis. Journal of Personality and Social Psychology, 38, 811-824.
Coles, N. A., Larsen, J. T., & Lench, H. C. (2017). A meta-analysis of the facial feedback hypothesis literature. OSF preprint.
Darwin, C. (1872). The expression of emotions in man and animals. London: John Murray.
Kahneman, D. (2012). A proposal to deal with questions about priming effects.
Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis. Journal of Personality and Social Psychology. 54, 768–777.
Reber, R. (2016). Impure replications.
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.
Schimmack, U. (2014). A revised introduction to the R-Index.
Simons, D. J., & Holcombe, A. O. (2014). Registered Replication Reports.
Wagenmakers, E.-J., et al. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917-928.
Appendix: Articles used for Meta-Analysis
A1. Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis. Journal of Personality and Social Psychology. 54, 768–777.
A2. Soussignan, R. (2002). Duchenne Smile, Emotional Experience, and Autonomic Reactivity: A Test of the Facial Feedback Hypothesis. Emotion, 2, 52-74.
A3. Ito, T., Chiao, K. W., Devine, P. G., Lorig, T. S., & Cacioppo, J. T. (2006). The Influence of Facial Feedback on Race Bias. Psychological Science, 17, 256-261.
A4. Andreasson, P., & Dimberg, U. (2008). Emotional Empathy and Facial Feedback. Journal of Nonverbal Behavior, 32, 215-224.
A5. Wiswede, D., Münte, T. F., Krämer, U. M., & Rüsseler, J. (2009). Embodied emotion modulates neural signature of performance monitoring. PLoS ONE, 4, e5754, 1-6.
A6. Kraft, T. L., & Pressman, S. D. (2012). Grin and Bear It: The Influence of Manipulated Facial Expression on the Stress Response. Psychological Science, 23, 1372-1378.
A7. Paredes, B., Stavraki, M., Briñol, P., & Petty, R. E. (2013). Social Psychology, 44, 349-353.
A8. Marmolejo-Ramos, F. & Dunn, J. (2013). On the activation of sensorimotor systems during the processing of emotionally-laden stimuli. Universitas Psychologica, 12, 1511-1542.
A9. Rummer, R., Schweppe, J., Schlegelmilch, R., & Grice, M. (2014). Mood is linked to vowel type: The role of articulatory movements. Emotion, 14, 246-250.
A10. Dzokoto, V., Wallace, D. S., Peters, L., & Bentsi-Enchill, E. (2014). Attention to Emotion and Non-Western Faces: Revisiting the Facial Feedback Hypothesis. The Journal of General Psychology, 2014, 141(2), 151–168.
A11. Arminjon, M., Preissmann, D., Chmetz, F., Duraku, A., Ansermet, F., & Magistretti, P. J. (2015). Embodied memory: Unconscious smiling modulates emotional evaluation of episodic memories, Frontiers in Psychology, 6, 650, 1-7.
A12. Epstein, N., Brendel, T., Hege, I., Ouellette, D. L., Schmidmaier, R., & Kiesewetter, J. (2016). The power of the pen: how to make physicians more friendly and patients more attractive. Medical Education, 50, 1214–1218.
Authors: Ulrich Schimmack, Moritz Heene, and Kamini Kesavan
Abstract: We computed the R-Index for studies cited in Chapter 4 of Kahneman’s book “Thinking Fast and Slow.” This chapter focuses on priming studies, starting with John Bargh’s study that led to Kahneman’s open email. The results are eye-opening and jaw-dropping. The chapter cites 12 articles and 11 of the 12 articles have an R-Index below 50. The combined analysis of 31 studies reported in the 12 articles shows 100% significant results with average (median) observed power of 57% and an inflation rate of 43%. The R-Index is 14. This result confirms Kahneman’s prediction that priming research is a train wreck and readers of his book “Thinking Fast and Slow” should not consider the presented studies as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.
Introduction
In 2011, Nobel Laureate Daniel Kahneman published a popular book, “Thinking Fast and Slow”, about important findings in social psychology.
In the same year, questions about the trustworthiness of social psychology were raised. A Dutch social psychologist had fabricated data. Eventually over 50 of his articles would be retracted. Another social psychologist published results that appeared to demonstrate the ability to foresee random future events (Bem, 2011). Few researchers believed these results and statistical analysis suggested that the results were not trustworthy (Francis, 2012; Schimmack, 2012). Psychologists started to openly question the credibility of published results.
Early in 2012, Doyen and colleagues published a failure to replicate a prominent study by John Bargh that was featured in Daniel Kahneman’s book. A few months later, Daniel Kahneman distanced himself from Bargh’s research in an open email addressed to John Bargh (Young, 2012):
“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research… people have now attached a question mark to the field, and it is your responsibility to remove it… all I have personally at stake is that I recently wrote a book that emphasizes priming research as a new approach to the study of associative memory…Count me as a general believer… My reason for writing this letter is that I see a train wreck looming.”
Five years later, Kahneman’s concerns have been largely confirmed. Major studies in social priming research have failed to replicate and the replicability of results in social psychology is estimated to be only 25% (OSC, 2015).
Looking back, it is difficult to understand the uncritical acceptance of social priming as a fact. In “Thinking Fast and Slow” Kahneman wrote “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”
Yet, Kahneman could have seen the train wreck coming. In 1971, he co-authored an article about scientists’ “exaggerated confidence in the validity of conclusions based on small samples” (Tversky & Kahneman, 1971, p. 105). Yet, many of the studies described in Kahneman’s book had small samples. For example, Bargh’s priming study used only 30 undergraduate students to demonstrate the effect.
Replicability Index
Small samples can be sufficient to detect large effects. However, small effects require large samples. The probability of replicating a published finding is a function of sample size and effect size. The Replicability Index (R-Index) makes it possible to use information from published results to predict how replicable published results are.
Every reported test-statistic can be converted into an estimate of power, called observed power. For a single study, this estimate is useless because it is not very precise. However, for sets of studies, the estimate becomes more precise. If we have 10 studies and the average power is 55%, we would expect approximately 5 to 6 studies with significant results and 4 to 5 studies with non-significant results.
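As a quick illustration in R (a sketch assuming 10 independent studies that each have 55% power), the chance that all ten produce significant results is very small.

    # Probability that all 10 studies yield significant results when power is 55%.
    dbinom(10, size = 10, prob = .55)                      # about .0025, roughly 1 in 400
    pbinom(9, size = 10, prob = .55, lower.tail = FALSE)   # same quantity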
If we observe 100% significant results with an average power of 55%, it is likely that studies with non-significant results are missing (Schimmack, 2012). There are too many significant results. This is especially true because average observed power is also inflated when researchers report only significant results. Consequently, the true power is even lower than average observed power. If we observe 100% significant results with 55% average observed power, true power is likely to be less than 50%.
This is unacceptable. Tversky and Kahneman (1971) wrote “we refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis.”
To correct for the inflation in power, the R-Index uses the inflation rate. For example, if all studies are significant and average power is 75%, the inflation rate is 25% points. The R-Index subtracts the inflation rate from average power. So, with 100% significant results and average observed power of 75%, the R-Index is 50% (75% – 25% = 50%). The R-Index is not a direct estimate of true power. It is actually a conservative estimate of true power if the R-Index is below 50%. Thus, an R-Index below 50% suggests that a significant result was obtained only by capitalizing on chance, although it is difficult to quantify by how much.
How Replicable are the Social Priming Studies in “Thinking Fast and Slow”?
Chapter 4: The Associative Machine
4.1. Cognitive priming effect
In the 1980s, psychologists discovered that exposure to a word causes immediate and measurable changes in the ease with which many related words can be evoked.
[no reference provided]
4.2. Priming of behavior without awareness
Another major advance in our understanding of memory was the discovery that priming is not restricted to concepts and words. You cannot know this from conscious experience, of course, but you must accept the alien idea that your actions and your emotions can be primed by events of which you are not even aware.
“In an experiment that became an instant classic, the psychologist John Bargh and his collaborators asked students at New York University—most aged eighteen to twenty-two—to assemble four-word sentences from a set of five words (for example, “finds he it yellow instantly”). For one group of students, half the scrambled sentences contained words associated with the elderly, such as Florida, forgetful, bald, gray, or wrinkle. When they had completed that task, the young participants were sent out to do another experiment in an office down the hall. That short walk was what the experiment was about. The researchers unobtrusively measured the time it took people to get from one end of the corridor to the other.”
“As Bargh had predicted, the young people who had fashioned a sentence from words with an elderly theme walked down the hallway significantly more slowly than the others” — that is, they were walking slowly, a behavior associated with old age.
“All this happens without any awareness. When they were questioned afterward, none of the students reported noticing that the words had had a common theme, and they all insisted that nothing they did after the first experiment could have been influenced by the words they had encountered. The idea of old age had not come to their conscious awareness, but their actions had changed nevertheless.“
[John A. Bargh, Mark Chen, and Lara Burrows, “Automaticity of Social Behavior: Direct Effects of Trait Construct and Stereotype Activation on Action,” Journal of Personality and Social Psychology 71 (1996): 230–44.]
t(28) = 2.86, p = .008, z = 2.66, observed power = .76
t(28) = 2.16, p = .039, z = 2.06, observed power = .54
MOP = .65, Inflation = .35, R-Index = .30
4.3. Reversed priming: Behavior primes cognitions
“The ideomotor link also works in reverse. A study conducted in a German university was the mirror image of the early experiment that Bargh and his colleagues had carried out in New York.”
“Students were asked to walk around a room for 5 minutes at a rate of 30 steps per minute, which was about one-third their normal pace. After this brief experience, the participants were much quicker to recognize words related to old age, such as forgetful, old, and lonely.”
“Reciprocal priming effects tend to produce a coherent reaction: if you were primed to think of old age, you would tend to act old, and acting old would reinforce the thought of old age.”
t(18) = 2.10, p = .050, z = 1.96, observed power = .50
t(35) = 2.10, p = .043, z = 2.02, observed power = .53
t(31) = 2.50, p = .018, z = 2.37, observed power = .66
MOP = .53, Inflation = .47, R-Index = .06
4.4. Facial-feedback hypothesis (smiling makes you happy)
“Reciprocal links are common in the associative network. For example, being amused tends to make you smile, and smiling tends to make you feel amused….”
“College students were asked to rate the humor of cartoons from Gary Larson’s The Far Side while holding a pencil in their mouth. Those who were “smiling” (without any awareness of doing so) found the cartoons funnier than did those who were “frowning.”
[“Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis,” Journal of Personality and Social Psychology 54 (1988): 768–77.]
The authors used the more liberal and unconventional criterion of p < .05 (one-tailed), z = 1.65, as a criterion for significance. Accordingly, we adjusted the R-Index analysis and used 1.65 as the criterion value.
t(89) = 1.85, p = .034, z = 1.83, observed power = .57
t(75) = 1.78, p = .034, z = 1.83, observed power = .57
MOP = .57, Inflation = .43, R-Index = .14
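These values can be reproduced with the same logic as the formula above; this sketch sets the criterion to 1.65 because the original authors relied on one-tailed tests.

    # Observed power for the two Strack et al. studies with the one-tailed criterion.
    z  <- c(1.83, 1.83)
    op <- 1 - pnorm(1.65, z)          # about .57 for both studies
    mop <- median(op)
    r.index <- mop - (1 - mop)        # success rate is 100%, so R-Index is about .14
    round(c(MOP = mop, R.Index = r.index), 2)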
These results could not be replicated in a large replication effort with 17 independent labs. Not a single lab produced a significant result and even a combined analysis failed to show any evidence for the effect.
4.5. Automatic Facial Responses
In another experiment, people whose face was shaped into a frown (by squeezing their eyebrows together) reported an enhanced emotional response to upsetting pictures—starving children, people arguing, maimed accident victims.
[Ulf Dimberg, Monika Thunberg, and Sara Grunedal, “Facial Reactions to
The description in the book does not match any of the three studies reported in this article. The first two studies examined facial muscle movements in response to pictures of facial expressions (smiling or frowning faces). The third study used emotional pictures of snakes and flowers. We might consider the snake pictures as being equivalent to pictures of starving children or maimed accident victims. Participants were also asked to frown or to smile while looking at the pictures. However, the dependent variable was not how they felt in response to pictures of snakes, but rather how their facial muscles changed. Aside from a strong effect of instructions, the study also found that the emotional picture had an automatic effect on facial muscles. Participants frowned more when instructed to frown and looking at a snake picture than when instructed to frown and looking at a picture of a flower. “This response, however, was larger to snakes than to flowers as indicated by both the Stimulus factor, F(1, 47) = 6.66, p < .02, and the Stimulus × Interval factor, F(1, 47) = 4.30, p < .05.” (p. 463). The evidence for smiling was stronger. “The zygomatic major muscle response was larger to flowers than to snakes, which was indicated by both the Stimulus factor, F(1, 47) = 18.03, p < .001, and the Stimulus × Interval factor, F(1, 47) = 16.78, p < .001.” No measures of subjective experiences were included in this study. Therefore, the results of this study provide no evidence for Kahneman’s claim in the book and the results of this study are not included in our analysis.
4.6. Effects of Head-Movements on Persuasion
“Simple, common gestures can also unconsciously influence our thoughts and feelings.”
“In one demonstration, people were asked to listen to messages through new headphones. They were told that the purpose of the experiment was to test the quality of the audio equipment and were instructed to move their heads repeatedly to check for any distortions of sound. Half the participants were told to nod their head up and down while others were told to shake it side to side. The messages they heard were radio editorials.”
“Those who nodded (a yes gesture) tended to accept the message they heard, but those who shook their head tended to reject it. Again, there was no awareness, just a habitual connection between an attitude of rejection or acceptance and its common physical expression.”
F(2, 66) = 44.70, p = .000, z = 7.22, observed power = 1.00
MOP = 1.00, Inflation = .00, R-Index = 1.00
[Gary L. Wells and Richard E. Petty, “The Effects of Overt Head Movements on Persuasion: Compatibility and Incompatibility of Responses,” Basic and Applied Social Psychology, 1, (1980): 219–30.]
4.7 Location as Prime
“Our vote should not be affected by the location of the polling station, for example, but it is.”
“A study of voting patterns in precincts of Arizona in 2000 showed that the support for propositions to increase the funding of schools was significantly greater when the polling station was in a school than when it was in a nearby location.”
“A separate experiment showed that exposing people to images of classrooms and school lockers also increased the tendency of participants to support a school initiative. The effect of the images was larger than the difference between parents and other voters!”
[Jonah Berger, Marc Meredith, and S. Christian Wheeler, “Contextual Priming: Where People Vote Affects How They Vote,” PNAS 105 (2008): 8846–49.]
z = 2.10, p = .036, observed power = .56
p = .05, z = 1.96, observed power = .50
MOP = .53, Inflation = .47, R-Index = .06
4.8 Money Priming
“Reminders of money produce some troubling effects.”
“Participants in one experiment were shown a list of five words from which they were required to construct a four-word phrase that had a money theme (“high a salary desk paying” became “a high-paying salary”).”
“Other primes were much more subtle, including the presence of an irrelevant money-related object in the background, such as a stack of Monopoly money on a table, or a computer with a screen saver of dollar bills floating in water.”
“Money-primed people become more independent than they would be without the associative trigger. They persevered almost twice as long in trying to solve a very difficult problem before they asked the experimenter for help, a crisp demonstration of increased self-reliance.”
“Money-primed people are also more selfish: they were much less willing to spend time helping another student who pretended to be confused about an experimental task. When an experimenter clumsily dropped a bunch of pencils on the floor, the participants with money (unconsciously) on their mind picked up fewer pencils.”
“In another experiment in the series, participants were told that they would shortly have a get-acquainted conversation with another person and were asked to set up two chairs while the experimenter left to retrieve that person. Participants primed by money chose to stay much farther apart than their nonprimed peers (118 vs. 80 centimeters).”
“Money-primed undergraduates also showed a greater preference for being alone.”
[Kathleen D. Vohs, “The Psychological Consequences of Money,” Science 314 (2006): 1154–56.]
F(2, 49) = 3.73, p = .031, z = 2.16, observed power = .58
t(35) = 2.03, p = .050, z = 1.96, observed power = .50
t(37) = 2.06, p = .046, z = 1.99, observed power = .51
t(42) = 2.13, p = .039, z = 2.06, observed power = .54
F(2, 32) = 4.34, p = .021, z = 2.30, observed power = .63
t(38) = 2.13, p = .040, z = 2.06, observed power = .54
t(33) = 2.37, p = .024, z = 2.26, observed power = .62
F(2, 58) = 4.04, p = .023, z = 2.28, observed power = .62
chi^2(2) = 10.10, p = .006, z = 2.73, observed power = .78
MOP = .58, Inflation = .42, R-Index = .16
4.9 Death Priming
“The evidence of priming studies suggests that reminding people of their mortality increases the appeal of authoritarian ideas, which may become reassuring in the context of the terror of death.”
The cited article does not directly examine this question. The abstract states that “three experiments were conducted to test the hypothesis, derived from terror management theory, that reminding people of their mortality increases attraction to those who consensually validate their beliefs and decreases attraction to those who threaten their beliefs” (p. 308). Study 2 found no general effect of death priming. Rather, the effect was qualified by authoritarianism: “Mortality salience enhanced the rejection of dissimilar others in Study 2 only among high authoritarian subjects” (p. 314), based on a three-way interaction with F(1, 145) = 4.08, p = .045. We used the three-way interaction for the computation of the R-Index. Study 1 reported opposite effects for ratings of Christian targets, t(44) = 2.18, p = .034, and Jewish targets, t(44) = 2.08, p = .043. As these tests are dependent, only one test could be used, and we chose the slightly stronger result. Similarly, Study 3 reported significantly more liking of a positive interviewee and less liking of a negative interviewee, t(51) = 2.02, p = .049, and t(49) = 2.42, p = .019, respectively. We chose the stronger effect.
[Jeff Greenberg et al., “Evidence for Terror Management Theory II: The Effect of Mortality Salience on Reactions to Those Who Threaten or Bolster the Cultural Worldview,” Journal of Personality and Social Psychology]
t(44) = 2.18, p = .035, z = 2.11, observed power = .56
F(1, 145) = 4.08, p = .045, z = 2.00, observed power = .52
t(49) = 2.42, p = .019, z = 2.34, observed power = .65
MOP = .56, Inflation = .44, R-Index = .12
4.10 The “Lady Macbeth Effect”
“For example, consider the ambiguous word fragments W_ _ H and S_ _ P. People who were recently asked to think of an action of which they are ashamed are more likely to complete those fragments as WASH and SOAP and less likely to see WISH and SOUP.”
“Furthermore, merely thinking about stabbing a coworker in the back leaves people more inclined to buy soap, disinfectant, or detergent than batteries, juice, or candy bars. Feeling that one’s soul is stained appears to trigger a desire to cleanse one’s body, an impulse that has been dubbed the “Lady Macbeth effect.”
[Lady Macbeth effect: Chen-Bo Zhong and Katie Liljenquist, “Washing Away Your Sins: Threatened Morality and Physical Cleansing,” Science 313 (2006): 1451–52.]
F(1, 58) = 4.26, p = .044, z = 2.02, observed power = .52
F(1, 25) = 6.99, p = .014, z = 2.46, observed power = .69
MOP = .61, Inflation = .39, R-Index = .22
The article reports two more studies that are not explicitly mentioned, but are used as empirical support for the Lady Macbeth effect. As the results of these studies were similar to those in the mentioned studies, including these tests in our analysis does not alter the conclusions.
chi^2(1) = 4.57, p = .033, z = 2.14, observed power = .57
chi^2(1) = 5.02, p = .025, z = 2.24, observed power = .61
MOP = .59, Inflation = .41, R-Index = .18
4.11 Modality Specificity of the “Lady Macbeth Effect”
“Participants in an experiment were induced to “lie” to an imaginary person, either on the phone or in e-mail. In a subsequent test of the desirability of various products, people who had lied on the phone preferred mouthwash over soap, and those who had lied in e-mail preferred soap to mouthwash.”
[Spike Lee and Norbert Schwarz, “Dirty Hands and Dirty Mouths: Embodiment of the Moral-Purity Metaphor Is Specific to the Motor Modality Involved in Moral Transgression,” Psychological Science 21 (2010): 1423–25.]
The results are presented as significant with a one-sided t-test. “As shown in Figure 1a, participants evaluated mouthwash more positively after lying in a voice mail (M = 0.21, SD = 0.72) than after lying in an e-mail (M = –0.26, SD = 0.94), F(1, 81) = 2.93, p = .03 (one-tailed), d = 0.55 (simple main effect), but evaluated hand sanitizer more positively after lying in an e-mail (M = 0.31, SD = 0.76) than after lying in a voice mail (M = –0.12, SD = 0.86), F(1, 81) = 3.25, p = .04 (one-tailed), d = 0.53 (simple main effect).” We adjusted the significance criterion for the R-Index accordingly.
F(1, 81) = 2.93, p = .045, z = 1.69, observed power = .52
F(1, 81) = 3.25, p = .038, z = 1.78, observed power = .55
MOP = .54, Inflation = .46, R-Index = .08
4.12 Eyes on You
“On the first week of the experiment (which you can see at the bottom of the figure), two wide-open eyes stare at the coffee or tea drinkers, whose average contribution was 70 pence per liter of milk. On week 2, the poster shows flowers and average contributions drop to about 15 pence. The trend continues. On average, the users of the kitchen contributed almost three times as much in ’eye weeks’ as they did in ’flower weeks.’ ”
[Melissa Bateson, Daniel Nettle, and Gilbert Roberts, “Cues of Being Watched Enhance Cooperation in a Real-World Setting,” Biology Letters 2 (2006): 412–14.]
F(1, 7) = 11.55, p = .011, z = 2.53, observed power = .72
MOP = .72, Inflation = .28, R-Index = .44
Combined Analysis
We then combined the results from the 31 studies mentioned above. While the R-Index for small sets of studies may underestimate replicability, the R-Index for a large set of studies is more accurate. Median observed power for all 31 studies is only 57%. It is incredible that 31 studies with 57% power could produce 100% significant results (Schimmack, 2012). Thus, there is strong evidence that the studies provide an overly optimistic image of the robustness of social priming effects. Moreover, median observed power overestimates true power if studies were selected to be significant. After correcting for inflation, the R-Index is well below 50%. This suggests that the studies have low replicability. Moreover, it is possible that some of the reported results are actually false positive results. Just as the large-scale replication of the facial feedback studies failed to provide any support for the original findings, other studies may fail to show any effects in large replication projects. As a result, readers of “Thinking Fast and Slow” should be skeptical about the reported results, and they should disregard Kahneman’s statement that “you have no choice but to accept that the major conclusions of these studies are true.” Our analysis actually leads to the opposite conclusion: “You should not accept any of the conclusions of these studies as true.”
k = 31, MOP = .57, Inflation = .43, R-Index = .14, Grade: F for Fail
Powergraph of Chapter 4
Schimmack and Brunner (2015) developed an alternative method for the estimation of replicability. This method takes into account that power can vary across studies. It also provides 95% confidence intervals for the replicability estimate. The results of this method are presented in the Figure above. The replicability estimate is similar to the R-Index, with 14% replicability. However, due to the small set of studies, the 95% confidence interval is wide and includes values above 50%. This does not mean that we can trust the published results, but it does suggest that some of the published results might be replicable in larger replication studies with more power to detect small effects. At the same time, the graph shows clear evidence for a selection effect. That is, published studies in these articles do not provide a representative picture of all the studies that were conducted. The powergraph shows that there should have been a lot more non-significant results than were reported in the published articles. The selective reporting of studies that worked is at the core of the replicability crisis in social psychology (Sterling, 1959, Sterling et al., 1995; Schimmack, 2012). To clean up their act and to regain trust in published results, social psychologists have to conduct studies with larger samples that have more than 50% power (Tversky & Kahneman, 1971) and they have to stop reporting only significant results. We can only hope that social psychologists will learn from the train wreck of social priming research and improve their research practices.