“Trust is good, but control is better”
Information about the replicability of published results is important because empirical results can only be used as evidence if the results can be replicated. However, the replicability of published results in social psychology is doubtful. Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results is, if the studies were replicated exactly. In a replicability audit, I apply z-curve to the most cited articles of psychologists to estimate the replicability of their studies.
John A. Bargh
Bargh is an eminent social psychologist (H-Index in WebofScience = 61). He is best known for his claim that unconscious processes have a strong influence on behavior. Some of his most cited articles used subliminal or unobtrusive priming to provide evidence for this claim.
Bargh also played a significant role in the replication crisis in psychology. In 2012, a group of researchers failed to replicate his famous “elderly priming” study (Doyen et al., 2012). He responded with a personal attack that was covered in various news reports (Bartlett, 2013). It also triggered a response by psychologist and Nobel Laureate Daniel Kahneman, who wrote an open letter to Bargh (Young, 2012).
“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research.”
Kahneman also asked Bargh and other social priming researchers to conduct credible replication studies to demonstrate that the effects are real. However, seven years later neither Bargh nor other prominent social priming researchers have presented new evidence that their old findings can be replicated.
Instead, other researchers have conducted replication studies and produced further replication failures. As a result, confidence in social priming is decreasing, as reflected in Bargh’s citation counts (Figure 1).
Figure 1. John A. Bargh’s citation counts in Web of Science (3/17/19)
In this blog post, I examine the replicability and credibility of John A. Bargh’s published results using a statistical approach, z-curve (Brunner & Schimmack, 2018). It is well known that psychology journals only publish confirmatory evidence with statistically significant results, p < .05 (Sterling, 1959). This selection for significance is the main cause of the replication crisis in psychology. When only significant results are published, every published result appears to be a success (we never see replication failures), which makes it impossible to distinguish results that can be replicated from results that cannot be replicated.
While selection for significance makes success rates uninformative, the strength of evidence against the null-hypothesis (signal/noise or effect size / sampling error) does provide information about replicability. Studies with higher signal to noise ratios are more likely to replicate. Z-curve uses z-scores as the common metric of signal-to-noise ratio for studies that used different test statistics. The distribution of observed z-scores provides valuable information about the replicability of a set of studies. If most z-scores are close to the criterion for statistical significance (z = 1.96), replicability is low.
Given the requirement to publish significant results, researchers had two options for meeting this goal. One option requires obtaining large samples to reduce sampling error and thereby increase the signal-to-noise ratio. The other option is to conduct studies with small samples and conduct multiple statistical tests. Multiple testing increases the probability of obtaining a significant result with the help of chance. This strategy is more efficient in producing significant results, but these results are less replicable because a replication study will not be able to capitalize on chance again. The latter strategy is called a questionable research practice (John et al., 2012), and it produces questionable results because it is unknown how much chance contributed to the observed significant result. Z-curve reveals how much a researcher relied on questionable research practices to produce significant results.
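To illustrate how multiple testing capitalizes on chance, here is a minimal sketch. It assumes that all tests are independent and that every null hypothesis is true, which is a simplification. The function name is hypothetical and not part of z-curve.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests is
    significant at level `alpha` when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Show how quickly the chance of a "success" grows with more tests.
    for k in (1, 2, 5, 10):
        print(f"{k:2d} tests: {chance_of_false_positive(k):.3f}")
```

Under these assumptions, ten tests already give roughly a 40% chance of at least one significant result without any true effect.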
I used WebofScience to identify the most cited articles by John A. Bargh (datafile). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 43 empirical articles (H-Index = 41). The 43 articles reported 111 studies (average 2.6 studies per article). The total number of participants was 7,810 with a median of 56 participants per study. For each study, I identified the most focal hypothesis test (MFHT). The result of the test was converted into an exact p-value and the p-value was then converted into a z-score. The z-scores were submitted to a z-curve analysis to estimate mean power of the 100 results that were significant at p < .05 (two-tailed). Four studies did not produce a significant result. The remaining 7 results were interpreted as evidence with lower standards of significance. Thus, the success rate for 111 reported hypothesis tests was 96%. This is a typical finding in psychology journals (Sterling, 1959).
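The conversion from p-values to z-scores and the success rate can be sketched as follows. This is an illustration using Python's standard library, not the code used for the audit. The function name is hypothetical, and the counts are the ones reported in the text.

```python
from statistics import NormalDist


def p_to_z(p_two_tailed: float) -> float:
    """Convert a two-tailed p-value into an absolute z-score."""
    # Using the lower tail (p / 2) stays accurate for very small p-values.
    return -NormalDist().inv_cdf(p_two_tailed / 2.0)


# Counts reported in the text.
total_tests = 111
significant_at_05 = 100
significant_lower_standard = 7

success_rate = (significant_at_05 + significant_lower_standard) / total_tests

if __name__ == "__main__":
    print(f"z for p = .05: {p_to_z(0.05):.2f}")
    print(f"success rate: {success_rate:.0%}")
```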
The z-curve estimate of replicability is 29% with a 95%CI ranging from 15% to 38%. Even at the upper end of the 95% confidence interval this is a low estimate. The average replicability is lower than for social psychology articles in general (44%, Schimmack, 2018) and for other social psychologists. At present, only one audit has produced an even lower estimate (Replicability Audits, 2019).
The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The area under the grey curve is an estimate of the file drawer of studies that need to be conducted to achieve 100% successes if hiding replication failures were the only questionable research practice that is used. The ratio of the area of non-significant results to the area of all significant results (including z-scores greater than 6) is called the File Drawer Ratio. Although this is just a projection, and other questionable practices may have been used, the file drawer ratio of 7.53 suggests that for every published significant result about 7 studies with non-significant results remained unpublished. Moreover, often the null-hypothesis may be false, but the effect size is very small and the result is still difficult to replicate. When the definition of a false positive includes studies with very low power, the false positive estimate increases to 50%. Thus, about half of the published studies are expected to produce replication failures.
Finally, z-curve examines heterogeneity in replicability. Studies with p-values close to .05 are less likely to replicate than studies with p-values less than .0001. This fact is reflected in the replicability estimates for segments of studies that are provided below the x-axis. Without selection for significance, z-scores of 1.96 correspond to 50% replicability. However, we see that selection for significance lowers this value to just 14% replicability. Thus, we would not expect that published results with p-values that are just significant would replicate in actual replication studies. Even z-scores in the range from 3 to 3.5 average only 32% replicability. Thus, only studies with z-scores greater than 3.5 can be considered to provide some empirical evidence for this claim.
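The 50% benchmark for z = 1.96 can be checked with a short calculation. The sketch assumes that the observed z-score equals the true signal-to-noise ratio (no selection for significance) and that the replication has the same sampling error. It is not the z-curve algorithm, which corrects for selection and therefore produces lower estimates. The function name is hypothetical.

```python
from statistics import NormalDist


def replication_power(true_z: float, alpha: float = 0.05) -> float:
    """Probability of a significant two-tailed result in an exact replication
    when the expected z-score of the test statistic is `true_z`."""
    nd = NormalDist()
    crit = nd.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = .05
    # Significant in the same direction, plus the tiny chance of a
    # significant result in the opposite direction.
    return (1.0 - nd.cdf(crit - true_z)) + nd.cdf(-crit - true_z)


if __name__ == "__main__":
    for z in (1.96, 3.0, 3.5):
        print(f"true z = {z}: power = {replication_power(z):.2f}")
```

Comparing these naive values with the estimates below the x-axis shows how much selection for significance inflates observed z-scores.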
Inspection of the datafile shows that z-scores greater than 3.5 were consistently obtained in 2 out of the 43 articles. Both articles used a more powerful within-subject design.
The automatic evaluation effect: Unconditional automatic attitude activation with a pronunciation task (JPSP, 1996)
Subjective aspects of cognitive control at different stages of processing (Attention, Perception, & Psychophysics, 2009).
John A. Bargh’s work on unconscious processes with unobtrusive priming tasks is at the center of the replication crisis in psychology. This replicability audit suggests that this is not an accident. The low replicability estimate and the large file-drawer estimate suggest that replication failures are to be expected. As a result, published results cannot be interpreted as evidence for these effects.
So far, John Bargh has ignored criticism of his work. In 2017, he published a popular book about his work on unconscious processes. The book did not mention doubts about the reported evidence, while a z-curve analysis showed low replicability of the cited studies (Schimmack, 2017).
Recently, another study by John Bargh failed to replicate (Chabris et al., in press). Jesse Singal wrote a blog post about this replication failure (Research Digest), and John Bargh wrote a lengthy comment.
In the commentary, Bargh lists several studies that successfully replicated the effect. However, listing studies with significant results does not provide evidence for an effect unless we know how many studies failed to demonstrate the effect and often we do not know this because these studies are not published. Thus, Bargh continues to ignore the pervasive influence of publication bias.
Bargh then suggests that the replication failure was caused by a hidden moderator which invalidates the results of the replication study.
One potentially important difference in procedure is the temperature of the hot cup of coffee that participants held: was the coffee piping hot (so that it was somewhat uncomfortable to hold) or warm (so that it was pleasant to hold)? If the coffee was piping hot, then, according to the theory that motivated W&B, it should not activate the concept of social warmth – a positively valenced, pleasant concept. (“Hot” is not the same as just more “warm”, and actually participates in a quite different metaphor – hot vs. cool – having to do with emotionality.) If anything, an uncomfortably hot cup of coffee might be expected to activate the concept of anger (“hot-headedness”), which is antithetical to social warmth. With this in mind, there are good reasons to suspect that in C&S, the coffee was, for many participants, uncomfortably hot. Indeed, C&S purchased a hot or cold coffee at a coffee shop and then immediately handed that coffee to passersby who volunteered to take the study. Thus, the first few people to hold a hot coffee likely held a piping hot coffee (in contrast, W&B’s coffee shop was several blocks away from the site of the experiment, and they used a microwave for subsequent participants to keep the coffee at a pleasantly warm temperature). Importantly, C&S handed the same cup of coffee to as many as 7 participants before purchasing a new cup. Because of that feature of their procedure, we can check if the physical-to-social warmth effect emerged after the cups were held by the first few participants, at which point the hot coffee (presumably) had gone from piping hot to warm.
He overlooks that his original study produced only weak evidence for the effect with a p-value of .0503, which is technically not below the .05 criterion for significance. As shown in the z-curve plot, results with a p-value of .0503 have an average replicability of only 13%. Moreover, the 95%CI for the effect size touches 0. Thus, the original study did not rule out that the effect size is extremely small and has no practical significance. To make any claims that the effect of holding a warm cup on affection is theoretically relevant for our understanding of affection would require studies with larger samples and more convincing evidence.
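A short calculation shows why a p-value of .0503 implies a confidence interval that touches zero. The sketch works in z-score units (effect size divided by sampling error) and assumes a normal approximation. The variable names are illustrative.

```python
from statistics import NormalDist

nd = NormalDist()

p_original = 0.0503                # p-value of the original study, from the text
z_original = -nd.inv_cdf(p_original / 2.0)
z_critical = nd.inv_cdf(0.975)     # two-tailed criterion for alpha = .05

# 95% confidence interval around the observed z-score, in z-score units.
ci_lower = z_original - z_critical
ci_upper = z_original + z_critical

if __name__ == "__main__":
    print(f"observed z: {z_original:.3f} (criterion {z_critical:.3f})")
    print(f"95% CI in z units: [{ci_lower:.3f}, {ci_upper:.3f}]")
```

Because the observed z-score falls just short of the criterion, the lower bound of the interval is just below zero.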
At the end of his commentary, John A. Bargh assures readers that he is purely motivated by a search for the truth.
Let me close by affirming that I share your goal of presenting the public with accurate information as to the state of the scientific evidence on any finding I discuss publicly. I also in good faith seek to give my best advice to the public at all times, again based on the present state of evidence. Your and my assessments of that evidence might differ, but our motivations are the same.
Let me be crystal clear. I have no reasons to doubt that John A. Bargh believes what he says. His conscious mind sees himself as a scientist who employs the scientific method to provide objective evidence. However, Bargh himself would be the first to acknowledge that our conscious mind is not fully aware of the actual causes of human behavior. I submit that his response to criticism of his work shows that he is less capable of being objective than he thinks he is. I would be happy to be proven wrong in a response by John A. Bargh to my scientific criticism of his work. So far, eminent social psychologists have preferred to remain silent about the results of their replicability audits.
It is nearly certain that I made some mistakes in the coding of John A. Bargh’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit. The data are openly available and can be submitted to a z-curve analysis using a Shiny app. Thus, this replicability audit is fully transparent and open to revision.
Many psychologists do not take this work seriously because it has not been peer-reviewed. However, nothing is stopping them from conducting a peer-review of this work and publishing the results of their review as a commentary here or elsewhere. Thus, the lack of peer-review is not a reflection of the quality of this work, but rather a reflection of the unwillingness of social psychologists to take criticism of their work seriously.
If you found this audit interesting, you might also be interested in other replicability audits of eminent social psychologists.
6 thoughts on “Replicability Audit of John A. Bargh”
“John A. Bargh’s work on unconscious processes with unobtrusive priming task is at the center of the replication crisis in psychology.”
Bargh’s, and other similar work, is so 2000. They got to get with the program! The new “hip” thing to do is “crowdsourcing”. That’s when you “solve” the replication crisis by making sure A) nobody can even attempt to replicate your work anymore, and B) p-hacking and selective reporting can still be done, but in a slightly different way.
You do A) by using (or wasting?) tons of participants, and measuring tons of variables. “Findings” coming from these studies can almost certainly never be replicated, because nobody is able to muster up that many participants ever again.
You do B) by gathering tons of data, and making your data available. That way, p-hacking and selective reporting can still be done, but it doesn’t seem like it at first glance. That’s because the p-hacking and selective reporting happens by separate papers, by different researchers, and over a longer period of time.
For instance, how can researchers control for multiple analyses by adjusting the p-value when they don’t know how many others are/have been analyzing the open data set you are currently working on? (p-hacking?). And, what do you think the chances are that researchers will “explore” the data set in tons of different ways, and then consciously or unconsciously, only “pre-register” the analyses they are subsequently “confirming” in their to be written paper? (p-hacking? selective reporting?). And, what do you think the chances are that researchers will only write about “findings” they want to find, but not those they don’t want to find, when analyzing the open data set? (selective reporting?).
Without the use of sarcasm (like i partially did in the above), i sincerely wonder if the above reasoning makes any sense. I am seriously starting to wonder if large data sets may come with whole new versions of replicability problems, p-hacking, selective reporting, etc. If any of this makes sense, i think a possibly useful paper about this all can be written.
To further illustrate the possible problematic issues stemming from large data sets, you could even try and find a specific data set that has been used dozens of times by different papers to look at all the findings from these separate papers, but then adjust them for multiple analyses and see which are still “statistically significant” after adjusting the p-values.
To really bring the point home in a possible paper about this, i have also started to wonder whether splitting large data sets in an exploratory and confirmatory set could lead to “confirming” many spurious findings, and researchers fooling themselves in a whole new different way!? I understood that many spurious results can be found in large data sets. For instance see Standing, Sproul, & Khouzam (1991) “Empirical statistics: IV illustrating Meehl’s sixth law of soft psychology: everything correlates with everything” https://journals.sagepub.com/doi/10.2466/pr0.19188.8.131.52.
Now if this is correct, i reason large data sets with many variables may contain many spurious findings. I also reason that the larger the data set, the more likely it is that the exploratory and confirmatory sets resulting from a split are very similar. If these 2 things make sense, i then reason researchers may begin to fool themselves, and others, again in a whole different way!
Thanks. I have been wondering what happened to the criticisms of Bargh’s work and the nonreplication of his studies. Thanks for summarizing it here.