“Trust is good, but control is better”
Sir Ronald Fisher emphasized that a significant p-value is not sufficient evidence for a scientific claim. Other scientists should be able to replicate the study and reproduce a significant result most of the time. Neyman and Pearson formalized this idea when they distinguished type-I errors (false positive results) from type-II errors (false negative results). Good experiments should have a low risk of both type-I and type-II errors.
To reduce type-II errors, researchers need to conduct studies with a good signal-to-noise ratio. That is, the population effect size needs to be considerably larger than sampling error, so that the observed signal-to-noise ratio in a sample is large enough to exceed the criterion value for statistical significance. In practice, the significance criterion of p < .05 (two-tailed) corresponds roughly to a signal-to-noise ratio of 2:1 (z = 1.96). With a 3:1 ratio of the population effect size over sampling error, the probability of a type-II error is only 15%, and the chance of replicating a significant result in an exact replication study is 1 − 0.15 = 0.85, or 85%.
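Under the simple normal model used here, this arithmetic is easy to verify. A minimal sketch (the function name `replication_power` is mine, not part of any package; it ignores the negligible chance of a significant result in the wrong tail):

```python
from statistics import NormalDist

def replication_power(snr, alpha=0.05):
    """Probability of obtaining p < alpha (two-tailed) in an exact
    replication, given the population signal-to-noise ratio (true z)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = .05
    return 1 - NormalDist().cdf(z_crit - snr)

# A 3:1 signal-to-noise ratio implies ~85% power (type-II error ~15%)
print(round(replication_power(3.0), 2))  # -> 0.85
```

Note that a population signal-to-noise ratio exactly at the criterion (z = 1.96) yields only 50% power, which is why a comfortable margin over the criterion is needed.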
Unfortunately, psychologists are not trained to conduct formal power analyses and often conduct studies with low power (Cohen, 1962). The most direct consequence of this practice is that researchers often fail to find significant results and replication studies fail to confirm original discoveries. However, these replication failures often remain hidden because psychologists use a number of questionable research practices to avoid reporting them. To this day, the use of these practices is not considered a violation of research ethics, although they clearly undermine the validity of published results. As Sterling (1959) pointed out, once published results are selected for significance, reporting p < .05 is meaningless because the risk of type-I errors can be much higher than 5%.
In short, the replicability of published results in psychology journals is unknown. Since 2011, a number of publications suggest that many findings in experimental social psychology have low replicability. The Open Science Collaboration conducted actual replication studies and found that only 25% of experiments in social psychology could be replicated. The success rate for the typical between-subject experiment was only 4%.
Brunner and Schimmack (2018) developed a statistical tool to estimate replicability called z-curve. When I applied z-curve to a representative sample of between-subject experimental social psychology (BS-ESP) results, I obtained an estimate of 32% with a 95%CI ranging from 23% to 39% (Schimmack, 2018). This estimate implies that experimental social psychologists are using questionable research practices to inflate their success rate in published articles from about 30% to 95% (Sterling et al., 1995). Thus, the evidence for many claims in social psychology textbooks and popular books (e.g., Bargh, 2017; Kahneman, 2011) is much weaker than the published literature suggests.
Z-curve makes it possible to use the results of published articles to estimate the replicability of published results. This makes it possible to reexamine the published literature and estimate actual type-I and type-II error rates in experimental social psychology. Using z-curve, I posted replicability rankings of eminent social psychologists (Schimmack, 2018). Although these results have heuristic value, they are still overly optimistic because they are based on all published test statistics that were automatically extracted from articles. The average replicability estimate was 62%, which is considerably higher than the 30% estimate for focal hypothesis tests in Motyl et al.'s dataset. Thus, a thorough investigation of replicability requires hand-coding of focal hypothesis tests. Because z-curve assumes independence of test statistics, the most focal hypothesis test (MFHT) has to be identified for each study. This blog post reports the results of the first replicability analysis based on MFHTs in the most important articles of an eminent social psychologist.
I call a z-curve analysis of authors' MFHTs an audit. The term audit is apt because published results are based on authors' statistical analyses, and it is assumed that researchers conducted these analyses properly, without the use of questionable practices. In the same way, tax returns are completed by taxpayers or their tax lawyers, and it is assumed that they followed tax laws in doing so. While trust is good, control is better, and tax agencies randomly select some tax returns to check that taxpayers followed the rules. Just imagine what tax returns would look like if they were not audited. Until recently, this was the case for scientific publications in psychology. Researchers could use questionable research practices to inflate effect sizes and the percentage of successes without any concern that their numbers would be audited. Z-curve makes it possible to audit psychologists without access to the actual data.
I chose Roy F. Baumeister for my first audit for several reasons. Most important, Baumeister is objectively the most eminent social psychologist, with an H-index of 100. Just as it is more interesting to audit Donald Trump's tax returns than Mike Pence's, it is more interesting to learn about the replicability of Roy Baumeister's results than the results of, for example, Harry Reis.
Another reason is that Roy Baumeister is best known for his ego-depletion theory of self-control, and a major replication study failed to replicate the ego-depletion effect. In addition, a meta-analysis showed that published ego-depletion studies reported vastly inflated effect sizes. I also conducted a z-curve analysis of focal hypothesis tests in the ego-depletion literature and found evidence that questionable research practices were used to produce support for ego depletion (Schimmack, 2016). Taken together, these findings raise concerns about the research practices behind the published ego-depletion evidence, and it is possible that similar practices were used to support other claims in Baumeister's articles. Thus, it seemed worthwhile to conduct an audit of Baumeister's most important articles.
I used WebofScience to identify the most cited articles by Roy F. Baumeister (datafile ). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 69 empirical articles (H-Index = 69). The 69 articles reported 241 studies (an average of 3.5 studies per article). The total number of participants was 22,576, with a mean of 94 and a median of 58 participants per study. For each study, I identified the most focal hypothesis test (MFHT). The result of the test was converted into an exact p-value, and the p-value was then converted into a z-score. The 241 z-scores were submitted to a z-curve analysis to estimate the mean power of the 222 results that were significant at p < .05 (two-tailed). The remaining 19 results were interpreted as evidence using lower standards of significance. Thus, the success rate for the 241 studies was 100%: not a single study reported a failure to support a prediction, implying a phenomenal type-II error rate of zero.
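The p-to-z conversion used in this coding step is a standard inverse-normal transformation. A minimal sketch (the helper name `p_to_z` is mine; it assumes an exact two-tailed p-value and returns the absolute z-score):

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert an exact two-tailed p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 2))   # -> 1.96, the significance criterion
print(round(p_to_z(0.001), 2))  # -> 3.29
```

A just-significant result (p = .05) thus maps onto z = 1.96, the lower bound of the significant z-scores entering the z-curve analysis.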
The z-curve estimate of actual replicability is 20% with a 95%CI ranging from 10% to 33%. The complementary interpretation of this result is that the actual type-II error rate is 80% compared to the 0% failure rate in the published articles.
The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The large area under the grey curve is an estimate of the file drawer of studies that would need to be conducted to achieve 100% successes with just 20% average power. It is unlikely that dropping studies with non-significant results was the only questionable research practice that was used. Thus, the actual file drawer is likely to be smaller. Nevertheless, the figure makes it clear that the reported results are just the tip of the iceberg of empirical attempts that were made to produce significant results that appear to support theoretical predictions.
Z-curve is under development and offers additional information beyond the replicability of significant results. One new feature is an estimate of the maximum number of false positive results. This estimate is a maximum because it is empirically impossible to distinguish true false positives (effect size of zero) from true positives with negligible effect sizes (e.g., an effect size of 0.000001). To estimate the maximum false discovery rate, z-curve is fitted with a fixed percentage of false positives, and the fit of this model is compared to the unconstrained model. If fit is very similar, it is possible that the set of results contains the specified amount of false positives. The estimate for Roy F. Baumeister's most important results is that up to 70% of published results could be false positives or true positives with tiny effect sizes. This suggests that the observed z-scores could be a mixture of 70% false positives and 30% results with a mean power of 55%, which reproduces the estimate of 20% average power (.70 × .05 + .30 × .55 = .20).
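The mixture arithmetic in parentheses can be checked directly: it is a weighted average of power across the two components, with false positives contributing power equal to the significance criterion alpha.

```python
# Weighted average of power across the mixture:
# 70% false positives contribute power = alpha (.05),
# 30% true positives contribute their mean power (.55).
fdr_max = 0.70    # maximum share of false positives
alpha = 0.05      # a false positive is "replicated" only by a type-I error
true_power = 0.55

mean_power = fdr_max * alpha + (1 - fdr_max) * true_power
print(round(mean_power, 2))  # -> 0.2, the 20% average-power estimate
```

The key point is that false positives are not expected to replicate at better than the 5% type-I error rate, which is why a large false-positive share drags average power down so strongly.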
Z-curve also provides estimates of mean power for different intervals on the x-axis. As the observed evidence against the null-hypothesis increases (decreasing p-values, increasing z-scores), mean power increases. For high z-scores of 6 or higher, power is essentially 1, and we would expect any result with an observed z-score greater than 6 to replicate in an exact replication study. As can be seen below the x-axis of the figure, z-scores from 2 to 2.5 have a mean power of only 14%, and z-scores between 2.5 and 3 have a mean power of only 17%. Only z-scores greater than 4 have at least 50% power, and z-scores greater than 5 are needed for the recommended level of 80% power (Cohen, 1988). Only 11 out of 241 tests yielded a z-score greater than 4, and only 6 yielded a z-score greater than 5.
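For intuition, one can compare these z-curve estimates with the naive power implied by taking an observed z-score at face value (a sketch; the function name `naive_power` is mine, and it assumes the true signal-to-noise ratio equals the observed z, which selection for significance makes overly optimistic):

```python
from statistics import NormalDist

def naive_power(z_true, alpha=0.05):
    """Replication power if the true signal-to-noise ratio equaled z_true."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    return 1 - NormalDist().cdf(z_crit - z_true)

# Taken at face value, a z-score in the middle of the 2-2.5 interval
# would imply ~61% power, far above z-curve's 14% estimate for that bin.
print(round(naive_power(2.25), 2))  # -> 0.61
print(round(naive_power(6.0), 3))   # -> 1.0
```

The gap between the naive calculation and z-curve's interval estimates illustrates how strongly selection for significance inflates observed z-scores; only for very large z-scores (6 or more) do the two converge on power near 1.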
In conclusion, the replicability audit of Roy F. Baumeister shows that published results were obtained with a low probability of producing a significant result. As a result, exact replication studies also have a low probability of reproducing a significant result. As noted a long time ago by Sterling (1959), statistical significance loses its meaning when results are selected for significance. Given the low replicability estimates and the high risk of false positive results, the significant results in Baumeister's articles provide no empirical evidence for his claims because the type-I and type-II error risks are too high. The only empirical evidence provided in these 69 articles are the 6 or 11 results with z-scores greater than 5 or 4, respectively.
Unlike tax audits by revenue agencies, my replicability audits have no real consequences when questionable research practices are discovered. Roy F. Baumeister followed accepted practices in social psychology and did nothing unethical by the lax standards of research ethics in psychology. That is, he did not commit research fraud. He might even argue that he was better at playing the game social psychologists were playing, which is producing as many significant results as possible without worrying about replicability. This prevalent attitude among social psychologists was most clearly expressed by another famous social psychologist, who produced incredible and irreproducible results.
“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?’” (Daryl J. Bem, in Engber, 2017)
Not everybody may be as indifferent to replicability. For consumers interested in replicable empirical findings, it is surely interesting to know how replicable published results are. For example, Nobel Laureate Daniel Kahneman might not have featured Roy Baumeister’s results in his popular book “Thinking, Fast and Slow,” if he had seen these results. Maybe some readers of this blog also find these results informative. I know firsthand that at least some of my undergraduate students, who invested time and resources in studying psychology, find these results interesting and shocking.
It is nearly certain that I made some mistakes in the coding of Roy Baumeister’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential mistakes that would alter the main conclusions of this audit. However, control is better than trust, and everybody can audit this audit. The data are openly available, and everybody who has access to the original articles can do their own analysis of Roy Baumeister or any other author, including myself. The z-curve code is also openly available. Thus, I hope that this seminal and fully open publication of a replicability audit motivates other psychologists or researchers in other disciplines to conduct replicability audits. And last, but not least, I don’t hate Roy Baumeister; I love science.
14 thoughts on “Replicability Audit of Roy F. Baumeister”
“Trust is good, but control is better”
I love that you used this quote! I have been thinking about this exact quote a lot lately, especially in relation to “improvements” and “collaborative” and (what the “hip kids” call) “crowdsourcing” (or is it really “crowding out” or “outsourcing”??) efforts that have been proposed in the last few years.
I think it’s always important, and useful, with all these efforts to keep that quote in the back of your mind and try and see if 1) the proposed “improvement” is really an improvement, and why and how, and 2) if, and how this “improvement” can possibly be used for manipulation, abuse, and control, so you could/should try and prevent that and/or come up with better stuff.
Regarding Baumeister, I would just like to add that he was the 2013 recipient of the “William James Fellow Award,” which is supposed to reward “APS Members for their lifetime of significant intellectual contributions to the basic science of psychology”. Congratulations again concerning that award, Mr. Baumeister!
He won the award the day I presented at Purdue U and made a questionable comparison between Lance Armstrong and Roy Baumeister (they both used doping, many others used doping, but they were both still better than the rest; the important difference is that EPO was banned, while QRPs are not).
Perhaps the comparison between scientific (soft) fraud and doping in sports is not the most valid/useful/whatever. Or perhaps it is.
If I am not mistaken, there was/is a “code” in cycling called the “code of omerta” where nobody talked about doping but they all knew it was going on. Perhaps there was/is such a “code” in (social) psychology as well. If there is, perhaps it’s called the “code of the incentives” (cf. https://www.talyarkoni.org/blog/2018/10/02/no-its-not-the-incentives-its-you/)
If I think about it, there could at least be one major difference between (possible) “doping” in sport and psychological science. For instance, I don’t think doping in sports wastes tons of money or impacts society at large. But I reason scientific (soft) fraud does!?
Armstrong may have “cheated”, thereby preventing others from winning, but that didn’t impact society at large in any way really. Now compare this to cheaters in science. They may have “cheated”, thereby preventing others from “winning” (e.g. getting a job), but they also impacted society at large.
Imagine how many researchers may have tried to “directly” or “conceptually” replicate, or otherwise “build on”, work that was performed using (soft) fraudulent methods. This could very well be millions of tax-payer dollars down the drain, thereby wasting resources that could have otherwise gone to more useful and/or solid topics and smarter and/or more ethical scientists…
The doping analogy was first made in John et al. (2012), “science on steroids”.
I agree that cheating in science is morally worse than cheating in sports.