“Trust is good, but control is better”
Information about the replicability of published results is important because empirical results can only be used as evidence if the results can be replicated. However, the replicability of published results in social psychology is doubtful.
Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results would be if the studies were replicated exactly. In a replicability audit, I apply z-curve to the most cited articles of individual psychologists to estimate the replicability of their studies.
Norbert Schwarz is an eminent social psychologist (H-Index in WebofScience = 49).
He is best known for his influential article on “Mood as information” (Schwarz & Clore, 1983), which suggested that people use their momentary mood to judge their life-satisfaction. This claim has been challenged by life-satisfaction researchers (e.g., Eid & Diener, 2004), but until recently there were no major replication attempts of the study. Then Yap et al. (2018) published 9 studies that failed to replicate this famous finding.
In collaboration with Fritz Strack, Schwarz also published two articles that demonstrated strong item-order effects on life-satisfaction judgments. These studies were featured in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” (cf. Schimmack, 2018). However, these results have also failed to replicate in numerous studies (Schimmack & Oishi, 2005). Most recently, a large multi-lab replication project also failed to replicate the effect (ManyLabs2).
Schwarz is also known for developing a paradigm to show that people rely on the ease of recalling memories to make social judgments. Once more, a large replication study failed to replicate the result (cf. Schimmack, 2019).
Given this string of replication failures, it is of interest to see the average replicability of Schwarz’s published results.
I used WebofScience to identify the most cited articles by Norbert Schwarz (datafile). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 48 empirical articles (H-Index = 48).
Norbert Schwarz co-authored several articles with Lawrence Sanna, who resigned from his academic position amid allegations of data manipulation (Young, 2012). However, the articles co-authored with Norbert Schwarz have not been retracted and contribute to Schwarz’s H-Index. Therefore, I included these articles in the analysis.
The 48 articles reported 109 studies with a total of 28,606 participants and a median of 85 participants per study. For each study, I identified the most focal hypothesis test (MFHT); 4 studies did not report a focal test, and 1 study reported a failure to replicate a finding without a statistical result. The result of each remaining test was converted into an exact p-value, and the p-value was then converted into a z-score. The z-scores were submitted to a z-curve analysis to estimate the mean power of the 95 results that were significant at p < .05 (two-tailed). The remaining 9 results were interpreted as evidence with lower standards of significance. Thus, the success rate for the 104 reported hypothesis tests was 100%. This high success rate is common in psychology (Sterling, 1959).
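The p-to-z conversion described above can be sketched in a few lines of Python. This is a generic two-tailed conversion under a standard normal distribution, not the actual z-curve preprocessing code; the function name is mine.

```python
from statistics import NormalDist

def p_to_z(p_two_tailed: float) -> float:
    """Convert a two-tailed p-value into the absolute z-score
    that would produce it under a standard normal distribution."""
    return NormalDist().inv_cdf(1 - p_two_tailed / 2)

# A just-significant result (p = .05) corresponds to z ≈ 1.96;
# stronger results map to larger z-scores.
z_marginal = p_to_z(0.05)
z_strong = p_to_z(0.001)
```

Z-curve then models the distribution of these z-scores (truncated at the significance threshold) to estimate the mean power of the underlying studies.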
The z-curve estimate of replicability is 39% with a 95%CI ranging from 21% to 56%. Thus, z-curve predicts that only 39% of these studies would produce a significant result if they were replicated exactly. This estimate is consistent with the average for social psychology. However, actual replication attempts have an even lower success rate of 25% (Open Science Collaboration, 2015).
The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The area under the grey curve is an estimate of the file drawer of studies that are needed to produce the observed distribution of significant results. Approximately 3 unpublished studies with non-significant results are expected for each published significant result.
This estimate has important implications for the risk of false-positive results. A 5% significance level ensures that at most 5% of all conducted studies can produce a significant result when the null hypothesis is true (i.e., when the effect size is exactly zero or in the opposite direction). This information is useless when only significant results are published (Sterling, 1959). With an estimate of the file drawer, we see that about 400 studies were needed to produce 100 significant results. Thus, the real risk of false-positive results is 400 × 5% = 20 studies; that is, up to 20 of the 100 significant results could be false positives.
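The worst-case arithmetic behind this paragraph can be written out explicitly. The numbers below are the rounded figures from the text (100 published significant results, roughly 3 file-drawer studies per published result), not exact z-curve output.

```python
published_significant = 100   # rounded from the 95 significant results
file_drawer_ratio = 3         # ~3 unpublished non-significant studies per published result
alpha = 0.05                  # two-tailed significance level

# Total studies run = published results plus the inferred file drawer.
total_studies = published_significant * (1 + file_drawer_ratio)   # 400

# At most alpha of all conducted studies can be significant false positives.
max_false_positives = total_studies * alpha                       # 20
```

The point of the calculation is that the nominal 5% error rate applies to all studies conducted, so a large file drawer silently inflates the share of false positives among the published significant results.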
Z-curve also provides a second measure of the maximum number of false positives. The model is refitted to the data with fixed percentages of false positives; as long as these constrained models still fit the data, that percentage of false positives could have contributed to the significant results. This approach suggests that no more than 40% of the significant results are strictly false positives. Given the small number of studies, the estimate is not very precise (95%CI = 10%-70%).
Although the false-positive analyses suggest that most reported results are not strictly false positives, some of the true positives may have trivial effect sizes and be difficult to replicate. Z-curve provides information about which results are likely to replicate based on the strength of the evidence against the null hypothesis. The local power estimates below the x-axis show that z-scores between 2 and 2.5 have a mean power of only 29%. These results are least likely to replicate. Only z-scores greater than 3.5 reach a replicability of more than 50%. The datafile shows which studies fall into this category.
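To see why weakly significant results replicate so rarely, one can compute the naive power of an exact replication by treating the observed z-score as the true noncentrality parameter. This sketch (my own illustration, not z-curve itself) ignores selection bias, which is why z-curve's corrected local estimates, such as the 29% for z-scores between 2 and 2.5, are lower than the naive values.

```python
from statistics import NormalDist

def naive_replication_power(z_obs: float, alpha: float = 0.05) -> float:
    """Probability of a significant two-tailed result in an exact
    replication, assuming the observed z-score equals the true
    noncentrality (no correction for selection bias)."""
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    # Significant if the replication z falls beyond either critical value.
    return (1 - nd.cdf(crit - z_obs)) + nd.cdf(-crit - z_obs)

# A just-significant z = 1.96 has only ~50% naive power even before
# correcting for selection bias, which pushes estimates lower still.
```

Because published z-scores are selected for significance, they overestimate the true noncentrality on average; the naive power is therefore an upper bound on the actual replication probability.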
The analysis of Norbert Schwarz’s published results provides clear evidence that questionable research practices were used. The size of the file drawer suggests that up to 20% of the significant results could be false positives.
It is important to emphasize that Norbert Schwarz and colleagues followed accepted practices in social psychology and did nothing unethical by the lax standards of research ethics in psychology. That is, he did not commit research fraud.
The low average replicability is also consistent with estimates for social psychology, especially when the focus is on between-subject experiments.
It is nearly certain that I made some mistakes in the coding of Norbert Schwarz’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit. The data are openly available and the z-curve code is also openly available. Thus, this replicability audit is fully transparent and open to revision.
If you found this audit interesting, you might also be interested in other replicability audits (Replicability Audits).