“Trust is good, but control is better”
Information about the replicability of published results is important because empirical results can only be used as evidence if the results can be replicated. However, the replicability of published results in social psychology is doubtful.
Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results is if the studies were replicated exactly. In a replicability audit, I apply z-curve to the most cited articles of psychologists to estimate the replicability of their studies.
Fritz Strack is an eminent social psychologist (H-Index in WebofScience = 51).
Fritz Strack also made two contributions to meta-psychology.
First, he volunteered his facial-feedback study for a registered replication report, a major effort to replicate a published result across many labs. The study failed to replicate the original finding. In response, Fritz Strack argued that the replication study introduced cameras as a confound or that the replication team actively tried to find no effect (reverse p-hacking).
Second, Strack co-authored an article that tried to explain replication failures as a result of problems with direct replication studies (Strack & Stroebe, 2014). This is a concern when replicability is examined with actual replication studies. However, this concern does not apply when replicability is examined on the basis of test statistics published in original articles. Using z-curve, we can estimate how replicable these studies would be if they could be replicated exactly, even if exact replication is not possible in practice.
Given Fritz Strack’s skepticism about the value of actual replication studies, he may be particularly interested in estimates based on his own published results.
I used WebofScience to identify the most cited articles by Fritz Strack (datafile). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 42 empirical articles (H-Index = 42). The 42 articles reported 117 studies (average 2.8 studies per article). The total number of participants was 8,029 with a median of 55 participants per study. For each study, I identified the most focal hypothesis test (MFHT). The result of the test was converted into an exact p-value and the p-value was then converted into a z-score. The z-scores were submitted to a z-curve analysis to estimate mean power of the 114 results that were significant at p < .05 (two-tailed). Three studies did not test a hypothesis or predicted a non-significant result. The remaining 11 results were interpreted as evidence with lower standards of significance. Thus, the success rate for 114 reported hypothesis tests was 100%.
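The last conversion step described above, from an exact two-tailed p-value to a z-score, can be sketched in a few lines of Python. This is only an illustration using the standard library, not the z-curve code itself; the function name `p_to_z` and the example p-values are hypothetical and are not taken from the coded articles.

```python
from statistics import NormalDist


def p_to_z(p_two_tailed: float) -> float:
    """Convert an exact two-tailed p-value into an absolute z-score.

    Hypothetical helper for illustration; not part of the z-curve code.
    """
    if not 0.0 < p_two_tailed < 1.0:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # A two-tailed p-value places p/2 in each tail of the standard normal.
    # Taking the lower-tail quantile and flipping its sign gives the
    # absolute z-score and stays accurate for very small p-values.
    return -NormalDist().inv_cdf(p_two_tailed / 2.0)


# Illustrative p-values only (assumed, not from the data file).
for p in (0.05, 0.01, 0.001):
    print(f"p = {p} -> z = {p_to_z(p):.2f}")
```

Working with absolute z-scores puts results from different tests (t, F, chi-square) on a common scale, which is what allows them to be combined in a single histogram.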
The z-curve estimate of replicability is 38% with a 95%CI ranging from 26% to 51%. The complementary interpretation of this result is that the actual type-II error rate is 62% compared to the 0% failure rate in the published articles.
The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The area under the grey curve is an estimate of the file drawer of studies that need to be conducted to achieve 100% successes with 38% average power. Although this is just a projection, the figure makes it clear that Strack and collaborators used questionable research practices to report only significant results.
Z-curve is under development and offers additional information beyond the replicability of significant results. One new feature is an estimate of the maximum number of false positive results. The maximum percentage of false positive results is estimated to be 35% (95%CI = 10% to 73%). Given the relatively small number of studies, the estimate is not very precise and the upper limit goes as high as 73%. It is unlikely that the percentage of false positives is actually that high, but the point of empirical research is to reduce the risk of false positives to an acceptable level of 5%. Thus, the actual risk is unacceptably high.
Based on the low overall replicability it would be difficult to identify results that provided credible evidence. However, replicability varies with the strength of evidence against the null-hypothesis; that is, with increasing z-values on the x-axis. Z-curve provides estimates of replicability for different segments of tests. For just significant results with z-scores from 2 to 2.5 (~ p < .05 & p > .01), replicability is just 23%. These studies can be considered preliminary and require verification with confirmatory studies that need much larger samples to have sufficient power to detect an effect (I would not call these studies mere replication studies because the outcome of these studies is uncertain). For z-scores between 2.5 and 3, replicability is still below average with 28%. The nominal type-I error probability of .05 is reached when mean power is above 50%. This is the case only for z-scores greater than 4.0. Thus, after correcting for the use of questionable research practices, only p-values less than 0.00005 allow rejecting the null-hypothesis with a 5% false positive criterion. Only 11 results meet this criterion (see data file for the actual studies and hypothesis tests).
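To see how the z-score segments above map onto familiar p-value thresholds, the reverse conversion can be sketched as follows. This is a minimal illustration with the Python standard library; the helper name `z_to_p` is hypothetical and the code is not part of z-curve. It only translates segment boundaries into two-tailed p-values and does not reproduce the replicability estimates themselves.

```python
from statistics import NormalDist


def z_to_p(z: float) -> float:
    """Two-tailed p-value for an absolute z-score (hypothetical helper)."""
    # Both tails beyond |z| count as evidence against the null-hypothesis,
    # so the lower-tail probability at -|z| is doubled.
    return 2.0 * NormalDist().cdf(-abs(z))


# Boundaries of the segments discussed in the text.
for z in (2.0, 2.5, 3.0, 4.0):
    print(f"z = {z:.1f} -> two-tailed p = {z_to_p(z):.6f}")
```

Under the normal approximation, z-scores of 2 and 2.5 bracket the range between roughly p = .05 and p = .01, and a z-score of 4 corresponds to a p-value on the order of the .00005 threshold quoted above.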
The analysis of Fritz Strack’s published results provides clear evidence that questionable research practices were used and that published significant results could be false positives in two ways. First, the risk of a classic type-I error is much higher than 5%. Second, many results are false positives in the sense that they do not meet a corrected level of significance that takes selection for significance into account.
It is important to emphasize that Fritz Strack and colleagues followed accepted practices in social psychology and did nothing unethical by the lax standards of research ethics in psychology. That is, they did not commit research fraud. Moreover, non-significant results in replication studies do not mean that the theoretical predictions are wrong. They merely mean that the published results provide insufficient evidence for the empirical claim.
It is nearly certain that I made some mistakes in the coding of Fritz Strack’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit. The data are openly available and the z-curve code is also openly available. Thus, this replicability audit is fully transparent and open to revision.