Since 2014, I have been posting statistical information about the replicability of researchers' published results (e.g., Schnall; Baumeister). I have used this information to explain why social psychology experiments often fail to replicate (Open Science Collaboration, 2015).
Some commentators have asked me to examine my own work, and I finally did. I used the new format of a replicability audit (Baumeister; Wilson). A replicability audit selects a researcher's most cited articles until the number of selected articles exceeds the citation count of the next article (the H-index). For each study, the most focal hypothesis test is selected. The test statistic is converted into a p-value and then into a z-score. The z-scores are analyzed with z-curve (Brunner & Schimmack, 2018) to estimate replicability.
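The conversion step can be sketched in a few lines of Python. The function name `t_to_z` is illustrative and not part of the z-curve software itself; the sketch assumes a t-statistic as input, but the same two-tailed p-value route works for F or chi-square tests.

```python
# Sketch of the test-statistic -> p-value -> z-score conversion that
# produces the input for z-curve. Illustrative only, not the z-curve API.
from scipy import stats

def t_to_z(t_value, df):
    """Convert a t-statistic into a two-tailed p-value and then into an
    absolute z-score on the standard-normal scale."""
    p = 2 * stats.t.sf(abs(t_value), df)  # two-tailed p-value
    z = stats.norm.isf(p / 2)             # back-transform to |z|
    return p, z

p, z = t_to_z(2.50, df=100)
print(round(p, 4), round(z, 2))
```

Because the t-distribution has fatter tails than the normal distribution, the resulting z-score is slightly smaller than the original t-value; the difference shrinks as the degrees of freedom increase.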
I used Web of Science to identify my most cited articles (datafile). I then selected empirical articles until the number of coded articles exceeded the citation count of the next article, resulting in 27 empirical articles (H-Index = 27). The 27 articles reported 64 studies (on average 2.4 studies per article). Of the 64 studies, 45 reported a hypothesis test. The total number of participants in these 45 studies was 350,148, with a median of 136 participants per statistical test. 42 of the 45 tests were significant at alpha = .05 (two-tailed). The remaining 3 results were interpreted as evidence using lower (marginal) standards of significance. Thus, counting these marginal results as successes, the success rate for the 45 reported hypothesis tests was 100%.
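The selection rule is just the standard H-index computation, which can be sketched as follows (the example citation counts are made up for illustration):

```python
# Minimal sketch of the article-selection rule: rank articles by citation
# count and keep them while rank <= citations (the H-index rule).
def h_index(citations):
    """Largest h such that h articles each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts: 4 articles have at least 4 citations.
print(h_index([50, 40, 33, 28, 2, 1]))  # -> 4
```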
The z-curve plot shows evidence of a file drawer. Counting marginally significant results, the observed success rate is 100%, but the power to produce significant results is estimated to be only 71%. Power for the set of significant results (excluding the marginally significant ones) is estimated to be 76%. The maximum false discovery rate is estimated to be 5%. Thus, even if some results would not replicate with the same sample size, studies with much larger sample sizes are expected to produce a significant result in the same direction.
These results are comparable to the results in cognitive psychology, where replicability estimates are around 80% and the maximum false discovery rate is also low. In contrast, results in experimental social psychology are considerably worse, with replicability estimates below 50%, both in statistical estimates with z-curve and in estimates based on actual replication studies (Open Science Collaboration, 2015). The reason is that experimental social psychologists conducted between-subject experiments with small samples. In contrast, most of my studies are correlational studies with large samples or experiments with within-subject designs and many repeated trials. These studies have high power and tend to replicate fairly well (Open Science Collaboration, 2015).
Actual replication studies have produced many replication failures and created the impression that results published in psychology journals are not credible. This is unfortunate because these replication projects have focused on between-subject paradigms in experimental social psychology. It is misleading to generalize these results to all areas of psychology.
Psychologists who want to demonstrate that their work is replicable do not have to wait for somebody to replicate their studies. They can conduct a self-audit using z-curve and demonstrate that their results differ from those in experimental social psychology. Even just reporting the number of observations rather than the number of participants may help to signal that a study had good power to produce a significant result. A within-subject study with 8 participants and 100 repetitions has 800 observations, which is 10 times more than the number of observations in the typical between-subject study with 80 participants and one observation per participant.
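The power advantage of repeated trials can be made concrete with a small simulation. All numbers below are illustrative assumptions, not estimates from any actual study: a modest trial-level effect, unit trial noise, and a small amount of person-to-person variability in the effect.

```python
# Minimal simulation sketch (assumed parameters, not from any study) comparing
# a between-subject design with 80 participants and one observation each to a
# within-subject design with 8 participants and 100 trials per condition.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def between_power(n_per_group=40, d=0.3, sims=2000):
    """Power of a two-sample t-test, one observation per participant."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(d, 1.0, n_per_group)
        hits += stats.ttest_ind(a, b).pvalue < 0.05
    return hits / sims

def within_power(n=8, trials=100, d=0.3, sd_person=0.1, sims=2000):
    """Power of a one-sample t-test on participant mean difference scores,
    each averaged over `trials` trials per condition."""
    hits = 0
    for _ in range(sims):
        person_d = rng.normal(d, sd_person, n)  # person-level true effects
        # averaging 100 trial differences shrinks the trial noise to sqrt(2/100)
        mean_diff = rng.normal(person_d, np.sqrt(2 / trials))
        hits += stats.ttest_1samp(mean_diff, 0.0).pvalue < 0.05
    return hits / sims

print(between_power(), within_power())
```

Under these assumptions, the within-subject design with 8 participants is far more likely to produce a significant result than the between-subject design with 80 participants, because averaging over 100 trials drastically reduces the noise in each participant's score.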