A recent article in the flashy journal “Nature Human Behaviour” (which charges authors or their universities $6,000 per article) published the claim that “high replicability of newly discovered social-behavioural findings is achievable” (Protzko et al., 2023). This is good news for social scientists and consumers of social psychology after a decade of replication failures caused by questionable research practices, including fraud.
So, what is the magic formula to produce replicable and credible findings in the social sciences?
The paper attributes success to the implementation of four rigour-enhancing practices, namely confirmatory tests, large sample sizes, preregistration, and methodological transparency. The problem with this multi-pronged approach is that it is not possible to say which of these practices are necessary or sufficient to produce replicable results.
I analyze the results of this article with the R-Index. Based on these results, I conclude that none of the four rigour-enhancing practices are necessary to produce highly replicable results. The key ingredients for high replicability are honesty and high power. It is a mistake to equate large samples (N = 1,500) with high power. As shown below, N = 1,500 sometimes yields low power, and sometimes much smaller samples are sufficient for high power.
The article reports 16 studies. Each study was proposed by one lab, which ran a confirmatory test; these confirmatory tests produced significant results in 15 of the 16 studies. The replication studies by the other three labs produced significant results in 79% of the studies.
I predicted these replication outcomes with the Replicability-Index (R-Index). The R-Index is a simple method to estimate replicability for a small set of studies. The key insight of the R-Index is that the outcome of unbiased replication studies is a function of the mean (I once assumed the median would be better, but this was wrong) power of the original studies (Brunner & Schimmack, 2021). Unfortunately, it can be difficult to estimate the true mean power from original studies because original studies are often selected for significance, and selection for significance leads to inflated estimates of observed power. The R-Index adjusts for this inflation by comparing the success rate (percentage of significant results) to the mean observed power. If the success rate is higher than the mean observed power, selection bias is present and the mean observed power is inflated. A simple heuristic to correct for this inflation is to subtract the inflation from the mean observed power.
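The heuristic can be sketched in a few lines of code. This is an illustrative implementation under my own naming, not code from the article: it converts two-sided p-values to z-scores, computes each study's observed power against the .05 criterion, and subtracts the inflation (success rate minus mean observed power) when selection bias is present.

```python
from scipy.stats import norm

def r_index(p_values, alpha=0.05):
    """R-Index heuristic: mean observed power minus inflation.

    Illustrative sketch; function name and interface are my own.
    """
    z = [norm.isf(p / 2) for p in p_values]            # two-sided p -> z
    z_crit = norm.isf(alpha / 2)                       # 1.96 for alpha = .05
    obs_power = [norm.sf(z_crit - zi) for zi in z]     # observed power per study
    mean_power = sum(obs_power) / len(obs_power)
    success_rate = sum(p < alpha for p in p_values) / len(p_values)
    inflation = max(success_rate - mean_power, 0)      # evidence of selection bias
    return mean_power - inflation
```

For example, a set of studies that all report p = .01 has a success rate of 100% but a mean observed power of only about 73%, so the R-Index deflates the estimate accordingly.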
The article reported the outcomes of “original” (blue = self-replication) and replication studies (green = independent replications by other labs) in Figure 1.
To obtain estimates of observed power, I used the point estimates of the original studies and the lower limit of the 95%CI. I converted these statistics into z-scores, using the formula z = ES / ((ES – LL.CI)/2). The z-scores were converted into p-values, and p-values below .05 were considered significant. Visual inspection of Figure 1 shows that one original study (blue) did not have a statistically significant result (i.e., the 95%CI includes zero). Thus, the actual success rate was 15/16 = 94%.
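This conversion can be sketched as follows. The function names are mine, and the divisor of 2 follows the approximation above of recovering the standard error from the 95%CI (a more precise divisor would be 1.96):

```python
from scipy.stats import norm

def z_from_ci(es, ll):
    """Approximate z-score from a point estimate (es) and the lower
    limit (ll) of its 95% CI: SE ~ (es - ll) / 2, so z ~ es / SE."""
    se = (es - ll) / 2
    return es / se

def is_significant(es, ll, alpha=0.05):
    """Two-sided significance test based on the approximate z-score."""
    p = 2 * norm.sf(abs(z_from_ci(es, ll)))
    return p < alpha
```

A CI lower limit at or below zero implies z at or below 2, which is (approximately) non-significant at the .05 level; this is why the one original study whose CI includes zero counts as a failure.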
Table 1 shows that the mean observed power is 87%. Thus, there is evidence of a small amount of selection for significance, and the predicted success rate of replication studies is .87 – .06 = .81. The actual success rate was computed as the percentage of replication studies (k = 3 per original study) that produced a significant result. The overall success rate of the replication studies was 79%, which is close to the R-Index estimate of 81%. Finally, it is evident that power varies across studies. Nine studies had z-scores greater than 5 (the 5-sigma rule of particle physics), and all nine of these studies had a replication success rate of 100%. The only reasons for replication failures of studies with z-scores greater than 5 are fraud or problems in the implementation of the actual replication study. In contrast, studies with z-scores below 4 have insufficient power to produce consistently significant results. The correlation between observed power and replication success rates is r = .93. This finding demonstrates empirically that power determines the outcome of unbiased replication studies.
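The link between a study's z-score and its expected replication rate can be made concrete with a small sketch (my own helper, under the simplifying assumption that the observed z-score equals the true noncentrality of an exact replication):

```python
from scipy.stats import norm

def observed_power(z, alpha=0.05):
    """Probability that an exact replication is significant at alpha,
    assuming the observed z equals the true noncentrality."""
    z_crit = norm.isf(alpha / 2)   # 1.96 for alpha = .05
    return norm.sf(z_crit - z)
```

Under this assumption, z = 5 implies a replication probability above 99.8%, while z = 2.5 implies only about 70%, which is consistent with the pattern that only the studies above the 5-sigma threshold replicated every time.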
Honest reporting of results is necessary to trust published results. Open Science practices may help to ensure that results are reported honestly. This is particularly valuable for the evaluation of a single study. However, statistical tools like the R-Index can be used to examine whether a set of studies is unbiased or whether the results are biased. In the present set of 16 original studies, the R-Index detected a small bias that explains the difference between the success rate of the original studies (blue, 94%) and that of the replication studies (green, 79%).
More importantly, the investigation of power shows that some of the studies were underpowered to reject the nil-hypothesis even with N = 1,500 because the real effect sizes were too close to zero. This shows how difficult it is to provide evidence for the absence of an important effect.
At the same time, other studies had large effect sizes and were dramatically overpowered to demonstrate an effect. As shown, z-scores of 5 are sufficient to provide conclusive evidence against a nil-hypothesis, and this criterion is used in particle physics for strong hypothesis tests. Using N = 1,500 for an effect size of d = .6 is overkill. This means that researchers who cannot easily collect data from large samples can still produce credible results. There are also methods other than increasing sample size to reduce sampling error and increase power. Within-subject designs with many repeated trials can produce credible and replicable results with sample sizes as small as N = 8. Sample size alone should not be used as a criterion to evaluate studies, and large samples should not be treated as a hallmark of good science.
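A quick power calculation illustrates why N = 1,500 is overkill for d = .6. This is my own sketch using the standard normal approximation for a two-sample comparison, not a computation from the article:

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sample comparison of means with
    n_per_group participants per group and standardized effect size d,
    using the normal approximation to the t-test."""
    z_crit = norm.isf(alpha / 2)           # 1.96 for alpha = .05
    ncp = d * sqrt(n_per_group / 2)        # noncentrality parameter
    return norm.sf(z_crit - ncp)
```

With 750 participants per group (N = 1,500 total) and d = .6, power is essentially 1; about 60 participants per group already yields roughly 90% power.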
To evaluate the credibility of results in single studies, it is useful to examine confidence intervals and to see which effect sizes are excluded by the lower limit of the confidence interval. Confidence intervals that exclude zero but not values close to zero suggest that a study was underpowered and that the true population effect size may be so close to zero that it is practically zero. In addition, p-values or z-scores provide valuable information about replicability. Results with z-scores greater than 5 are extremely likely to replicate in an exact replication study, and replication failures of such results suggest a significant moderating factor.
Finally, the present results suggest that other aspects of open science, like preregistration, are not necessary to produce highly replicable results. Even exploratory analyses that produce strong evidence (z > 5) are likely to replicate. The reason is that neither luck nor extreme p-hacking produces such extreme evidence against the null-hypothesis. A better understanding of the strength of evidence may help to produce credible results without wasting precious resources on unnecessarily large samples.