Estimating Replicability in the “British Journal of Social Psychology”


There is a replication crisis in social psychology (see Schimmack, 2020, for a review). One major cause of the replication crisis is selection for statistical significance. Researchers conduct many studies with low power, but only the significant results get published. As these results are only significant with the help of sampling error, replication studies often fail to reproduce the significant result. Awareness of these problems has led some journal editors to change submission guidelines in the hope of attracting more replicable results. As replicability depends on power, successful reforms should increase the mean power of published statistical tests. This can be tested by estimating the mean power before and after selection for significance (Bartos & Schimmack, 2020; Brunner & Schimmack, 2019).

In 2017, John Drury and Hanna Zagefka took over as editors of the “British Journal of Social Psychology” (BJSP). Their editorial directly addresses the replication crisis in social psychology.

“A third small change has to do with the continuing crisis in social psychology (especially in quantitative experimental social psychology). We see the mission of social psychology to be to make sense of our social world, in a way which is necessarily selective and has subjective aspects (such as choice of topic and motivation for the research). This sense-making, however, must not entail deliberate distortions, fabrications, and falsifications. It seems apparent to us that the fundamental causes of the growth of data fraud, selective reporting of results and other issues of trust we now face are the institutional pressures to publish and the related reward structure of academic career progression. These factors need to be addressed.”

In response to this analysis of problems in the field, they introduced new submission guidelines.

“Current debate demonstrates that there is a considerable grey area when deciding which methodological choices are defensible and which ones are not. Clear guidelines are therefore essential. We have added to the submission portal a set of statements to which authors respond in relation to determining sample size, criteria for data exclusion, and reporting of all manipulations, conditions, and measures. We will also encourage authors to share their data with interested parties upon request. These responses will help authors understand what is considered acceptable, and they will help associate editors judge the scientific soundness of the work presented.”

In this blog post, I examine the replicability of results published in BJSP and whether the changes in submission guidelines have increased it. To do so, I downloaded articles from 2000 to 2019 and automatically extracted test statistics (t-values, F-values) from those articles. All test statistics were converted into absolute z-scores. Higher z-scores provide stronger evidence against the nil-hypothesis. I then submitted the 8,605 z-scores to a z-curve analysis. Figure 1 shows the results.
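The conversion into absolute z-scores can be sketched as follows. This is only an illustration, not the extraction code used for this post: it assumes each t- or F-value has already been turned into a two-sided p-value, and the function name p_to_abs_z is made up for the example.

```python
from statistics import NormalDist


def p_to_abs_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value into an absolute z-score.

    Assumption: the p-value was computed beforehand from the reported
    t- or F-value and its degrees of freedom.
    """
    if not 0.0 < p_two_sided < 1.0:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # Half of the p-value sits in each tail of the standard normal.
    return NormalDist().inv_cdf(1.0 - p_two_sided / 2.0)


# Two thresholds that appear in this post: .05 and .005 (two-sided).
for p in (0.05, 0.005):
    print(f"p = {p} -> |z| = {p_to_abs_z(p):.2f}")
```

A p-value of .05 maps onto a z-score of about 1.96, and a p-value of .005 onto a z-score of about 2.81, so smaller p-values end up further to the right in a z-curve plot.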

First, visual inspection shows a clear drop around z = 1.96. This value corresponds to the typical significance criterion of .05 (two-sided). This drop shows the influence of selectively publishing significant results. A quantitative test of selection can be made by comparing the observed discovery rate to the expected discovery rate. The observed discovery rate (ODR) is the percentage of reported results that are significant, 70%, 95%CI = 69% to 71%. The expected discovery rate (EDR) is estimated by z-curve on the basis of the distribution of the significant results (grey curve). The EDR is lower, 46%, and the 95%CI, 25% to 57%, does not include the ODR. Thus, there is clear evidence that results in BJSP are biased towards significant results.
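The ODR and its confidence interval are simple to compute from a list of z-scores. The sketch below assumes a normal-approximation (Wald) interval, which may differ from the interval method used by the z-curve software. The z-scores are placeholders, constructed so that about 70% of 8,605 values exceed the criterion; they are not the BJSP data.

```python
from math import sqrt
from statistics import NormalDist


def observed_discovery_rate(z_scores, alpha=0.05, level=0.95):
    """Return the ODR and a normal-approximation confidence interval."""
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = .05
    n = len(z_scores)
    n_significant = sum(abs(z) > crit for z in z_scores)
    odr = n_significant / n
    # Wald interval: an assumption of this sketch, not the z-curve method.
    half_width = NormalDist().inv_cdf(0.5 + level / 2.0) * sqrt(odr * (1.0 - odr) / n)
    return odr, odr - half_width, odr + half_width


# Placeholder data: 6,024 significant and 2,581 non-significant z-scores.
toy_z_scores = [2.5] * 6024 + [1.0] * 2581
odr, lower, upper = observed_discovery_rate(toy_z_scores)
print(f"ODR = {odr:.0%}, 95%CI = {lower:.0%} to {upper:.0%}")
```

With this many test statistics the interval is narrow, which is why the comparison between the ODR and the EDR depends mostly on the much wider confidence interval of the EDR.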

Z-curve also estimates the replicability of significant results. The expected replication rate (ERR) is the percentage of significant results that is expected in exact replication studies. The ERR is 68%, with a 95%CI ranging from 68% to 73%. This is not a bad replication rate, but there are two caveats. First, automatic extraction does not distinguish theoretically important focal tests from other tests such as manipulation checks. A comparison of automated extraction and hand-coding shows that replication rates for focal tests are lower than the ERR of automated extraction (cf. analysis of JESP). The results for BJSP are slightly better than the results for JESP (ERR: 68% vs. 63%; EDR: 46% vs. 35%), but the differences are not statistically significant (confidence intervals overlap). Hand-coding of JESP articles produces an ERR of 39% and an EDR of 12%. Thus, the overall analysis of BJSP suggests that replication rates for actual replication studies are similar to social psychology in general. Second, actual replication studies are rarely exact, and the Open Science Collaboration found that only 25% of social psychology results could be replicated.
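To make the difference between the EDR and the ERR concrete, here is a small sketch that computes both for a known mixture of studies. It is not the z-curve estimation algorithm, which has to infer such a mixture from the observed significant z-scores. The weights and mean z-scores are hypothetical, and the sketch ignores whether a replication is significant in the same direction as the original result.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(mean_z, alpha=0.05):
    """Power of a two-sided z-test when the true mean of z is mean_z."""
    crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return (1.0 - STD_NORMAL.cdf(crit - mean_z)) + STD_NORMAL.cdf(-crit - mean_z)


def edr_and_err(components, alpha=0.05):
    """components: list of (weight, mean_z) pairs describing all studies
    that were conducted, before selection for significance."""
    total_weight = sum(weight for weight, _ in components)
    powers = [(weight / total_weight, power_two_sided(mean_z, alpha))
              for weight, mean_z in components]
    # EDR: mean power of all studies, significant or not.
    edr = sum(w * p for w, p in powers)
    # ERR: mean power of the studies that reached significance, so each
    # study is weighted by its own probability of being selected.
    err = sum(w * p * p for w, p in powers) / edr
    return edr, err


# Hypothetical mixture: half true nulls, some modest and some strong effects.
edr, err = edr_and_err([(0.5, 0.0), (0.3, 2.0), (0.2, 4.0)])
print(f"EDR = {edr:.0%}, ERR = {err:.0%}")
```

In this toy mixture the ERR comes out higher than the EDR, because studies with high power are more likely to produce a significant result and are therefore overrepresented among the significant results. The two estimates coincide only when all studies have the same power.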

Figure 2 examines time-trends by computing the ERR and EDR for each year. It also computes the ERR (solid) and EDR (dotted) in analyses that are limited to p-values smaller than .005 (grey), which are less likely to be produced by questionable practices. The EDR estimates are highly variable because they are very sensitive to the number of just significant p-values. The ERR estimates are more stable. Importantly, none of them show a significant trend over time. Visual inspection also suggests that editorial changes in 2017 haven’t yet produced changes in published results in 2018 or 2019.

Given concerns about questionable practices and low replicability in social psychology, readers should be cautious about empirical claims, especially when they are based on just-significant results. P-values should be at least below .005 to be considered empirical evidence.
