Estimating the Replicability of Results in 'Psychological Science'

The University of Toronto has a Research Opportunity Program where undergraduate students get a credit to work with a professor on their research. For several years, some talented and courageous students have worked with me on replicability. This year’s students are Johanna Aminoff, Emma Bognar, Zoe Colclough, Leigh-Ann Lee, Richard Liu, Linh Pham, Emilie Stamelos, Amrish Wagle, and Jasmine Yiweza.

We picked the 2018 articles published in “Psychological Science” for hand-coding. Every student coded about 20 articles. Hand-coding original articles is very challenging for undergraduate students with basic or no training in statistics. But after a crash-course in statistics, they can handle most statistical tests and learn how to convert correlation coefficients or confidence intervals into t-values. This blog post reports our latest results and puts them in the context of other projects that have examined the replicability of results published in Psychological Science.
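To make the conversion step concrete, here is a minimal sketch of the kind of calculation the coders had to do. It is not our coding protocol. The function names are made up for illustration, and the confidence-interval conversion assumes a symmetric interval and uses a normal approximation, so it returns an approximate z-value rather than an exact t-value.

```python
import math
from statistics import NormalDist


def r_to_t(r, n):
    """t-value for a Pearson correlation r from n observations (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def ci_to_z(lower, upper, level=0.95):
    """Approximate test statistic from a symmetric confidence interval.

    Assumption: the interval is estimate +/- crit * SE with a normal
    critical value, which is only accurate for large samples.
    """
    estimate = (lower + upper) / 2
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    standard_error = (upper - lower) / (2 * crit)
    return estimate / standard_error


# Hypothetical inputs, not values taken from any coded article.
print(round(r_to_t(0.30, 102), 2))
print(round(ci_to_z(0.05, 0.45), 2))
```

A confidence interval whose lower bound sits exactly at zero converts to a test statistic equal to the critical value, which is a quick sanity check on the formula.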

Figure 1a (left) shows a z-curve (Brunner & Schimmack, 2019; Bartoš & Schimmack, 2019) analysis of 337 coded focal tests. The picture on the left assumes a simple selection model: all significant results have an equal chance of being published, while non-significant results may not be published. Fitting this model yields an expected replication rate (ERR) of 69% and an expected discovery rate (EDR) of 40%. The ERR is an estimate of the percentage of significant results that would be obtained if the published studies could be replicated exactly. As this is not possible, the expected discovery rate provides a more realistic estimate of the success rate in actual replication studies such as those carried out in the reproducibility project (Open Science Collaboration, 2015). The 40% estimate is not very encouraging, given the tremendous efforts to improve the credibility of results published in Psychological Science (Lindsay, 2019).
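The difference between the two rates can be illustrated with a toy calculation. This is not z-curve itself: z-curve has to estimate these quantities from the observed z-values, whereas the sketch below simply assumes that the true power of every study is known. The power values and function names are hypothetical, and the sketch ignores the detail that a replication should also have the same sign as the original result.

```python
def expected_discovery_rate(powers):
    """Mean power of all studies that were run, significant or not."""
    return sum(powers) / len(powers)


def expected_replication_rate(powers):
    """Mean power of the studies that produced a significant result.

    Under the simple selection model, a study ends up among the
    significant results with probability equal to its power, so each
    study's power is weighted by its power.
    """
    return sum(p * p for p in powers) / sum(powers)


def file_drawer_ratio(edr):
    """Non-significant results per significant result implied by an EDR."""
    return (1 - edr) / edr


# Hypothetical mix: half the studies have 20% power, half have 80% power.
powers = [0.2, 0.8]
print(expected_discovery_rate(powers))
print(expected_replication_rate(powers))
print(file_drawer_ratio(0.40))
```

Because selection for significance favors high-powered studies, the ERR is always at least as high as the EDR. An EDR of 40% implies roughly one and a half non-significant results for every significant one.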

A more optimistic picture emerges when we assume that some researchers continue to use questionable research practices that inflate effect sizes until significance is reached. This produces a pile of just significant p-values, which in turn produces low estimates of the EDR. To avoid this problem, z-curve can be fitted to more trustworthy results with smaller p-values. Figure 1b (right) used a criterion of z > 2.4 (two-sided p < .017) to fit z-curve. Now the estimates are 79% for ERR and 74% for EDR. The file drawer of non-significant results that are not reported is much smaller. However, the figure also shows that there are more just-significant results than the fitted model predicts. This suggests that replicability of results is generally high, but a small number of p-hacked studies account for the just significant results. Thus, it is reasonable to trust only results with p-values less than .005 (the green vertical line, z = 2.8) and to consider results with p-values between .05 and .005 as questionable.
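For readers who want to translate the z-value criteria into p-values, here is a small sketch using only the Python standard library. The function name is made up for illustration.

```python
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The significance criterion and the two cut-offs mentioned in the text.
for z in (1.96, 2.4, 2.8):
    print(z, round(two_sided_p(z), 4))
```

The value z = 2.8 lands very close to p = .005, which is why it serves as the cut-off for results that can be trusted.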

How good are undergraduate student coders? To examine this, we can compare the results to my own coding of studies in 2016. Figure 1a shows estimates of 58% and 34%. These estimates are slightly lower than the estimates obtained by students in 2018. There are two possible explanations for this: replicability actually improved, or there is a coder effect (I may have been biased to find low replicability).

Fortunately, we can test this by examining time trends using automated coding. However, first we can also examine the coding of Psychological Science studies from the years 2003, 2004, 2013, and 2014 by Motyl et al. (2017). The authors found no notable time effects, so I aggregated across years to obtain a sufficiently large set of studies.

In Figure 3a, the estimates of ERR and EDR are 55% and 31%, respectively. These estimates are very similar to my 2016 estimates, suggesting that my ratings are not influenced by a bias to show low replicability and that replicability had not improved much by the year 2016.

Hand-coding is subjective and time-consuming. I also developed software that automatically extracts test statistics from articles. This makes it possible to examine trends in replicability over time with an objective method that is not influenced by subjective coding decisions. Figure 4 shows the time trends for ERR (solid) and EDR (broken) using all significant results (black) and using only values greater than 2.4 (grey).
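To give a flavor of what automatic extraction involves, here is a toy sketch. It is not the software used for Figure 4. The pattern, the function name, and the example sentence are all made up for illustration, and the pattern only covers t- and F-tests written in the common "t(48) = 2.31" style with an ASCII minus sign.

```python
import re

# Matches e.g. "t(48) = 2.31" or "F(1, 96) = 7.80".
TEST_PATTERN = re.compile(
    r"\b(?P<kind>[tF])\s*\(\s*(?P<df1>\d+(?:\.\d+)?)\s*"
    r"(?:,\s*(?P<df2>\d+(?:\.\d+)?)\s*)?\)\s*=\s*(?P<value>-?\d*\.?\d+)"
)


def extract_tests(text):
    """Return a list of dicts, one per t- or F-test found in the text."""
    results = []
    for match in TEST_PATTERN.finditer(text):
        df2 = match.group("df2")
        results.append({
            "kind": match.group("kind"),
            "df1": float(match.group("df1")),
            "df2": float(df2) if df2 is not None else None,
            "value": float(match.group("value")),
        })
    return results


# Hypothetical sentence, not taken from any article.
sentence = "The effect was significant, t(48) = 2.31, p = .025, and F(1, 96) = 7.80."
for test in extract_tests(sentence):
    print(test)
```

Unlike hand-coding, a pattern like this picks up every reported test rather than only the focal one, which is the price paid for objectivity.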

The most notable finding is that the EDR estimates using all significant values are highly variable. The reason is that estimates of EDR are highly sensitive to the pile of just significant results. It is unlikely that file-drawer sizes vary so dramatically across time. It is more likely that questionable research practices make just-significant results unreliable and not very useful for estimation of replicability. The other three lines are more consistent and suggest that replicability in Psychological Science is rather high. The results also show that recent initiatives to improve replicability have produced an increase in replicability, at least when ERR is used to estimate it. These results also suggest that student codings accurately reflect an increase in replicability from 2016 to 2018.

The project is ongoing and it will be interesting to examine potential moderators of the general results. One moderator is field of study. The reproducibility project found higher replicability for cognitive articles than for social articles. Coding articles by field will allow us to examine this moderator with a much larger set of studies. Another interesting question is whether open science badges provide information about replicability. Sharing data or materials does not necessarily mean that researchers did not use questionable research practices (QRPs) or had high power to produce significant results. However, preregistration makes it harder to use QRPs. Preregistered studies with low power are therefore expected to produce non-significant results. However, most articles still publish only significant results. The interesting question is how preregistered studies with modest power can produce only significant results.
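The puzzle of all-significant results can be made concrete with a back-of-the-envelope calculation. The sketch assumes that the studies in an article are independent and share the same power; the numbers are hypothetical.

```python
def prob_all_significant(power, n_studies):
    """Chance that every one of n independent studies is significant."""
    return power ** n_studies


# Hypothetical article: four studies, each with 60% power.
print(round(prob_all_significant(0.60, 4), 4))
```

With modest power, an unbroken string of significant results is the exception rather than the rule, so a literature in which it is the norm calls for an explanation.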

The overall picture of the replicability of results published in Psychological Science is not terrible. This was already clear when the reproducibility project was able to replicate 50% of cognitive results. Thus, cognitive results, especially those with many repeated trials, are for the most part replicable and credible. At the same time, the new evidence shows that questionable research practices continue to be used in 2018. John et al. (2012) compared QRPs to doping in sports, and the effect of QRPs on the prestige of psychological science is the same. Students who learn about QRPs are disillusioned and don’t know which results they should trust. This negative effect affects all psychological scientists, even those who are working hard to improve the reputation of psychology. It is therefore time to crack down on bad actors who use QRPs to advance their own career at the expense of their colleagues and the reputation of the field. This blog post, and several others, show that we can reveal the use of QRPs and that we can quantify replicability. It is time to reward high replicability and to put psychological science on solid foundations. In other words, it is time to stop deceptive practices and to report results as they are, plain and simple.
