Please see the latest 2018 rankings. (LINK)
The table shows the preliminary 2017 rankings of 104 psychology journals. A description of the methodology and analyses by discipline and time are reported below the table.
Download PDF of this ggplot representation of the table courtesy of David Lovis-McMahon.
Introduction
I define replicability as the probability of obtaining a significant result in an exact replication of a study that produced a significant result. In the past five years, there have been concerns about a replication crisis in psychology. Even results that are replicated internally by the same author multiple times fail to replicate in independent replication attempts (Bem, 2011). The key reason for the replication crisis is selective publishing of significant results (publication bias). While journals report over 95% significant results (Sterling, 1959; Sterling et al., 1995), a 2015 article estimated that less than 50% of these results can be replicated (OSC, 2015).
The OSC reproducibility project made an important contribution by demonstrating that published results in psychology have low replicability. However, the reliance on actual replication studies has a number of limitations. Actual replication studies are expensive, time-consuming, and sometimes impossible (e.g., a longitudinal study spanning 20 years). This makes it difficult to rely on actual replication studies to assess the replicability of psychological results, to produce replicability rankings of journals, and to track replicability over time.
Schimmack and Brunner (2016) developed a statistical method (z-curve) that makes it possible to estimate average replicability for a set of published results based on the test-statistics reported in published articles. This statistical approach to the estimation of replicability has several advantages over the use of actual replication studies: (a) replicability can be assessed in real time, (b) it can be estimated for all published results rather than a small sample of studies, and (c) it can be applied to studies that are impossible to reproduce. Finally, actual replication studies can be criticized (Gilbert, King, Pettigrew, & Wilson, 2016), whereas estimates of replicability based on original studies do not have this problem because they are based on results reported in the original articles.
Z-curve has been validated with simulation studies and can be used with heterogeneous sets of studies that vary across statistical methods, sample sizes, and effect sizes (Brunner & Schimmack, 2016). I have applied this method to articles published in psychology journals to create replicability rankings of psychology journals in 2015 and 2016. This blog post presents preliminary rankings for 2017 based on articles that have been published so far. The rankings will be updated in 2018, when all 2017 articles are available.
For the 2016 rankings, I used z-curve to obtain annual replicability estimates for 103 journals from 2010 to 2016. Analyses of time trends showed no changes from 2010 to 2015. However, in 2016 there were first signs of an increase in replicability. Additional analyses suggested that social psychology journals contributed most to this trend. The preliminary 2017 rankings provide an opportunity to examine whether there is a reliable increase in replicability in psychology and whether such a trend is limited to social psychology.
Journals
Journals were mainly selected based on impact factor. Preliminary replicability rankings for 2017 are based on 104 journals. Several new journals were added to increase the number of journals specializing in five disciplines: social (24), cognitive (13), developmental (15), clinical/medical (18), and biological (13). The other 24 journals were broad journals (e.g., Psychological Science) or from other disciplines. More journals will be added to the final rankings for 2017.
Data Preparation
All PDF versions of published articles were downloaded and converted into text files using the conversion program pdfzilla. Text files were searched for reports of statistical results using a self-created R program. Only F-tests, t-tests, and z-tests were used for the rankings because they can be reliably extracted from diverse journals. t-values that were reported without df were treated as z-values, which leads to a slight inflation in replicability estimates. However, the bulk of test-statistics were F-values and t-values with degrees of freedom. Test-statistics were converted into exact p-values, and exact p-values were converted into absolute z-scores as a measure of the strength of evidence against the null-hypothesis.
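The extraction and conversion steps can be illustrated with a minimal Python sketch. It is not the R program used for the rankings: the regular expressions, the function names, and the sample sentence are assumptions made for illustration. The sketch finds APA-style reports of F, t, and z statistics and shows the final step, from a two-tailed p-value to an absolute z-score. Computing exact p-values for F-tests and t-tests with degrees of freedom requires the F and t distributions, which are not in the Python standard library and are left out here.

```python
import re
from statistics import NormalDist

# Hypothetical patterns for APA-style reports such as "F(1, 58) = 4.52".
NUM = r"(-?\d*\.?\d+)"
PATTERNS = [
    ("F", re.compile(r"\bF\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*" + NUM)),
    ("t", re.compile(r"\bt\s*\(\s*(\d+)\s*\)\s*=\s*" + NUM)),
    ("t_no_df", re.compile(r"\bt\s*=\s*" + NUM)),  # t reported without df
    ("z", re.compile(r"\bz\s*=\s*" + NUM, re.IGNORECASE)),
]

STD_NORMAL = NormalDist()


def extract_tests(text):
    """Return (kind, degrees_of_freedom, value) for every match in text."""
    found = []
    for kind, pattern in PATTERNS:
        for match in pattern.finditer(text):
            *dfs, value = match.groups()
            found.append((kind, tuple(int(df) for df in dfs), float(value)))
    return found


def two_tailed_p(z):
    """Exact two-tailed p-value of a z-statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def p_to_abs_z(p):
    """Absolute z-score that corresponds to a two-tailed p-value (0 < p < 1)."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


sample = "F(1, 58) = 4.52, p = .04; t(28) = 2.31; t = 2.10; z = 2.50."
for kind, dfs, value in extract_tests(sample):
    if kind in ("z", "t_no_df"):
        # As in the post, t-values without df are treated as z-values.
        print(kind, value, round(p_to_abs_z(two_tailed_p(value)), 2))
    else:
        print(kind, dfs, value, "(p-value needs the F or t distribution)")
```

For z-values, the round trip through the p-value returns the absolute value of the statistic. The p-value step matters for F-values and t-values, where it puts tests with different degrees of freedom on a common scale.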
Data Analysis
The data for each year were analyzed using z-curve (Schimmack & Brunner, 2016). Z-curve provides a replicability estimate. In addition, it generates a Powergraph. A Powergraph is essentially a histogram of absolute z-scores. Visual inspection of Powergraphs can be used to examine publication bias. A drop of z-values on the left side of the significance criterion (p < .05, two-tailed, z = 1.96) shows that non-significant results are underrepresented. A further drop may be visible at z = 1.65 because values between z = 1.65 and z = 1.96 are sometimes reported as marginally significant support for a hypothesis. The critical values z = 1.65 and z = 1.96 are marked by vertical red lines in the Powergraphs.
Replicability rankings rely only on statistically significant results (z > 1.96). The aim of z-curve is to estimate the average probability that an exact replication of a study that produced a significant result produces a significant result again. As replicability estimates rely only on significant results, journals are not being punished for publishing non-significant results. The key criterion is how strong the evidence against the null-hypothesis is when an article published results that led to the rejection of the null-hypothesis.
Statistically, replicability is the average statistical power of the set of studies that produced significant results. As power is the probability of obtaining a significant result, average power of the original studies is equivalent to average power of a set of exact replication studies. Thus, average power of the original studies is an estimate of replicability.
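The idea behind a Powergraph can be sketched as a text histogram in Python. This is not the code that produces the published Powergraphs: the bin width, the function names, and the z-scores are made up for illustration. Bins are anchored at z = 1.96 so that the significance criterion is a bin edge, which makes a drop in frequency just below the criterion easy to see.

```python
import math
from collections import Counter

CRIT = 1.96  # two-tailed significance criterion, p < .05


def powergraph_counts(abs_z_scores, width=0.25):
    """Count absolute z-scores in bins of equal width anchored at CRIT.

    Bin 0 covers [1.96, 1.96 + width), bin -1 covers [1.96 - width, 1.96).
    """
    return Counter(math.floor((z - CRIT) / width) for z in abs_z_scores)


def print_powergraph(counts, width=0.25):
    """Print one row per non-empty bin, marking the significance criterion."""
    for k in sorted(counts):
        lower = CRIT + k * width
        marker = " <- significance criterion" if k == 0 else ""
        print(f"{lower:5.2f} {'#' * counts[k]}{marker}")


# Hypothetical absolute z-scores, not data from any journal.
z_scores = [0.5, 1.2, 1.8, 2.0, 2.1, 2.2, 2.3, 2.6, 3.1, 4.0]
counts = powergraph_counts(z_scores)
print_powergraph(counts)
```

In this toy example, the bin just above the criterion holds three values and the bin just below it holds one, which is the kind of asymmetry that suggests non-significant results are missing.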
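The link between average power and replicability can be checked with a small simulation. The sketch below is not z-curve itself. It assumes that every study is a two-tailed z-test and that true noncentrality values are spread uniformly between 0 and 4, which is an arbitrary choice for illustration. It selects the studies that reached significance, computes their average true power, and compares it with the success rate of one exact replication per selected study.

```python
import random
from statistics import NormalDist

CRIT = 1.96
STD_NORMAL = NormalDist()


def power(ncp):
    """Two-tailed power of a z-test with true noncentrality ncp."""
    return (1 - STD_NORMAL.cdf(CRIT - ncp)) + STD_NORMAL.cdf(-CRIT - ncp)


def simulate(n_studies=50_000, seed=1):
    """Return (mean true power, replication rate) for significant studies."""
    rng = random.Random(seed)
    powers = []
    successes = 0
    for _ in range(n_studies):
        ncp = rng.uniform(0.0, 4.0)          # assumed heterogeneous true effects
        z_original = rng.gauss(ncp, 1.0)
        if abs(z_original) > CRIT:           # selection for significance
            powers.append(power(ncp))
            z_replication = rng.gauss(ncp, 1.0)  # exact replication
            successes += abs(z_replication) > CRIT
    return sum(powers) / len(powers), successes / len(powers)


mean_power, replication_rate = simulate()
print(f"mean power of significant studies: {mean_power:.3f}")
print(f"replication success rate:          {replication_rate:.3f}")
```

The two numbers agree up to sampling error. In real data the true power of each study is unknown, which is why z-curve has to estimate average power from the distribution of significant z-scores.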
Links to powergraphs for all journals and years are provided in the ranking table. These powergraphs provide additional information that is not used for the rankings. The only information that is being used is the replicability estimate based on the distribution of significant z-scores.
Results
The replicability estimates for each journal and year (104 * 8 = 832 data points) served as the raw data for the following statistical analyses. I fitted a growth model to examine time trends and variability across journals and disciplines using Mplus 7.4.
I compared several models. Model 1 assumed no mean level changes and stable variability across journals (significant variance in the intercept/trait). Model 2 assumed no change from 2010 to 2015 and allowed for mean level changes in 2016 and 2017 as well as stable differences between journals. Model 3 was identical to Model 2, but additionally allowed for random variability in the slope factor.
Model 1 did not have acceptable fit (RMSEA = .109, BIC = 5198). Model 2 improved fit (RMSEA = .063, BIC = 5176). Model 3 did not improve model fit (RMSEA = .063, BIC = 5180), the variance of the slope factor was not significant, and BIC favored the more parsimonious Model 2. The parameter estimates suggested that replicability estimates increased by 2 percentage points, from 72% in the years from 2010 to 2015 to 74% (z = 3.70, p < .001).
The standardized loadings of individual years on the latent intercept factor ranged from .57 to .61. Squaring these loadings (.57² ≈ .32 and .61² ≈ .37) implies that about one-third of the variance is stable, while the remaining two-thirds of the variance is due to fluctuations in estimates from year to year.
The average of 72% replicability is notably higher than the estimate of 62% reported in the 2016 rankings. The difference is due to a computational error in the 2016 rankings that affected mainly the absolute values, but not the relative ranking of journals. The R code for the 2016 rankings miscalculated the percentage of extreme z-scores (z > 6), which is used to adjust the z-curve estimate that is based on z-scores between 1.96 and 6, because all z-scores greater than 6 essentially have 100% power. For the 2016 rankings, I erroneously computed the percentage of extreme z-scores out of all z-scores rather than limiting it to the set of statistically significant results. This error became apparent during new simulation studies that produced wrong estimates.
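The effect of this error can be illustrated with a short Python sketch. The post does not show the actual R code or the exact adjustment formula, so the sketch assumes a simple weighted average: the z-curve estimate for z-scores between 1.96 and 6 is weighted by the share of those z-scores, and extreme z-scores enter with a power of 1. The function name, the z-scores, and the z-curve estimate of .60 are hypothetical.

```python
CRIT = 1.96
EXTREME = 6.0


def adjusted_estimate(abs_z_scores, zcurve_estimate, denominator="significant"):
    """Combine a z-curve estimate (1.96 < z <= 6) with extreme z-scores (z > 6).

    Assumed formula: (1 - w) * zcurve_estimate + w * 1.0, where w is the
    proportion of extreme z-scores. denominator="significant" is the corrected
    version; denominator="all" reproduces the error described in the post.
    """
    n_extreme = sum(z > EXTREME for z in abs_z_scores)
    if denominator == "significant":
        n_total = sum(z > CRIT for z in abs_z_scores)
    else:
        n_total = len(abs_z_scores)
    w = n_extreme / n_total
    return (1 - w) * zcurve_estimate + w * 1.0


# Hypothetical data: 10 z-scores, 7 of them significant, 2 of them extreme.
z_scores = [0.5, 1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 4.0, 7.0, 8.0]
print(adjusted_estimate(z_scores, 0.60, denominator="significant"))
print(adjusted_estimate(z_scores, 0.60, denominator="all"))
```

Dividing by all z-scores gives extreme z-scores too little weight, so the erroneous version produces the lower estimate. This matches the direction of the discrepancy between the 2016 and 2017 rankings.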
Although the previous analysis failed to find significant variability for the slope (change factor), this could be due to the low power of this statistical test. The next models included disciplines as predictors of the intercept (Model 4) or of the intercept and slope (Model 5). Model 4 had acceptable fit (RMSEA = .059, BIC = 5175). Model 5 improved fit according to RMSEA, although BIC favored the more parsimonious Model 4 (RMSEA = .036, BIC = 5178). Because BIC favors parsimony, its preference for the simpler model cannot be interpreted as evidence for the absence of an effect. Model 5 showed two significant (p < .05) effects for social and developmental psychology. In Model 6 I included only social and developmental psychology as predictors of the slope factor. BIC favored this model over the other models (RMSEA = .029, BIC = 5164). The model results showed improvements for social psychology (increase by 4.48 percentage points, z = 3.46, p = .001) and developmental psychology (increase by 3.25 percentage points, z = 2.65, p = .008). Whereas the improvement for social psychology was expected based on the 2016 results, the increase for developmental psychology was unexpected and requires replication in the 2018 rankings.
The only significant predictors for the intercept were social psychology (-4.92 percentage points, z = 4.12, p < .001) and cognitive psychology (+2.91, z = 2.15, p = .032). The strong negative effect (standardized effect size d = 1.14) for social psychology confirms earlier findings that social psychology was most strongly affected by the replication crisis (OSC, 2015). It is encouraging to see that social psychology is also the discipline with the strongest evidence for improvement in response to the replication crisis. With an increase by 4.48 points, replicability of social psychology is now at the same level as the other disciplines in psychology, except for cognitive psychology, which is still a bit more replicable than all other disciplines.
In conclusion, the results confirm that social psychology had lower replicability than other disciplines, but also show that social psychology has significantly improved in replicability over the past couple of years.
Analysis of Individual Journals
The next analysis examined changes in replicability at the level of individual journals. Replicability estimates were regressed on a dummy variable that contrasted 2010-2015 (0) with 2016-2017 (1). This analysis produced 10 significant increases with p < .01 (one-tailed), when only 1 out of 100 would be expected by chance.
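A regression on a single dummy variable can be sketched in a few lines of Python. The annual estimates below are invented for one hypothetical journal, and the post does not say how the one-tailed test was computed. The sketch assumes ordinary least squares with a t-test on the slope, and the critical value of 3.143 is the tabled one-tailed value for p = .01 with 6 degrees of freedom. With a 0/1 predictor, the slope is the difference between the mean of 2016-2017 and the mean of 2010-2015.

```python
import math

# Hypothetical annual replicability estimates (%) for one journal, 2010-2017.
estimates = [60, 62, 58, 61, 59, 60, 70, 72]
dummy = [0, 0, 0, 0, 0, 0, 1, 1]  # 0 = 2010-2015, 1 = 2016-2017


def dummy_regression(x, y):
    """Ordinary least squares of y on x; returns (intercept, slope, t)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(sse / (n - 2) / sxx)
    return intercept, slope, slope / standard_error


intercept, slope, t_value = dummy_regression(dummy, estimates)
T_CRIT_ONE_TAILED_01 = 3.143  # tabled value for df = 6, assumed test
print(f"2010-2015 mean: {intercept:.1f}, change: {slope:+.1f}, t = {t_value:.2f}")
print("significant increase:", t_value > T_CRIT_ONE_TAILED_01)
```

With only two observations in the later period, the test has little power for small changes, so the individual-journal results are best read alongside the discipline-level growth models.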
Five of the 10 journals (50% vs. 20% in the total set of journals) were from social psychology (SPPS + 13, JESP + 11, JPSP-IRGP + 11, PSPB + 10, Sex Roles + 8). The remaining journals were from developmental psychology (European J. Dev. Psy + 17, J Cog. Dev. + 9), clinical psychology (J. Cons. & Clinical Psy + 8, J. Autism and Dev. Disorders + 6), and the Journal of Applied Psychology (+7). The high proportion of social psychology journals provides further evidence that social psychology has responded most strongly to the replication crisis.
Limitations
Although z-curve provides very good absolute estimates of replicability in simulation studies, the absolute values in the rankings have to be interpreted with a big grain of salt for several reasons. Most important, the rankings are based on all test-statistics that were reported in an article. Only a few of these statistics test theoretically important hypotheses. Others may be manipulation checks or other incidental analyses. For the OSC (2015) studies, the replicability estimate was 69% when the actual success rate was only 37%. Moreover, comparisons of the automated extraction method used for the rankings and hand-coding of focal hypotheses in the same articles also show a difference of 20 percentage points. Thus, a posted replicability of 70% may imply only 50% replicability for a critical hypothesis test. Second, the estimates are based on the ideal assumptions underlying statistical test distributions. Violations of these assumptions (e.g., outliers) are likely to reduce actual replicability. Third, actual replication studies are never exact replication studies, and minor differences between the studies are also likely to reduce replicability. There are currently not sufficient actual replication studies to correct for these factors, but the average is likely to be less than 72%. It is also likely to be higher than 37% because this estimate is heavily influenced by social psychology, while cognitive psychology had a success rate of 50%. Thus, a plausible range of the typical replicability of psychology is somewhere between 40% and 60%. We might say the glass is half full and half empty, while there is systematic variation around this average across journals.
Conclusion
Cohen (1962) pointed out 55 years ago that psychologists conduct many studies that produce non-significant results (type-II errors). For decades there was no sign of improvement. The preliminary rankings of 2017 provide the first empirical evidence that psychologists are waking up to the replication crisis caused by selective reporting of significant results from underpowered studies. Right now, social psychologists appear to respond most strongly to concerns about replicability. However, it is possible that other disciplines will follow in the future as the open science movement is gaining momentum. Hopefully, replicability rankings can provide an incentive to consider replicability as one of several criteria for publication. A study with z = 2.20 and another study with z = 3.85 are both significant (z > 1.96), but a study with z = 3.85 has a higher chance of being replicable. Everything else being equal, editors should favor studies with stronger evidence; that is, higher z-scores (a.k.a. lower p-values). By taking the strength of evidence into account, psychologists can move away from treating all significant results (p < .05) as equal and take type-II errors and power into account.
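The difference between z = 2.20 and z = 3.85 can be made concrete with a back-of-the-envelope calculation in Python. The sketch assumes that the observed z-score equals the true noncentrality of the study. This assumption is optimistic because selection for significance inflates observed z-scores, and correcting for that inflation is what z-curve is for. The function name is hypothetical.

```python
from statistics import NormalDist

CRIT = 1.96
STD_NORMAL = NormalDist()


def naive_replication_probability(observed_z):
    """Two-tailed power if the true noncentrality equaled the observed z.

    Optimistic upper-end guess: observed z-scores of published studies
    tend to overestimate the true strength of evidence.
    """
    ncp = abs(observed_z)
    return (1 - STD_NORMAL.cdf(CRIT - ncp)) + STD_NORMAL.cdf(-CRIT - ncp)


for z in (2.20, 3.85):
    print(f"z = {z:.2f}: naive replication probability "
          f"{naive_replication_probability(z):.2f}")
```

Under this assumption, the study with z = 2.20 replicates roughly 6 times out of 10, while the study with z = 3.85 replicates nearly every time.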
Why not give standard errors on the estimates? Or at least the number of tests you base them on. Come on, try a little harder to be scientific about this. I agree with your mission, but this makes it really easy to dismiss you.
The number of cases is included in the powergraphs when you click on the journal titles. Bootstrapped 95% CIs will be added to the powergraphs when the final data are available.
Good! I’d prefer it were machine-readable information though.
Contact me if you are interested in analyzing the data.