Category Archives: Replicability-Ranking

“Psychological Science” in 2020

Psychological Science is the flagship journal of the Association for Psychological Science (APS). In response to the replication crisis, D. Stephen Lindsay worked hard to increase the credibility of results published in this journal as editor from 2014-2019 (Schimmack, 2020). This work paid off and meta-scientific evidence shows that publication bias decreased and replicability increased (Schimmack, 2020). In the replicability rankings, Psychological Science is one of a few journals that show reliable improvement over the past decade (Schimmack, 2020).

This year, Patricia J. Bauer took over as editor. Some meta-psychologists were concerned that replicability might be less of a priority because she did not embrace initiatives like preregistration (New Psychological Science Editor Plans to Further Expand the Journal’s Reach).

The good news is that these concerns were unfounded. The meta-scientific criteria of credibility did not change notably from 2019 to 2020.

The observed discovery rates were 64% in 2019 and 66% in 2020. The estimated discovery rates were 58% in 2019 and 59%, respectively. Visual inspection of the z-curves and the slightly higher ODR than EDR suggests that there is still some selection for significant result. That is, researchers use so-called questionable research practices to produce statistically significant results. However, the magnitude of these questionable research practices is small and much lower than in 2010 (ODR = 77%, EDR = 38%).

Based on the EDR, it is possible to estimate the maximum false discovery rate (i.e., the percentage of significant results where the null-hypothesis is true). This rate is low with 4% in both years. Even the upper limit of the 95%CI is only 12%. This contradicts the widespread concern that most published (significant) results are false (Ioannidis, 2005).

The expected replication rate is slightly, but not significantly (i.e., it could be just sampling error) lower in 2020 (76% vs. 83%). Given the small risk of a false positive result, this means that on average significant results were obtained with the recommended power of 80% (Cohen, 1988).

Overall, these results suggest that published results in Psychological Science are credible and replicable. However, this positive evaluations comes with a few caveats.

First, null-hypothesis significance testing can only provide information that there is an effect and the direction of the effect. It cannot provide information about the effect size. Moreover, it is not possible to use the point estimates of effect sizes in small samples to draw inferences about the actual population effect size. Often the 95% confidence interval will include small effect sizes that may have no practical significance. Readers should clearly evaluate the lower limit of the 95%CI to examine whether a practically significant effect was demonstrated.

Second, the replicability estimate of 80% is an average. The average power of results that are just significant is lower. The local power estimates below the x-axis suggest that results with z-scores between 2 and 3 (p < .05 & p > .005) have only 50% power. It is recommended to increase sample sizes for follow-up studies.

Third, the local power estimates also show that most non-significant results are false negatives (type-II errors). Z-scores between 1 and 2 are estimated to have 40% average power. It is unclear how often articles falsely infer that an effect does not exist or can be ignored because the test was not significant. Often sampling error alone is sufficient to explain differences between test statistics in the range from 1 to 2 and from 2 to 3.

Finally, 80% power is sufficient for a single focal test. However, with 80% power, multiple focal tests are likely to produce at least one non-significant result. If all focal tests are significant, there is a concern that questionable research practices were used (Schimmack, 2012).

Readers should also carefully examine the results of individual articles. The present results are based on automatic extraction of all statistical tests. If focal tests have only p-values in the range between .05 and .005, the results are less credible than if at least some p-values are below .005 (Schimmack, 2020).

In conclusion, Psychological Science has responded to concerns about a high rate of false positive results by increasing statistical power and reducing publication bias. This positive trend continued in 2020 under the leadership of the new editor Patricia Bauer.

How Replicable are Focal Hypothesis Tests in the Journal Psychological Science?

Over the past five years, psychological science has been in a crisis of confidence.  For decades, psychologists have assumed that published significant results provide strong evidence for theoretically derived predictions, especially when authors presented multiple studies with internal replications within a single article (Schimmack, 2012). However, even multiple significant results provide little empirical evidence, when journals only publish significant results (Sterling, 1959; Sterling et al., 1995).  When published results are selected for significance, statistical significance loses its ability to distinguish replicable effects from results that are difficult to replicate or results that are type-I errors (i.e., the theoretical prediction was false).

The crisis of confidence led to several initiatives to conduct independent replications. The most informative replication initiative was conducted by the Open Science Collaborative (Science, 2015).  It replicated close to 100 significant results published in three high-ranked psychology journals.  Only 36% of the replication studies replicated a statistically significant result.  The replication success rate varied by journal.  The journal “Psychological Science” achieved a success rate of 42%.

The low success rate raises concerns about the empirical foundations of psychology as a science.  Without further information, a success rate of 42% implies that it is unclear which published results provide credible evidence for a theory and which findings may not replicate.  It is impossible to conduct actual replication studies for all published studies.  Thus, it is highly desirable to identify replicable findings in the existing literature.

One solution is to estimate replicability for sets of studies based on the published test statistics (e.g., F-statistic, t-values, etc.).  Schimmack and Brunner (2016) developed a statistical method, Powergraphs, that estimates the average replicability of a set of significant results.  This method has been used to estimate replicability of psychology journals using automatic extraction of test statistics (2016 Replicability Rankings, Schimmack, 2017).  The results for Psychological Science produced estimates in the range from 55% to 63% for the years 2010-2016 with an average of 59%.   This is notably higher than the success rate for the actual replication studies, which only produced 42% successful replications.

There are two explanations for this discrepancy.  First, actual replication studies are not exact replication studies and differences between the original and the replication studies may explain some replication failures.  Second, the automatic extraction method may overestimate replicability because it may include non-focal statistical tests. For example, significance tests of manipulation checks can be highly replicable, but do not speak to the replicability of theoretically important predictions.

To address the concern about automatic extraction of test statistics, I estimated replicability of focal hypothesis tests in Psychological Science with hand-coded, focal hypothesis tests.  I used three independent data sets.

Study 1

For Study 1, I hand-coded focal hypothesis tests of all studies in the 2008 Psychological Science articles that were used for the OSC reproducibility project (Science, 2015).

OSC.PS

The powergraphs show the well-known effect of publication bias in that most published focal hypothesis tests report a significant result (p < .05, two-tailed, z > 1.96) or at least a marginally significant result (p < .10, two-tailed or p < .05, one-tailed, z > 1.65). Powergraphs estimate the average power of studies with significant results on the basis of the density distribution of significant z-scores.  Average power is an estimate of replicabilty for a set of exact replication studies.  The left graph uses all significant results. The right graph uses only z-scores greater than 2.4 because questionable research practices may produce many just-significant results and lead to biased estimates of replicability. However, both estimation methods produce similar estimates of replicability (57% & 61%).  Given the small number of statistics the 95%CI is relatively wide (left graph: 44% to 73%).  These results are compatible with the low actual success rate for actual replication studies (42%) and the estimate based on automated extraction (59%).

Study 2

The second dataset was provided by Motyl et al. (JPSP, in press), who coded a large number of articles from social psychology journals and psychological science. Importantly, they coded a representative sample of Psychological Science studies from the years 2003, 2004, 2013, and 2014. That is, they did not only code social psychology articles published in Psychological Science.  The dataset included 281 test statistics from Psychological Science.

PS.Motyl

The powergraph looks similar to the powergraph in Study 1.  More important, the replicability estimates are also similar (57% & 52%).  The 95%CI for Study 1 (44% to 73%) and Study 2 (left graph: 49% to 65%) overlap considerably.  Thus, two independent coding schemes and different sets of studies (2008 vs. 2003-2004/2013/2014) produce very similar results.

Study 3

Study 3 was carried out in collaboration with Sivaani Sivaselvachandran, who hand-coded articles from Psychological Science published in 2016.  The replicability rankings showed a slight positive trend based on automatically extracted test statistics.  The goal of this study was to examine whether hand-coding would also show an increase in replicability.  An increase was expected based on an editorial by D. Stephen Linday, incoming editor in 2015, who aimed to increase replicability of results published in Psychological Science by introducing badges for open data and preregistered hypotheses. However, the results failed to show a notable increase in average replicability.

PS.2016

The replicability estimate was similar to those in the first two studies (59% & 59%).  The 95%CI ranged from 49% to 70%. These wide confidence intervals make it difficult to notice small improvements, but the histogram shows that just significant results (z = 2 to 2.2) are still the most prevalent results reported in Psychological Science and that non-significant results that are to be expected are not reported.

Combined Analysis 

Given the similar results in all three studies, it made sense to pool the data to obtain the most precise estimate of replicability of results published in Psychological Science. With 479 significant test statistics, replicability was estimated at 58% with a 95%CI ranging from 51% to 64%.  This result is in line with the estimated based on automated extraction of test statistics (59%).  The reason for the close match between hand-coded and automated results could be that Psych Science publishes short articles and authors may report mostly focal results because space does not allow for extensive reporting of other statistics.  The hand-coded data confirm that replicabilty in Psychological Science is likely to be above 50%.

PS.combined

It is important to realize that the 58% estimate is an average.  Powergraphs also show average replicability for segments of z-scores. Here we see that replicabilty for just-significant results (z < 2.5 ~ p > .01) is only 35%. Even for z-score between 2.5 and 3.0 (~ p > .001) is only 47%.  Once z-scores are greater than 3, average replicabilty is above 50% and with z-scores greater than 4, replicability is greater than 80%.  For any single study, p-values can vary greatly due to sampling error, but in general a published result with a p-value < .001 is much more likely to replicate than a p-value > .01 (see also OSC, Science, 2015).

Conclusion

This blog-post used hand-coding of test-statistics published in Psychological Science, the flagship journal of the Association for Psychological Science, to estimate replicabilty of published results.  Three dataset produced convergent evidence that the average replicabilty of exact replication studies is 58% +/- 7%.  This result is consistent with estimates based on automatic extraction of test statistics.  It is considerably higher than the success rate of actual replication studies in the OSC reproducibility project (42%). One possible reason for this discrepancy is that actual replication studies are never exact replication studies, which makes it more difficult to obtain statistical significance if the original studies are selected for significance. For example, the original study may have had an outlier in the experimental group that helped to produce a significant result. Not removing this outlier is not considered a questionable research practice, but an exact replication study will not reproduce the same outlier and may fail to reproduce a just-significant result.  More broadly, any deviation from the assumptions underlying the computation of test statistics will increase the bias that is introduced by selecting significant results.  Thus, the 58% estimate is an optimistic estimate of the maximum replicability under ideal conditions.

At the same time, it is important to point out that 58% replicability for Psychological Science does not mean psychological science is rotten to the core (Motyl et al., in press) or that most reported results are false (Ioannidis, 2005).  Even results that did not replicate in actual replication studies are not necessarily false positive results.  It is possible that more powerful studies would produce a significant result, but with a smaller effect size estimate.

Hopefully, these analyses will spur further efforts to increase replicability of published results in Psychological Science and in other journals.  We are already near the middle of 2017 and can look forward to the 2017 results.

 

 

 

2016 Replicability Rankings of 103 Psychology Journals

Update: October 24, 2017.
The preliminary 2017 rankings are now available. They provide information for the years 2010-2017, updated analyses, and a correction in the estimates due to a computational error that lowered estimates by about 10 percentage points, on average.  Please check the newer rankings for the most reliable information.

—————————————————————————————————————————————–


I post the rankings on top.  Detailed information and statistical analysis are provided below the table.  You can click on the journal title to see Powergraphs for each year.

Rank   Journal Change 2017 2016 2015 2014 2013 2012 2011 2010
1 Social Indicators Research 10 90 70 65 75 65 72 73 73
2 Psychology of Music -13 81 59 67 61 69 85 84 72
3 Journal of Memory and Language 11 79 76 65 71 64 71 66 70
4 British Journal of Developmental Psychology -9 77 52 61 54 82 74 69 67
5 Journal of Occupational and Organizational Psychology 13 77 59 69 58 61 65 56 64
6 Journal of Comparative Psychology 13 76 71 77 74 68 61 66 70
7 Cognitive Psychology 7 75 73 72 69 66 74 66 71
8 Epilepsy & Behavior 5 75 72 79 70 68 76 69 73
9 Evolution & Human Behavior 16 75 57 73 55 38 57 62 60
10 International Journal of Intercultural Relations 0 75 43 70 75 62 67 62 65
11 Pain 5 75 70 75 67 64 65 74 70
12 Psychological Medicine 4 75 57 66 70 58 72 61 66
13 Annals of Behavioral Medicine 10 74 50 63 62 62 62 51 61
14 Developmental Psychology 17 74 72 73 67 61 63 58 67
15 Judgment and Decision Making -3 74 59 68 56 72 66 73 67
16 Psychology and Aging 6 74 66 78 65 74 66 66 70
17 Aggressive Behavior 16 73 70 66 49 60 67 52 62
18 Journal of Gerontology-Series B 3 73 60 65 65 55 79 59 65
19 Journal of Youth and Adolescence 13 73 66 82 67 61 57 66 67
20 Memory 5 73 56 79 70 65 64 64 67
21 Sex Roles 6 73 67 59 64 72 68 58 66
22 Journal of Experimental Psychology – Learning, Memory & Cognition 4 72 74 76 71 71 67 72 72
23 Journal of Social and Personal Relationships -6 72 51 57 55 60 60 75 61
24 Psychonomic Review and Bulletin 8 72 79 62 78 66 62 69 70
25 European Journal of Social Psychology 5 71 61 63 58 50 62 67 62
26 Journal of Applied Social Psychology 4 71 58 69 59 73 67 58 65
27 Journal of Experimental Psychology – Human Perception and Performance -4 71 68 72 69 70 78 72 71
28 Journal of Research in Personality 9 71 75 47 65 51 63 63 62
29 Journal of Child and Family Studies 0 70 60 63 60 56 64 69 63
30 Journal of Cognition and Development 5 70 53 62 54 50 61 61 59
31 Journal of Happiness Studies -9 70 64 66 77 60 74 80 70
32 Political Psychology 4 70 55 64 66 71 35 75 62
33 Cognition 2 69 68 70 71 67 68 67 69
34 Depression & Anxiety -6 69 57 66 71 77 77 61 68
35 European Journal of Personality 2 69 61 75 65 57 54 77 65
36 Journal of Applied Psychology 6 69 58 71 55 64 59 62 63
37 Journal of Cross-Cultural Psychology -4 69 74 69 76 62 73 79 72
38 Journal of Psychopathology and Behavioral Assessment -13 69 67 63 77 74 77 79 72
39 JPSP-Interpersonal Relationships and Group Processes 15 69 64 56 52 54 59 50 58
40 Social Psychology 3 69 70 66 61 64 72 64 67
41 Achive of Sexual Behavior -2 68 70 78 73 69 71 74 72
42 Journal of Affective Disorders 0 68 64 54 66 70 60 65 64
43 Journal of Experimental Child Psychology 2 68 71 70 65 66 66 70 68
44 Journal of Educational Psychology -11 67 61 66 69 73 69 76 69
45 Journal of Experimental Social Psychology 13 67 56 60 52 50 54 52 56
46 Memory and Cognition -3 67 72 69 68 75 66 73 70
47 Personality and Individual Differences 8 67 68 67 68 63 64 59 65
48 Psychophysiology -1 67 66 65 65 66 63 70 66
49 Cognitve Development 6 66 78 60 65 69 61 65 66
50 Frontiers in Psychology -8 66 65 67 63 65 60 83 67
51 Journal of Autism and Developmental Disorders 0 66 65 58 63 56 61 70 63
52 Journal of Experimental Psychology – General 5 66 69 67 72 63 68 61 67
53 Law and Human Behavior 1 66 69 53 75 67 73 57 66
54 Personal Relationships 19 66 59 63 67 66 41 48 59
55 Early Human Development 0 65 52 69 71 68 49 68 63
56 Attention, Perception and Psychophysics -1 64 69 70 71 72 68 66 69
57 Consciousness and Cognition -3 64 65 67 57 64 67 68 65
58 Journal of Vocactional Behavior 5 64 78 66 78 71 74 57 70
59 The Journal of Positive Psychology 14 64 65 79 51 49 54 59 60
60 Behaviour Research and Therapy 7 63 73 73 66 69 63 60 67
61 Child Development 0 63 66 62 65 62 59 68 64
62 Emotion -1 63 61 56 66 62 57 65 61
63 JPSP-Personality Processes and Individual Differences 1 63 56 56 59 68 66 51 60
64 Schizophrenia Research 1 63 65 68 64 61 70 60 64
65 Self and Identity -4 63 52 61 62 50 55 71 59
66 Acta Psychologica -6 63 66 69 69 67 68 72 68
67 Behavioral Brain Research -3 62 67 61 62 64 65 67 64
68 Child Psychiatry and Human Development 5 62 72 83 73 50 82 58 69
69 Journal of Child Psychology and Psychiatry and Allied Disciplines 10 62 62 56 66 64 45 55 59
70 Journal of Consulting and Clinical Psychology 0 62 56 50 54 59 58 57 57
71 Journal of Counseling Psychology -3 62 70 60 74 72 56 72 67
72 Behavioral Neuroscience 1 61 66 63 62 65 58 64 63
73 Developmental Science -5 61 62 60 62 66 65 65 63
74 Journal of Experimental Psychology – Applied -4 61 61 65 53 69 57 69 62
75 Journal of Social Psychology -11 61 56 55 55 74 70 63 62
76 Social Psychology and Personality Science -5 61 42 56 59 59 65 53 56
77 Cognitive Therapy and Research 0 60 68 54 67 70 62 58 63
78 Hormones & Behavior -1 60 55 55 54 55 60 58 57
79 Motivation and Emotion 1 60 60 57 57 51 73 52 59
80 Organizational Behavior and Human Decision Processes 3 60 63 65 61 68 67 51 62
81 Psychoneuroendocrinology 5 60 58 58 56 53 59 53 57
82 Social Development -10 60 50 66 62 65 79 57 63
83 Appetite -10 59 57 57 65 64 66 67 62
84 Biological Psychology -6 59 60 55 57 57 65 64 60
85 Journal of Personality Psychology 17 59 59 60 62 69 37 45 56
86 Psychological Science 6 59 63 60 63 59 55 56 59
87 Asian Journal of Social Psychology 0 58 76 67 56 71 64 64 65
88 Behavior Therapy 0 58 63 66 69 66 52 65 63
89 Britsh Journal of Social Psychology 0 58 57 44 59 51 59 55 55
90 Social Influence 18 58 72 56 52 33 59 46 54
91 Developmental Psychobiology -9 57 54 61 60 70 64 62 61
92 Journal of Research on Adolescence 2 57 59 61 82 71 75 40 64
93 Journal of Abnormal Psychology -5 56 52 57 58 55 66 55 57
94 Social Cognition -2 56 54 52 54 62 69 46 56
95 Personality and Social Psychology Bulletin 2 55 57 58 55 53 56 54 55
96 Cognition and Emotion -14 54 66 61 62 76 69 69 65
97 Health Psychology -4 51 67 56 72 54 69 56 61
98 Journal of Clinical Child and Adolescence Psychology 1 51 66 61 74 64 58 54 61
99 Journal of Family Psychology -7 50 52 63 61 57 64 55 57
100 Group Processes & Intergroup Relations -5 49 53 68 64 54 62 55 58
101 Infancy -8 47 44 60 55 48 63 51 53
102 Journal of Consumer Psychology -5 46 57 55 51 53 48 61 53
103 JPSP-Attitudes & Social Cognition -3 45 69 62 39 54 54 62 55

Notes.
1. Change scores are the unstandardized regression weights with replicabilty estimates as outcome variable and year as predictor variable.  Year was coded from 0 for 2010 to 1 for 2016 so that the regression coefficient reflects change over the full 7 year period. This method is preferable to a simple difference score because estimates in individual years are variable and are likely to overestimate change.
2. Rich E. Lucas, Editor of JRP, noted that many articles in JRP do not report t of F values in the text and that the replicability estimates based on these statistics may not be representative of the bulk of results reported in this journal.  Hand-coding of articles is required to address this problem and the ranking of JRP, and other journals, should be interpreted with caution (see further discussion of these issues below).

Introduction

I define replicability as the probability of obtaining a significant result in an exact replication of a study that produced a significant result.  In the past five years, it has become increasingly clear that psychology suffers from a replication crisis. Even results that are replicated internally by the same author multiple times fail to replicate in independent replication attempts (Bem, 2011).  The key reason for the replication crisis is selective publishing of significant results (publication bias). While journals report over 95% significant results (Sterling, 1959; Sterling et al., 1995), a 2015 article estimated that less than 50% of these results can be replicated  (OSC, 2015).

The OSC reproducibility made an important contribution by demonstrating that published results in psychology have low replicability.  However, the reliance on actual replication studies has a a number of limitations.  First, actual replication studies are expensive or impossible (e.g., a longitudinal study spanning 20 years).  Second, studies selected for replication may not be representative because the replication team lacks expertise to replicate some studies. Finally, replication studies take time and replicability of recent studies may not be known for several years. This makes it difficult to rely on actual replication studies to rank journals and to track replicability over time.

Schimmack and Brunner (2016) developed a statistical method (z-curve) that makes it possible to estimate average replicability for a set of published results based on the original results in published articles.  This statistical approach to the estimation of replicability has several advantages over the use of actual replication studies.  Replicability can be assessed in real time, it can be estimated for all published results, and it can be used for expensive studies that are impossible to reproduce.  Finally, it has the advantage that actual replication studies can be criticized  (Gilbert, King, Pettigrew, & Wilson, 2016). Estimates of replicabilty based on original studies do not have this problem because they are based on published results in original articles.

Z-curve has been validated with simulation studies and can be used when replicability varies across studies and when there is selection for significance, and is superior to similar statistical methods that correct for publication bias (Brunner & Schimmack, 2016).  I use this method to estimate the average replicability of significant results published in 103 psychology journals. Separate estimates were obtained for the years from 2010, one year before the start of the replication crisis, to 2016 to examine whether replicability increased in response to discussions about replicability.  The OSC estimate of replicability was based on articles published in 2008 and it was limited to three journals.  I posted replicability estimates based on z-curve for the year 2015 (2015 replicability rankings).  There was no evidence that replicability had increased during this time period.

The main empirical question was whether the 2016 rankings show some improvement in replicability and whether some journals or disciplines have responded more strongly to the replication crisis than others.

A second empirical question was whether replicabilty varies across disciplines.  The OSC project provided first evidence that traditional cognitive psychology is more replicable than social psychology.  Replicability estimates with z-curve confirmed this finding.  In the 2015 rankings, The Journal of Experimental Psychology: Learning, Memory and Cognition ranked 25 with a replicability estimate of 74, whereas the two social psychology sections of the Journal of Personality and Social Psychology ranked 73 and 99 (68% and 60% replicability estimates).  For this post, I conducted more extensive analyses of disciplines.

Journals

The 103 journals that are included in these rankings were mainly chosen based on impact factors.  The list also includes diverse areas of psychology, including cognitive, developmental, social, personality, clinical, biological, and applied psychology.  The 2015 list included some new journals that started after 2010.  These journals were excluded from the 2016 rankings to avoid missing values in statistical analyses of time trends.  A few journals were added to the list and the results may change when more journals are added to the list.

The journals were classified into 9 categories: social (24), cognitive (12), development (15), clinical/medical (19), biological (8), personality (5), and applied(IO,education) (8).  Two journals were classified as general (Psychological Science, Frontiers in Psychology). The last category included topical, interdisciplinary journals (emotion, positive psychology).

Data 

All PDF versions of published articles were downloaded and converted into text files. The 2015 rankings were based on conversions with the free program pdf2text pilot.  The 2016 program used a superior conversion program pdfzilla.  Text files were searched for reports of statistical results using my own R-code (z-extraction). Only F-tests, t-tests, and z-tests were used for the rankings. t-values that were reported without df were treated as z-values which leads to a slight inflation in replicability estimates. However, the bulk of test-statistics were F-values and t-values with degrees of freedom.  A comparison of the 2015 rankings using the old method and the new method shows that extraction methods have an influence on replicability estimates some differences (r = .56). One reason for the low correlation is that replicability estimates have a relatively small range (50-80%) and low retest correlations. Thus, even small changes can have notable effects on rankings. For this reason, time trends in replicability have to be examined at the aggregate level of journals or over longer time intervals. The change score of a single journal from 2015 to 2016 is not a reliable measure of improvement.

Data Analysis

The data for each year were analyzed using z-curve Schimmack and Brunner (2016).  The results of individual analysis are presented in Powergraphs. Powergraphs for each journal and year are provided as links to the journal names in the table with the rankings.  Powergraphs convert test statistics into absolute z-scores as a common metric for the strength of evidence against the null-hypothesis.  Absolute z-scores greater than 1.96 (p < .05, two-tailed) are considered statistically significant. The distribution of z-scores greater than 1.96 is used to estimate the average true power (not observed power) of the set of significant studies. This estimate is an estimate of replicability for a set of exact replication studies because average power determines the percentage of statistically significant results.  Powergraphs provide additional information about replicability for different ranges of z-scores (z-values between 2 and 2.5 are less replicable than those between 4 and 4.5).  However, for the replicability rankings only the replicability estimate is used.

Results

Table 1 shows the replicability estimates sorted by replicability in 2016.

The data were analyzed with a growth model to examine time trends and variability across journals and disciplines using MPLUS7.4.  I compared three models. Model 1 assumed no mean level changes and variability across journals. Model 2 assumed a linear increase. Model 3 tested assumed no change from 2010 to 2015 and allowed for an increase in 2016.

Model 1 had acceptable fit (RMSEA = .043, BIC = 5004). Model 2 increased fit (RMSEA = 0.029, BIC = 5005), but BIC slightly favored the more parsimonious Model 1. Model 3 had the best fit (RMSEA = .000, BIC = 5001).  These results reproduce the results of the 2015 analysis that there was no improvement from 2010 to 2015, but there is some evidence that replicability increased in 2016.  Adding a variance component to slope in Model 3 produced an unidentified model. Subsequent analyses show that this is due to insufficient power to detect variation across journals in changes over time.

The standardized loadings of individual years on the latent intercept factor ranged from .49 to .58.  This shows high variabibility in replicability estimates from year to year. Most of the rank changes can be attributed to random factors.  A better way to compare journals is to average across years.  A moving average of five years will provide reliable information and allow for improvement over time.  The reliability of the 5-year average for the years 2012 to 2016 is 68%.

Figure 1 shows the annual averages with 95%CI as well relative to the average over the full 7-year period.

rep-by-year

A paired t-test confirmed that average replicability in 2016 was significantly higher (M = 65, SD = 8) than in the previous years (M = 63, SD = 8), t(101) = 2.95, p = .004.  This is the first evidence that psychological scientists are responding to the replicability crisis by publishing slightly more replicable results.  Of course, this positive result has to be tempered by the small effect size.  But if this trend continuous or even increases, replicability could reach 80% in 10 years.

The next analysis examined changes in replicabilty at the level of individual journals. Replicability estimates were regressed on a dummy variable that contrasted 2016 with the previous years.  This analysis produced only 7 significant increases with p < .05 (one-tailed), which is only 2 more significant results than would be expected by chance alone. Thus, the analysis failed to identify particular journals that contribute to the improvement in the average.  Figure 2 compares the observed distribution of t-values to the predicted distribution based on the null-hypothesis (no change).

t-value Distribution.png

The blue line shows the observed density distribution, which is slightly moved to the right, but there is no set of journals with notably larger t-values.  A more sustained and larger increase in replicability is needed to detect variability in change scores.

The next analyses examine stable differences between disciplines.  The first analysis compared cognitive journals to social journals.  No statistical tests are needed to see that cognitive journals publish more replicable results than social journals. This finding confirms the results with actual replications of studies published in 2008 (OSC, 2015). The Figure suggests that the improvement in 2016 is driven more by social journals, but only 2017 data can tell whether there is a real improvement in social psychology.

replicability.cog.vs.soc.png

The next Figure shows the results for 5 personality journals.  The large confidence intervals show that there is considerable variability among personality journals. The Figure shows the averages for cognitive and social psychology as horizontal lines. The average for personality is only slightly above the average for social and like social, personality shows an upward trend.  In conclusion, personality and social psychology look very similar.  This may be due to considerable overlap between the two disciplines, which is also reflected in shared journals.  Larger differences may be visible for specialized social journals that focus on experimental social psychology.

replicability-personality

The results for developmental journals show no clear time trend and the average is just about in the middle between cognitive and social psychology.  The wide confidence intervals suggest that there is considerable variability among developmental journals. Table 1 shows Developmental Psychology ranks 14 / 103 and Infancy ranks 101/103. The low rank for Infancy may be due to the great difficulty of measuring infant behavior.

replicability-developmental

The clinical/medical journals cover a wide range of topics from health psychology to special areas of psychiatry.  There has been some concern about replicability in medical research (Ioannidis, 2005). The results for clinical are similar to those for developmental journals. Replicability is lower than for cognitive psychology and higher than for social psychology.  This may seem surprising because patient populations and samples tend to be smaller. However, a randomized controlled intervention study uses pre-post designs to boost power, whereas social and personality psychologists use comparisons across individuals, which requires large samples to reduce sampling error.

replicability-clinical

The set of biological journals is very heterogeneous and small. It includes neuroscience and classic peripheral physiology.  Despite wide confidence intervals replicability for biological journals is significantly lower than replicabilty for cognitive psychology. There is no notable time trend. The average is slightly above the average for social journals.

replicability.biological.png

The last category are applied journals. One journal focuses on education. The other journals focus on industrial and organizational psychology.  Confidence intervals are wide, but replicabilty is generally lower than for cognitive psychology. There is no notable time trend for this set of journals.

replicability.applied.png

Given the stability of replicability, I averaged replicability estimates across years. The last figure shows a comparison of disciplines based on these averages.  The figure shows that social psychology is significantly below average and cognitive psychology is significantly above average with the other disciplines falling in the middle.  All averages are significantly above 50% and below 80%.

Discussion

The most exciting finding is that repicability appears to have increased in 2016. This increase is remarkable because averages in the years before consistently tracked the average of 63.  The increase by 2 percentage points in 2016 is not large, but it may represent a first response to the replication crisis.

The increase is particularly remarkable because statisticians have been sounding the alarm bells about low power and publication bias for over 50 years (Cohen, 1962; Sterling, 1959), but these warnings have had no effect on research practices. In 1989, Sedlmeier and Gigerenzer (1989) noted that studies of statistical power had no effect on the statistical power of studies.  The present results provide the first empirical evidence that psychologists are finally starting to change their research practices.

However, the results also suggest that most journals continue to publish articles with low power.  The replication crisis has affected social psychology more than other disciplines with fierce debates in journals and on social media (Schimmack, 2016).  On the one hand, the comparisons of disciplines supports the impression that social psychology has a bigger replicability problem than other disciplines. However, the differences between disciplines are small. With the exception of cognitive psychology, other disciplines are not a lot more replicable than social psychology.  The main reason for the focus on social psychology is probably that these studies are easier to replicate and that there have been more replication studies in social psychology in recent years.  The replicability rankings predict that other disciplines would also see a large number of replication failures, if they would subject important findings to actual replication attempts.  Only empirical data will tell.

Limitations

The main limitation of replicability rankings is that the use of an automatic extraction method does not distinguish theoretically important hypothesis tests and other statistical tests.  Although this is a problem for the interpretation of the absolute estimates, it is less important for the comparison over time.  Any changes in research practices that reduce sampling error (e.g., larger samples, more reliable measures) will not only strengthen the evidence for focal hypothesis tests, but also increase the strength of evidence for non-focal hypothesis tests.

Schimmack and Brunner (2016) compared replicability estimates with actual success rates in the OSC (2015) replication studies.  They found that the statistical method overestimates replicability by about 20%.  Thus, the absolute estimates can be interpreted as very optimistic estimates.  There are several reasons for this overestimation.  One reason is that the estimation method assumes that all results with a p-value greater than .05 are equally likely to be published. If there are further selection mechanisms that favor smaller p-values, the method overestimates replicability.  For example, sometimes researchers correct for multiple comparisons and need to meet a more stringent significance criterion.  Only careful hand-coding of research articles can provide more accurate estimates of replicability.  Schimmack and Brunner (2016) hand-coded the articles that were included in the OSC (2015) article and still found that the method overestimated replicability.  Thus, the absolute values need to be interpreted with great caution and success rates of actual replication studies are expected to be at least 10% lower than these estimates.

Implications

Power and replicability have been ignored for over 50 years.  A likely reason is that replicability is difficult to measure.  A statistical method for the estimation of replicability changes this. Replicability estimates of journals make it possible for editors to compete with other journals in the replicability rankings. Flashy journals with high impact factors may publish eye-catching results, but if this journal has a reputation of publishing results that do not replicate, they are not very likely to have a big impact.  Science is build on trust and trust has to be earned and can be easily lost.  Eventually, journals that publish replicable results may also increase their impact because more researchers are going to build on replicable results published in these journals.  In this way, replicability rankings can provide a much needed correction to the current incentive structure in science that rewards publishing as many articles as possible without any concerns about the replicability of these results. This reward structure is undermining science.  It is time to change it. It is no longer sufficient to publish a significant result, if this result cannot be replicate in other labs.

Many scientists feel threatened by changes in the incentive structure and the negative consequences of replication failures for their reputation. However, researchers have control over their reputation.  First, researchers often carry out many conceptually related studies. In the past, it was acceptable to publish only the studies that worked (p < .05). This selection for significance by researchers is the key factor in the replication crisis. The researchers who are conducting the studies are fully aware that it was difficult to get a significant result, but the selective reporting of these successes produces inflated effect size estimates and an illusion of high replicability that inevitably lead to replication failures.  To avoid these embarrassing replication failures researchers need to report results of all studies or conduct fewer studies with high power.  The 2016 rankings suggest that some researchers have started to change, but we will have to wait until 2017 to see whether 2017 can replicate the positive trend in the 2016 rankings.

2015 Replicability Ranking of 100+ Psychology Journals

Replicability rankings of psychology journals differs from traditional rankings based on impact factors (citation rates) and other measures of popularity and prestige. Replicability rankings use the test statistics in the results sections of empirical articles to estimate the average power of statistical tests in a journal. Higher average power means that the results published in a journal have a higher probability to produce a significant result in an exact replication study and a lower probability of being false-positive results.

The rankings are based on statistically significant results only (p < .05, two-tailed) because only statistically significant results can be used to interpret a result as evidence for an effect and against the null-hypothesis.  Published non-significant results are useful for meta-analysis and follow-up studies, but they provide insufficient information to draw statistical inferences.

The average power across the 105 psychology journals used for this ranking is 70%. This means that a representative sample of significant results in exact replication studies is expected to produce 70% significant results. The rankings for 2015 show variability across journals with average power estimates ranging from 84% to 54%.  A factor analysis of annual estimates for 2010-2015 showed that random year-to-year variability accounts for 2/3 of the variance and that 1/3 is explained by stable differences across journals.

The Journal Names are linked to figures that show the powergraphs of a journal for the years 2010-2014 and 2015. The figures provide additional information about the number of tests used, confidence intervals around the average estimate, and power estimates that estimate power including non-significant results even if these are not reported (the file-drawer).

Rank   Journal 2010/14 2015
1   Social Indicators Research   81   84
2   Journal of Happiness Studies   81   83
3   Journal of Comparative Psychology   72   83
4   International Journal of Psychology   80   81
5   Journal of Cross-Cultural Psychology   78   81
6   Child Psychiatry and Human Development   75   81
7   Psychonomic Review and Bulletin   72   80
8   Journal of Personality   72   79
9   Journal of Vocational Behavior   79   78
10   British Journal of Developmental Psychology   75   78
11   Journal of Counseling Psychology   72   78
12   Cognitve Development   69   78
13   JPSP: Personality Processes
and Individual Differences
  65   78
14   Journal of Research in Personality   75   77
15   Depression & Anxiety   74   77
16   Asian Journal of Social Psychology   73   77
17   Personnel Psychology   78   76
18   Personality and Individual Differences   74   76
19   Personal Relationships   70   76
20   Cognitive Science   77   75
21   Memory and Cognition   73   75
22   Early Human Development   71   75
23   Journal of Sexual Medicine   76   74
24   Journal of Applied Social Psychology   74   74
25   Journal of Experimental Psychology: Learning, Memory & Cognition   74   74
26   Journal of Youth and Adolescence   72   74
27   Social Psychology   71   74
28   Journal of Experimental Psychology: Human Perception and Performance   74   73
29   Cognition and Emotion   72   73
30   Journal of Affective Disorders   71   73
31   Attention, Perception and Psychophysics   71   73
32   Evolution & Human Behavior   68   73
33   Developmental Science   68   73
34   Schizophrenia Research   66   73
35   Achive of Sexual Behavior   76   72
36   Pain   74   72
37    Acta Psychologica   72   72
38   Cognition   72   72
39   Journal of Experimental Child Psychology   72   72
40   Aggressive Behavior   72   72
41   Journal of Social Psychology   72   72
42   Behaviour Research and Therapy   70   72
43   Frontiers in Psychology   70   72
44   Journal of Autism and Developmental Disorders   70   72
45   Child Development   69   72
46   Epilepsy & Behavior   75   71
47   Journal of Child and Family Studies   72   71
48   Psychology of Music   71   71
49   Psychology and Aging   71   71
50   Journal of Memory and Language   69   71
51   Journal of Experimental Psychology: General   69   71
52   Psychotherapy   78   70
53   Developmental Psychology   71   70
54   Behavior Therapy   69   70
55   Judgment and Decision Making   68   70
56   Behavioral Brain Research   68   70
57   Social Psychology and Personality Science   62   70
58   Political Psychology   75   69
59   Cognitive Psychology   74   69
60   Organizational Behavior and Human Decision Processes   69   69
61   Appetite   69   69
62   Motivation and Emotion   69   69
63   Sex Roles   68   69
64   Journal of Experimental Psychology: Applied   68   69
65   Journal of Applied Psychology   67   69
66   Behavioral Neuroscience   67   69
67   Psychological Science   67   68
68   Emotion   67   68
69   Developmental Psychobiology   66   68
70   European Journal of Social Psychology   65   68
71   Biological Psychology   65   68
72   British Journal of Social Psychology   64   68
73   JPSP: Attitudes & Social Cognition   62   68
74   Animal Behavior   69   67
75   Psychophysiology   67   67
76   Journal of Child Psychology and Psychiatry and Allied Disciplines   66   67
77   Journal of Research on Adolescence   75   66
78   Journal of Educational Psychology   74   66
79   Clinical Psychological Science   69   66
80   Consciousness and Cognition   69   66
81   The Journal of Positive Psychology   65   66
82   Hormones & Behavior   64   66
83   Journal of Clinical Child and
Adolescence Psychology
  62   66
84   Journal of Gerontology: Series B   72   65
85   Psychological Medicine   66   65
86   Personalit and Social Psychology
Bulletin
  64   64
87   Infancy   61   64
88   Memory   75   63
89   Law and Human Behavior   70   63
90   Group Processes & Intergroup Relations   70   63
91   Journal of Social and Personal Relationships   69   63
92   Cortex   67   63
93   Journal of Abnormal Psychology   64   63
94   Journal of Consumer Psychology   60   63
95   Psychology of Violence   71   62
96   Psychoneuroendocrinology   63   62
97   Health Psychology   68   61
98   Journal of Experimental Social
Psychology
  59   61
99   JPSP: Interpersonal Relationships
and Group Processes
  60   60
100   Social Cognition   65   59
101   Journal of Consulting and Clinical Psychology   63   58
102   European Journal of Personality   72   57
103   Journal of Family Psychology   60   57
104   Social Development   75   55
105   Annals of Behavioral Medicine   65   54
106   Self and Identity   63   54

REPLICABILITY RANKING OF 26 PSYCHOLOGY JOURNALS

THEORETICAL BACKGROUND

Neyman & Pearson (1933) developed the theory of type-I and type-II errors in statistical hypothesis testing.

A type-I error is defined as the probability of rejecting the null-hypothesis (i.e., the effect size is zero) when the null-hypothesis is true.

A type-II error is defined as the probability of failing to reject the null-hypothesis when the null-hypothesis is false (i.e., there is an effect).

A common application of statistics is to provide empirical evidence for a theoretically predicted relationship between two variables (cause-effect or covariation). The results of an empirical study can produce two outcomes. Either the result is statistically significant or it is not statistically significant. Statistically significant results are interpreted as support for a theoretically predicted effect.

Statistically non-significant results are difficult to interpret because the prediction may be false (the null-hypothesis is true) or a type-II error occurred (the theoretical prediction is correct, but the results fail to provide sufficient evidence for it).

To avoid type-II errors, researchers can design studies that reduce the type-II error probability. The probability of avoiding a type-II error when a predicted effect exists is called power. It could also be called the probability of success because a significant result can be used to provide empirical support for a hypothesis.

Ideally researchers would want to maximize power to avoid type-II errors. However, powerful studies require more resources. Thus, researchers face a trade-off between the allocation of resources and their probability to obtain a statistically significant result.

Jacob Cohen dedicated a large portion of his career to help researchers with the task of planning studies that can produce a successful result, if the theoretical prediction is true. He suggested that researchers should plan studies to have 80% power. With 80% power, the type-II error rate is still 20%, which means that 1 out of 5 studies in which a theoretical prediction is true would fail to produce a statistically significant result.

Cohen (1962) examined the typical effect sizes in psychology and found that the typical effect size for the mean difference between two groups (e.g., men and women or experimental vs. control group) is about half-of a standard deviation. The standardized effect size measure is called Cohen’s d in his honor. Based on his review of the literature, Cohen suggested that an effect size of d = .2 is small, d = .5 moderate, and d = .8. Importantly, a statistically small effect size can have huge practical importance. Thus, these labels should not be used to make claims about the practical importance of effects. The main purpose of these labels is that researchers can better plan their studies. If researchers expect a large effect (d = .8), they need a relatively small sample to have high power. If researchers expect a small effect (d = .2), they need a large sample to have high power.   Cohen (1992) provided information about effect sizes and sample sizes for different statistical tests (chi-square, correlation, ANOVA, etc.).

Cohen (1962) conducted a meta-analysis of studies published in a prominent psychology journal. Based on the typical effect size and sample size in these studies, Cohen estimated that the average power in studies is about 60%. Importantly, this also means that the typical power to detect small effects is less than 60%. Thus, many studies in psychology have low power and a high type-II error probability. As a result, one would expect that journals often report that studies failed to support theoretical predictions. However, the success rate in psychological journals is over 90% (Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995). There are two explanations for discrepancies between the reported success rate and the success probability (power) in psychology. One explanation is that researchers conduct multiple studies and only report successful studies. The other studies remain unreported in a proverbial file-drawer (Rosenthal, 1979). The other explanation is that researchers use questionable research practices to produce significant results in a study (John, Loewenstein, & Prelec, 2012). Both practices have undesirable consequences for the credibility and replicability of published results in psychological journals.

A simple solution to the problem would be to increase the statistical power of studies. If the power of psychological studies in psychology were over 90%, a success rate of 90% would be justified by the actual probability of obtaining significant results. However, meta-analysis and method articles have repeatedly pointed out that psychologists do not consider statistical power in the planning of their studies and that studies continue to be underpowered (Maxwell, 2004; Schimmack, 2012; Sedlmeier & Giegerenzer, 1989).

One reason for the persistent neglect of power could be that researchers have no awareness of the typical power of their studies. This could happen because observed power in a single study is an imperfect indicator of true power (Yuan & Maxwell, 2005). If a study produced a significant result, the observed power is at least 50%, even if the true power is only 30%. Even if the null-hypothesis is true, and researchers publish only type-I errors, observed power is dramatically inflated to 62%, when the true power is only 5% (the type-I error rate). Thus, Cohen’s estimate of 60% power is not very reassuring.

Over the past years, Schimmack and Brunner have developed a method to estimate power for sets of studies with heterogeneous designs, sample sizes, and effect sizes. A technical report is in preparation. The basic logic of this approach is to convert results of all statistical tests into z-scores using the one-tailed p-value of a statistical test.  The z-scores provide a common metric for observed statistical results. The standard normal distribution predicts the distribution of observed z-scores for a fixed value of true power.   However, for heterogeneous sets of studies the distribution of z-scores is a mixture of standard normal distributions with different weights attached to various power values. To illustrate this method, the histograms of z-scores below show simulated data with 10,000 observations with varying levels of true power: 20% null-hypotheses being true (5% power), 20% of studies with 33% power, 20% of studies with 50% power, 20% of studies with 66% power, and 20% of studies with 80% power.

RepRankSimulation

The plot shows the distribution of absolute z-scores (there are no negative effect sizes). The plot is limited to z-scores below 6 (N = 99,985 out of 10,000). Z-scores above 6 standard deviations from zero are extremely unlikely to occur by chance. Even with a conservative estimate of effect size (lower bound of 95% confidence interval), observed power is well above 99%. Moreover, quantum physics uses Z = 5 as a criterion to claim success (e.g., discovery of Higgs-Boson Particle). Thus, Z-scores above 6 can be expected to be highly replicable effects.

Z-scores below 1.96 (the vertical dotted red line) are not significant for the standard criterion of (p < .05, two-tailed). These values are excluded from the calculation of power because these results are either not reported or not interpreted as evidence for an effect. It is still important to realize that true power of all experiments would be lower if these studies were included because many of the non-significant results are produced by studies with 33% power. These non-significant results create two problems. Researchers wasted resources on studies with inconclusive results and readers may be tempted to misinterpret these results as evidence that an effect does not exist (e.g., a drug does not have side effects) when an effect is actually present. In practice, it is difficult to estimate power for non-significant results because the size of the file-drawer is difficult to estimate.

It is possible to estimate power for any range of z-scores, but I prefer the range of z-scores from 2 (just significant) to 4. A z-score of 4 has a 95% confidence interval that ranges from 2 to 6. Thus, even if the observed effect size is inflated, there is still a high chance that a replication study would produce a significant result (Z > 2). Thus, all z-scores greater than 4 can be treated as cases with 100% power. The plot also shows that conclusions are unlikely to change by using a wider range of z-scores because most of the significant results correspond to z-scores between 2 and 4 (89%).

The typical power of studies is estimated based on the distribution of z-scores between 2 and 4. A steep decrease from left to right suggests low power. A steep increase suggests high power. If the peak (mode) of the distribution were centered over Z = 2.8, the data would conform to Cohen’s recommendation to have 80% power.

Using the known distribution of power to estimate power in the critical range gives a power estimate of 61%. A simpler model that assumes a fixed power value for all studies produces a slightly inflated estimate of 63%. Although the heterogeneous model is correct, the plot shows that the homogeneous model provides a reasonable approximation when estimates are limited to a narrow range of Z-scores. Thus, I used the homogeneous model to estimate the typical power of significant results reported in psychological journals.

DATA

The results presented below are based on an ongoing project that examines power in psychological journals (see results section for the list of journals included so far). The set of journals does not include journals that primarily publish reviews and meta-analysis or clinical and applied journals. The data analysis is limited to the years from 2009 to 2015 to provide information about the typical power in contemporary research. Results regarding historic trends will be reported in a forthcoming article.

I downloaded pdf files of all articles published in the selected journals and converted the pdf files to text files. I then extracted all t-tests and F-tests that were reported in the text of the results section searching for t(df) or F(df1,df2). All t and F statistics were converted into one-tailed p-values and then converted into z-scores.

RepRankAll

The plot above shows the results based on 218,698 t and F tests reported between 2009 and 2015 in the selected psychology journals. Unlike the simulated data, the plot shows a steep drop for z-scores just below the threshold of significance (z = 1.96). This drop is due to the tendency not to publish or report non-significant results. The heterogeneous model uses the distribution of non-significant results to estimate the size of the file-drawer (unpublished non-significant results). However, for the present purpose the size of the file-drawer is irrelevant because power is estimated only for significant results for Z-scores between 2 and 4.

The green line shows the best fitting estimate for the homogeneous model. The red curve shows fit of the heterogeneous model. The heterogeneous model is doing a much better job at fitting the long tail of highly significant results, but for the critical interval of z-scores between 2 and 4, the two models provide similar estimates of power (55% homogeneous & 53% heterogeneous model).   If the range is extended to z-scores between 2 and 6, power estimates diverge (82% homogenous, 61% heterogeneous). The plot indicates that the heterogeneous model fits the data better and that the 61% estimate is a better estimate of true power for significant results in this range. Thus, the results are in line with Cohen (1962) estimate that psychological studies average 60% power.

REPLICABILITY RANKING

The distribution of z-scores between 2 and 4 was used to estimate the average power separately for each journal. As power is the probability to obtain a significant result, this measure estimates the replicability of results published in a particular journal if researchers would reproduce the studies under identical conditions with the same sample size (exact replication). Thus, even though the selection criterion ensured that all tests produced a significant result (100% success rate), the replication rate is expected to be only about 50%, even if the replication studies successfully reproduce the conditions of the published studies. The table below shows the replicability ranking of the journals, the replicability score, and a grade. Journals are graded based on a scheme that is similar to grading schemes for undergraduate students (below 50 = F, 50-59 = E, 60-69 = D, 70-79 = C, 80-89 = B, 90+ = A).

ReplicabilityRanking

The average value in 2000-2014 is 57 (D+). The average value in 2015 is 58 (D+). The correlation for the values in 2010-2014 and those in 2015 is r = .66.   These findings show that the replicability scores are reliable and that journals differ systematically in the power of published studies.

LIMITATIONS

The main limitation of the method is that focuses on t and F-tests. The results might change when other statistics are included in the analysis. The next goal is to incorporate correlations and regression coefficients.

The second limitation is that the analysis does not discriminate between primary hypothesis tests and secondary analyses. For example, an article may find a significant main effect for gender, but the critical test is whether gender interacts with an experimental manipulation. It is possible that some journals have lower scores because they report more secondary analyses with lower power. To address this issue, it will be necessary to code articles in terms of the importance of statistical test.

The ranking for 2015 is based on the currently available data and may change when more data become available. Readers should also avoid interpreting small differences in replicability scores as these scores are likely to fluctuate. However, the strong correlation over time suggests that there are meaningful differences in the replicability and credibility of published results across journals.

CONCLUSION

This article provides objective information about the replicability of published findings in psychology journals. None of the journals reaches Cohen’s recommended level of 80% replicability. Average replicability is just about 50%. This finding is largely consistent with Cohen’s analysis of power over 50 years ago. The publication of the first replicability analysis by journal should provide an incentive to editors to increase the reputation of their journal by paying more attention to the quality of the published data. In this regard, it is noteworthy that replicability scores diverge from traditional indicators of journal prestige such as impact factors. Ideally, the impact of an empirical article should be aligned with the replicability of the empirical results. Thus, the replicability index may also help researchers to base their own research on credible results that are published in journals with a high replicability score and to avoid incredible results that are published in journals with a low replicability score. Ultimately, I can only hope that journals will start competing with each other for a top spot in the replicability rankings and as a by-product increase the replicability of published findings and the credibility of psychological science.