
Personalized P-Values for 200+ Social/Personality Psychologists

Last update 1/24/2021 (The table will be updated when new information becomes available).

Introduction

Since Fisher invented null-hypothesis significance testing, researchers have used p < .05 as a statistical criterion to interpret results as discoveries worthy of discussion (i.e., the null-hypothesis is false). Once published, these results are often treated as real findings even though alpha does not control the risk of false discoveries.

Statisticians have warned against the exclusive reliance on p < .05, but nearly 100 years after Fisher popularized this approach, it is still the most common way to interpret data. The main reason is that many attempts to improve on this practice have failed. The main problem is that a single statistical result is difficult to interpret. However, when individual results are interpreted in the context of other results, they become more informative. Based on the distribution of p-values it is possible to estimate the maximum false discovery rate (Bartos & Schimmack, 2020; Jager & Leek, 2014). This approach can be applied to the p-values published by individual authors to adjust p-values to keep the risk of false discoveries at a reasonable level, FDR < .05.

Researchers who mainly test true hypotheses with high power have a high discovery rate (many p-values below .05) and a low false discovery rate (FDR < .05). Figure 1 shows an example of a researcher who followed this strategy (for a detailed description of z-curve plots, see Schimmack, 2021).

We see that out of the 317 test-statistics retrieved from his articles, 246 were significant with alpha = .05. This is an observed discovery rate of 78%. We also see that this discovery rate closely matches the estimated discovery rate based on the distribution of the significant p-values, p < .05. The EDR is 79%. With an EDR of 79%, the maximum false discovery rate is only 1%. However, the 95%CI is wide and the lower bound of the CI for the EDR, 27%, allows for 14% false discoveries.
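The numbers in this paragraph can be checked with a few lines of Python. This is only a sketch: it assumes the maximum false discovery rate is obtained from the discovery rate with Sorić's (1989) bound, FDR_max = (1/EDR - 1) * alpha / (1 - alpha), and the function names are made up for illustration. The EDR itself comes from the z-curve model and is simply taken as given here.

```python
def observed_discovery_rate(n_significant: int, n_tests: int) -> float:
    """Share of reported test statistics that are significant."""
    return n_significant / n_tests


def max_false_discovery_rate(edr: float, alpha: float = 0.05) -> float:
    """Upper bound on the false discovery rate for a given discovery rate.

    Assumption: Soric's bound, (1/EDR - 1) * alpha / (1 - alpha).
    """
    if not 0 < edr <= 1:
        raise ValueError("edr must be in (0, 1]")
    return (1 / edr - 1) * (alpha / (1 - alpha))


odr = observed_discovery_rate(246, 317)      # counts quoted in the text
fdr_point = max_false_discovery_rate(0.79)   # EDR point estimate
fdr_worst = max_false_discovery_rate(0.27)   # lower bound of the EDR interval

print(f"ODR = {odr:.0%}, max FDR = {fdr_point:.0%} (up to {fdr_worst:.0%})")
```

Under this assumption the sketch gives an ODR of about 78%, a maximum FDR of about 1% at the point estimate, and about 14% at the lower bound of the EDR interval, in line with the values reported above.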

When the ODR matches the EDR, there is no evidence of publication bias. In this case, we can improve the estimates by fitting all p-values, including the non-significant ones. With a tighter CI for the EDR, we see that the 95%CI for the maximum FDR ranges from 1% to 3%. Thus, we can be confident that no more than 5% of the significant results with alpha = .05 are false discoveries. Readers can therefore continue to use alpha = .05 to look for interesting discoveries in Matsumoto’s articles.

Figure 3 shows the results for a different type of researcher who took a risk and studied weak effect sizes with small samples. This produces many non-significant results that are often not published. The selection for significance inflates the observed discovery rate, but the z-curve plot and the comparison with the EDR show the influence of publication bias. Here the ODR is similar to Figure 1, but the EDR is only 11%. An EDR of 11% translates into a large maximum false discovery rate of 45%. In addition, the 95%CI of the EDR includes 5%, which means the risk of false positives could be even higher than 45%. In this case, using alpha = .05 to interpret results as discoveries is very risky. Clearly, p < .05 means something very different when reading an article by David Matsumoto or Shelly Chaiken.

Rather than dismissing all of Chaiken’s results, we can try to lower alpha to reduce the false discovery rate. If we set alpha to .001, most of the just significant results are no longer considered discoveries. Now the EDR is even higher than the ODR because a large pile of just significant results with alpha = .05 were observed, but not predicted by the model. Assuming that p-values below .001 come from a different population of studies, the FDR is now 6% and low enough to warrant inspection of the findings that meet the alpha = .001 threshold. This way 100 of the 277 results that are significant with alpha = .05 are still interpretable.

The rankings below are based on automatically extracted test-statistics from 40 journals (List of journals). The results should be interpreted with caution and treated as preliminary. They depend on the specific set of journals that were searched, the way results are being reported, and many other factors. The data are available (data.drop) and researchers can exclude articles or add articles and run their own analyses using the z-curve package in R (https://replicationindex.com/2020/01/10/z-curve-2-0/).

I am also happy to receive feedback about coding errors. I also recommend hand-coding articles to adjust alpha for focal hypothesis tests. This typically lowers the EDR and increases the FDR. For example, the automated method produced an EDR of 31 for Bargh, whereas hand-coding of focal tests produced an EDR of 12 (Bargh-Audit).

And here are the rankings. The results are fully automated and I was not able to cover up the fact that I placed only #108 out of 221 in the rankings. In another post, I will explain how researchers can move up in the rankings. Of course, one way to move up in the rankings is to increase statistical power. The rankings will be updated in a couple of months when articles from 2020 have been added.

Despite the preliminary nature, I am confident that the results provide valuable information. Until now, all p-values below .05 have been treated as if they are equally informative. The rankings here show that this is not the case. While p = .02 can be informative for one researcher, p = .002 may still entail a high false discovery risk for another researcher.

Name Tests ODR EDR ERR FDR ALPHA
Virgil Zeigler-Hill 676 76 80 86 1 .05
David Matsumoto 327 78 79 85 1 .05
Linda J. Skitka 434 74 73 81 2 .05
Steven J. Heine 566 78 73 80 2 .05
Mahzarin R. Banaji 786 73 72 78 2 .05
David P. Schmitt 241 78 72 80 2 .05
Kurt Gray 397 76 69 79 2 .05
Phoebe C. Ellsworth 551 76 69 75 2 .05
Richard W. Robins 245 82 68 76 2 .05
Michael E. McCullough 311 71 68 75 2 .05
Jim Sidanius 408 72 67 75 3 .05
Thomas N Bradbury 310 61 66 72 3 .05
Klaus Fiedler 1262 83 66 75 3 .05
James J. Gross 1022 73 64 76 3 .05
Barbara L. Fredrickson 240 84 64 74 3 .05
Joris Lammers 587 72 63 70 3 .05
Paul Rozin 373 78 62 80 3 .05
Margaret S. Clark 386 79 62 74 3 .05
Emily A. Impett 466 74 62 73 3 .05
Edward L. Deci 243 83 61 66 3 .05
Patricia G. Devine 618 72 60 67 3 .05
Alice H. Eagly 299 78 59 73 4 .05
Jean M. Twenge 374 75 58 63 4 .05
Elaine Fox 453 82 57 80 4 .05
Carol D. Ryff 207 86 55 74 4 .05
Richard M. Ryan 917 78 55 71 4 .05
Rainer Banse 396 81 55 77 4 .05
Rainer Reisenzein 213 66 55 71 4 .05
Lee Jussim 246 77 55 73 4 .05
William A. Cunningham 183 68 53 63 5 .01
Mark Schaller 533 71 53 63 5 .05
B. Keith Payne 684 69 51 71 5 .05
Jordan B. Peterson 288 66 51 81 5 .05
Leaf van Boven 500 78 51 68 5 .05
William B. Swann Jr. 963 79 50 79 5 .05
Daniel M. Wegner 593 80 50 64 5 .05
Shinobu Kitayama 820 77 49 73 6 < .001
Agneta H. Fischer 903 78 49 70 5 .05
Igor Grossmann 196 75 49 69 6 < .001
Brian A. Nosek 660 66 49 82 5 .05
Jennifer S. Lerner 166 84 49 66 5 .05
Richard E. Nisbett 290 78 49 71 5 .001
Tessa V. West 564 73 48 57 6 < .001
S. Alexander Haslam 1124 70 48 64 6 .01
Lisa Feldman Barrett 641 69 47 71 6 < .001
Constantine Sedikides 2294 71 47 72 6 .001
Edward P. Lemay 309 89 47 85 6 < .001
Harry T. Reis 596 73 47 65 6 .01
Paul K. Piff 153 78 46 63 6 < .001
Bertram Gawronski 1346 75 46 78 6 .005
Dacher Keltner 1211 75 46 65 6 < .001
Charles M. Judd 1161 79 46 68 6 < .001
Jens B. Asendorpf 213 82 46 72 6 .01
Susan T. Fiske 886 78 45 73 6 < .001
Nicholas O. Rule 886 69 45 73 6 .05
Jan De Houwer 1637 69 45 72 6 .01
Bernadette Park 1039 76 45 65 6 < .001
Dirk Wentura 699 73 44 70 7 .005
Mark J. Brandt 250 73 44 75 7 .001
Hazel Rose Markus 583 76 44 66 7 < .001
Craig A. Anderson 477 75 44 61 7 .01
Philip E. Tetlock 478 81 43 72 7 .001
Norbert Schwarz 1195 74 43 62 7 < .001
Niall Bolger 406 78 43 68 7 < .001
Paula M. Niedenthal 463 72 42 67 7 .001
Tiffany A. Ito 326 80 42 64 7 < .001
Carol S. Dweck 918 72 42 66 7 .001
Michael Inzlicht 517 64 42 58 7 .001
Ursula Hess 709 78 41 74 8 < .001
Stacey Sinclair 322 69 41 56 8 .01
Duane T. Wegener 901 76 41 60 8 < .001
Jessica L. Tracy 482 77 41 75 7 .001
Richard E. Petty 2536 70 41 63 8 < .001
Malte Friese 517 63 41 60 8 .05
Fritz Strack 628 76 41 59 8 .001
Christian S. Crandall 359 75 40 58 8 < .001
John T. Cacioppo 329 78 40 67 8 < .001
Mario Mikulincer 868 88 40 63 8 < .001
Tobias Greitemeyer 1680 71 39 67 8 .001
Michael D. Robinson 1357 78 39 66 8 .005
Wendi L. Gardner 766 68 39 63 8 .01
John F. Dovidio 1798 70 38 62 9 < .001
C. Nathan DeWall 1147 74 37 60 9 .005
Eva Walther 478 83 37 68 9 .01
Antony S. R. Manstead 1497 78 37 63 9 < .001
Jerry Suls 380 77 37 68 9 .01
David M. Buss 342 79 37 78 9 < .001
Batja Mesquita 249 78 36 73 10 .001
Thomas Gilovich 1037 81 36 67 10 .001
Kerry Kawakami 465 68 36 57 9 < .001
Samuel L. Gaertner 300 75 36 61 9 < .001
Lorne Campbell 383 74 36 62 9 .005
Caryl E. Rusbult 225 68 35 61 10 < .001
Steven J. Sherman 808 80 35 66 10 .001
Anthony G. Greenwald 398 73 35 83 10 .01
Nalini Ambady 1167 65 35 58 10 < .001
Matthew Feinberg 249 77 35 70 10 .001
Ulrich Schimmack 337 77 35 68 10 < .001
Claude M. Steele 424 78 34 47 10 .01
Jennifer Crocker 424 68 34 61 10 < .001
Dale T. Miller 441 74 34 66 10 .005
Azim F. Shariff 155 79 34 73 10 .01
Marcel Zeelenberg 776 80 34 83 10 .001
Jeffry A. Simpson 657 77 33 58 11 < .001
Nilanjana Dasgupta 406 75 33 56 10 .005
Daphna Oyserman 474 54 33 53 11 < .001
Russell H. Fazio 995 69 33 59 11 .01
Jennifer A. Richeson 801 68 33 50 11 < .001
Shigehiro Oishi 914 67 33 63 11 .001
Emily Balcetis 488 72 33 66 11 < .001
Karl Christoph Klauer 589 72 32 70 11 .01
Kathleen D. Vohs 908 69 32 53 11 < .001
Russell Spears 1968 76 32 55 11 < .001
Ap Dijksterhuis 672 71 32 53 11 < .001
Arthur Aron 297 67 31 59 12 < .001
Sander L. Koole 722 69 31 55 12 .01
John A. Bargh 556 69 31 54 12 < .001
Ara Norenzayan 198 75 30 59 12 .005
Mark Snyder 546 73 30 64 12 .01
Joshua Aronson 176 79 30 48 12 < .001
Wendy Wood 487 76 30 60 12 .001
Roger Giner-Sorolla 388 77 30 73 12 .001
Joel Cooper 265 77 29 44 13 < .001
Klaus R. Scherer 465 84 29 82 13 .01
Michael Harris Bond 324 75 29 85 13 < .001
Yoav Bar-Anan 481 74 29 79 13 < .001
Roy F. Baumeister 2133 71 29 53 13 .01
Adam D. Galinsky 1803 73 28 50 13 .01
Galen V. Bodenhausen 580 74 28 63 14 < .001
Gordon B. Moskowitz 353 73 28 59 13 < .001
Grainne M. Fitzsimons 559 69 28 49 13 .01
Shelly L. Gable 321 64 28 48 13 < .001
Ronald S. Friedman 191 78 28 46 14 .005
Richard J. Davidson 410 65 27 52 14 .01
Kennon M. Sheldon 636 78 27 65 14 .001
Jeff Greenberg 1387 79 27 57 14 .005
Sonja Lyubomirsky 436 76 27 61 14 < .001
Lauren J. Human 238 40 27 43 14 .01
Eliot R. Smith 433 79 26 68 15 < .001
Eli J. Finkel 1237 65 26 56 15 < .001
John T. Jost 687 75 26 64 15 .001
Tom Pyszczynski 984 71 26 58 15 < .001
Jonathan Haidt 353 76 25 75 16 < .001
Elizabeth W. Dunn 341 73 25 60 16 < .001
Felicia Pratto 330 75 25 76 16 < .001
Phillip R. Shaver 508 82 25 73 16 .001
Brent W. Roberts 156 76 25 66 16 < .001
Roland Neumann 206 78 25 62 16 < .001
David A. Pizarro 205 73 24 70 17 .005
Amy J. C. Cuddy 156 82 24 77 17 < .001
Tanya L. Chartrand 416 66 24 39 17 .001
Joseph P. Forgas 907 85 24 60 17 < .001
Paul Bloom 244 83 24 75 17 .01
Mark P. Zanna 663 71 24 49 17 .001
Yoel Inbar 260 69 23 73 18 < .001
Klaus Rothermund 665 76 23 77 18 .01
Peter M. Gollwitzer 1086 68 23 57 18 < .001
Robert S. Wyer 837 82 22 62 19 .001
Laurie A. Rudman 445 78 22 69 18 .01
Michael Ross 1100 73 22 64 18 < .001
Dieter Frey 1493 69 22 58 18 .001
Gabriele Oettingen 727 63 22 48 18 .01
Ed Diener 418 74 22 74 19 .01
Gerald L. Clore 443 72 21 49 20 < .001
Roland Deutsch 331 80 21 76 19 < .001
Andrew J. Elliot 843 83 21 64 19 .01
Wendy Berry Mendes 929 69 21 45 19 < .001
Eddie Harmon-Jones 806 69 21 69 20 .005
Sandra L. Murray 560 67 21 64 19 .001
Robert B. Cialdini 351 75 20 60 21 < .001
Frank D. Fincham 631 74 20 58 21 < .001
James K. McNulty 907 61 20 66 21 < .001
Toni Schmader 530 66 20 61 21 < .001
Benoit Monin 578 71 20 57 21 .001
Wilhelm Hofmann 603 66 20 68 21 .001
Gun R. Semin 163 71 19 70 22 < .001
Boris Egloff 234 83 19 61 22 .01
Marilynn B. Brewer 345 75 19 60 22 .01
Thomas Mussweiler 601 70 19 46 22 < .001
Michael W. Kraus 486 72 18 52 25 < .001
E. Tory Higgins 1797 71 18 54 24 < .001
David Dunning 689 76 18 69 25 .005
Brandon J. Schmeichel 635 68 18 49 24 .001
Ziva Kunda 216 68 17 61 26 .01
Charles S. Carver 170 80 17 66 25 .01
Steven W. Gangestad 246 56 17 46 25 .01
Simone Schnall 262 61 17 31 25 .01
Jeffrey W Sherman 512 73 17 68 25 < .001
Dolores Albarracin 423 68 16 58 27 .01
Laura A. King 362 76 16 69 27 < .001
Nira Liberman 1129 77 16 67 29 < .001
Lee Ross 359 81 16 65 29 .01
Brad J. Bushman 776 76 16 60 27 .01
Carey K. Morewedge 577 76 16 66 29 < .001
Travis Proulx 171 65 16 64 28 .001
Arie W. Kruglanski 1156 79 16 57 28 .005
Paul W. Eastwick 470 70 15 63 29 < .001
Daniel T. Gilbert 655 67 15 64 29 .001
Steven J. Spencer 554 68 15 40 31 < .001
Nathaniel M Lambert 408 70 15 58 30 < .001
Timothy D. Wilson 669 67 15 64 30 .005
Leandre R. Fabrigar 502 69 14 64 32 .01
Yaacov Trope 1261 75 14 60 32 < .001
Shelley E. Taylor 394 76 14 59 33 .01
William von Hippel 374 66 14 47 34 .005
Dov Cohen 619 71 13 49 34 .001
Mark Muraven 465 64 13 53 34 .001
Oscar Ybarra 288 70 12 58 38 .001
Michael A. Olson 285 70 12 61 37 .005
Gregory M. Walton 437 70 12 41 38 .01
Daniel M. Oppenheimer 167 78 12 60 37 .001
Hans Ijzerman 213 57 10 52 46 < .001
Shelly Chaiken 347 76 10 52 45 < .001

2016 Replicability Rankings of 103 Psychology Journals

Update: October 24, 2017.
The preliminary 2017 rankings are now available. They provide information for the years 2010-2017, updated analyses, and a correction in the estimates due to a computational error that lowered estimates by about 10 percentage points, on average.  Please check the newer rankings for the most reliable information.

—————————————————————————————————————————————–


I post the rankings on top.  Detailed information and statistical analysis are provided below the table.  You can click on the journal title to see Powergraphs for each year.

Rank Journal Change 2016 2015 2014 2013 2012 2011 2010 Mean
1 Social Indicators Research 10 90 70 65 75 65 72 73 73
2 Psychology of Music -13 81 59 67 61 69 85 84 72
3 Journal of Memory and Language 11 79 76 65 71 64 71 66 70
4 British Journal of Developmental Psychology -9 77 52 61 54 82 74 69 67
5 Journal of Occupational and Organizational Psychology 13 77 59 69 58 61 65 56 64
6 Journal of Comparative Psychology 13 76 71 77 74 68 61 66 70
7 Cognitive Psychology 7 75 73 72 69 66 74 66 71
8 Epilepsy & Behavior 5 75 72 79 70 68 76 69 73
9 Evolution & Human Behavior 16 75 57 73 55 38 57 62 60
10 International Journal of Intercultural Relations 0 75 43 70 75 62 67 62 65
11 Pain 5 75 70 75 67 64 65 74 70
12 Psychological Medicine 4 75 57 66 70 58 72 61 66
13 Annals of Behavioral Medicine 10 74 50 63 62 62 62 51 61
14 Developmental Psychology 17 74 72 73 67 61 63 58 67
15 Judgment and Decision Making -3 74 59 68 56 72 66 73 67
16 Psychology and Aging 6 74 66 78 65 74 66 66 70
17 Aggressive Behavior 16 73 70 66 49 60 67 52 62
18 Journal of Gerontology-Series B 3 73 60 65 65 55 79 59 65
19 Journal of Youth and Adolescence 13 73 66 82 67 61 57 66 67
20 Memory 5 73 56 79 70 65 64 64 67
21 Sex Roles 6 73 67 59 64 72 68 58 66
22 Journal of Experimental Psychology – Learning, Memory & Cognition 4 72 74 76 71 71 67 72 72
23 Journal of Social and Personal Relationships -6 72 51 57 55 60 60 75 61
24 Psychonomic Bulletin & Review 8 72 79 62 78 66 62 69 70
25 European Journal of Social Psychology 5 71 61 63 58 50 62 67 62
26 Journal of Applied Social Psychology 4 71 58 69 59 73 67 58 65
27 Journal of Experimental Psychology – Human Perception and Performance -4 71 68 72 69 70 78 72 71
28 Journal of Research in Personality 9 71 75 47 65 51 63 63 62
29 Journal of Child and Family Studies 0 70 60 63 60 56 64 69 63
30 Journal of Cognition and Development 5 70 53 62 54 50 61 61 59
31 Journal of Happiness Studies -9 70 64 66 77 60 74 80 70
32 Political Psychology 4 70 55 64 66 71 35 75 62
33 Cognition 2 69 68 70 71 67 68 67 69
34 Depression & Anxiety -6 69 57 66 71 77 77 61 68
35 European Journal of Personality 2 69 61 75 65 57 54 77 65
36 Journal of Applied Psychology 6 69 58 71 55 64 59 62 63
37 Journal of Cross-Cultural Psychology -4 69 74 69 76 62 73 79 72
38 Journal of Psychopathology and Behavioral Assessment -13 69 67 63 77 74 77 79 72
39 JPSP-Interpersonal Relationships and Group Processes 15 69 64 56 52 54 59 50 58
40 Social Psychology 3 69 70 66 61 64 72 64 67
41 Archives of Sexual Behavior -2 68 70 78 73 69 71 74 72
42 Journal of Affective Disorders 0 68 64 54 66 70 60 65 64
43 Journal of Experimental Child Psychology 2 68 71 70 65 66 66 70 68
44 Journal of Educational Psychology -11 67 61 66 69 73 69 76 69
45 Journal of Experimental Social Psychology 13 67 56 60 52 50 54 52 56
46 Memory and Cognition -3 67 72 69 68 75 66 73 70
47 Personality and Individual Differences 8 67 68 67 68 63 64 59 65
48 Psychophysiology -1 67 66 65 65 66 63 70 66
49 Cognitive Development 6 66 78 60 65 69 61 65 66
50 Frontiers in Psychology -8 66 65 67 63 65 60 83 67
51 Journal of Autism and Developmental Disorders 0 66 65 58 63 56 61 70 63
52 Journal of Experimental Psychology – General 5 66 69 67 72 63 68 61 67
53 Law and Human Behavior 1 66 69 53 75 67 73 57 66
54 Personal Relationships 19 66 59 63 67 66 41 48 59
55 Early Human Development 0 65 52 69 71 68 49 68 63
56 Attention, Perception and Psychophysics -1 64 69 70 71 72 68 66 69
57 Consciousness and Cognition -3 64 65 67 57 64 67 68 65
58 Journal of Vocational Behavior 5 64 78 66 78 71 74 57 70
59 The Journal of Positive Psychology 14 64 65 79 51 49 54 59 60
60 Behaviour Research and Therapy 7 63 73 73 66 69 63 60 67
61 Child Development 0 63 66 62 65 62 59 68 64
62 Emotion -1 63 61 56 66 62 57 65 61
63 JPSP-Personality Processes and Individual Differences 1 63 56 56 59 68 66 51 60
64 Schizophrenia Research 1 63 65 68 64 61 70 60 64
65 Self and Identity -4 63 52 61 62 50 55 71 59
66 Acta Psychologica -6 63 66 69 69 67 68 72 68
67 Behavioral Brain Research -3 62 67 61 62 64 65 67 64
68 Child Psychiatry and Human Development 5 62 72 83 73 50 82 58 69
69 Journal of Child Psychology and Psychiatry and Allied Disciplines 10 62 62 56 66 64 45 55 59
70 Journal of Consulting and Clinical Psychology 0 62 56 50 54 59 58 57 57
71 Journal of Counseling Psychology -3 62 70 60 74 72 56 72 67
72 Behavioral Neuroscience 1 61 66 63 62 65 58 64 63
73 Developmental Science -5 61 62 60 62 66 65 65 63
74 Journal of Experimental Psychology – Applied -4 61 61 65 53 69 57 69 62
75 Journal of Social Psychology -11 61 56 55 55 74 70 63 62
76 Social Psychology and Personality Science -5 61 42 56 59 59 65 53 56
77 Cognitive Therapy and Research 0 60 68 54 67 70 62 58 63
78 Hormones & Behavior -1 60 55 55 54 55 60 58 57
79 Motivation and Emotion 1 60 60 57 57 51 73 52 59
80 Organizational Behavior and Human Decision Processes 3 60 63 65 61 68 67 51 62
81 Psychoneuroendocrinology 5 60 58 58 56 53 59 53 57
82 Social Development -10 60 50 66 62 65 79 57 63
83 Appetite -10 59 57 57 65 64 66 67 62
84 Biological Psychology -6 59 60 55 57 57 65 64 60
85 Journal of Personality Psychology 17 59 59 60 62 69 37 45 56
86 Psychological Science 6 59 63 60 63 59 55 56 59
87 Asian Journal of Social Psychology 0 58 76 67 56 71 64 64 65
88 Behavior Therapy 0 58 63 66 69 66 52 65 63
89 British Journal of Social Psychology 0 58 57 44 59 51 59 55 55
90 Social Influence 18 58 72 56 52 33 59 46 54
91 Developmental Psychobiology -9 57 54 61 60 70 64 62 61
92 Journal of Research on Adolescence 2 57 59 61 82 71 75 40 64
93 Journal of Abnormal Psychology -5 56 52 57 58 55 66 55 57
94 Social Cognition -2 56 54 52 54 62 69 46 56
95 Personality and Social Psychology Bulletin 2 55 57 58 55 53 56 54 55
96 Cognition and Emotion -14 54 66 61 62 76 69 69 65
97 Health Psychology -4 51 67 56 72 54 69 56 61
98 Journal of Clinical Child and Adolescent Psychology 1 51 66 61 74 64 58 54 61
99 Journal of Family Psychology -7 50 52 63 61 57 64 55 57
100 Group Processes & Intergroup Relations -5 49 53 68 64 54 62 55 58
101 Infancy -8 47 44 60 55 48 63 51 53
102 Journal of Consumer Psychology -5 46 57 55 51 53 48 61 53
103 JPSP-Attitudes & Social Cognition -3 45 69 62 39 54 54 62 55

Notes.
1. Change scores are the unstandardized regression weights with replicability estimates as outcome variable and year as predictor variable.  Year was coded from 0 for 2010 to 1 for 2016 so that the regression coefficient reflects change over the full 7-year period. This method is preferable to a simple difference score because estimates in individual years are variable and are likely to overestimate change.
2. Rich E. Lucas, Editor of JRP, noted that many articles in JRP do not report t or F values in the text and that the replicability estimates based on these statistics may not be representative of the bulk of results reported in this journal.  Hand-coding of articles is required to address this problem and the ranking of JRP, and other journals, should be interpreted with caution (see further discussion of these issues below).
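The change score described in Note 1 can be sketched as an ordinary least-squares slope with year rescaled to run from 0 to 1. The function name is hypothetical, and the example values assume that the seven yearly columns for Social Indicators Research are read as 2016 back to 2010 (the table is sorted by the 2016 estimate), with the final column being the seven-year mean.

```python
def change_score(estimates_by_year: list[float]) -> float:
    """OLS slope of the estimates on year, with year coded 0 (first) to 1 (last)."""
    n = len(estimates_by_year)
    if n < 2:
        raise ValueError("need at least two yearly estimates")
    years = [i / (n - 1) for i in range(n)]
    mean_x = sum(years) / n
    mean_y = sum(estimates_by_year) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, estimates_by_year))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx


# Assumed 2010..2016 estimates for Social Indicators Research, read from the table.
social_indicators = [73, 72, 65, 75, 65, 70, 90]

slope = change_score(social_indicators)
simple_difference = social_indicators[-1] - social_indicators[0]
print(round(slope, 1), simple_difference)
```

With these values the slope is about 10, which matches the change score listed for this journal, whereas the simple 2016 minus 2010 difference would be 17. That gap illustrates why the regression weight is the more conservative measure of change.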

Introduction

I define replicability as the probability of obtaining a significant result in an exact replication of a study that produced a significant result.  In the past five years, it has become increasingly clear that psychology suffers from a replication crisis. Even results that are replicated internally by the same author multiple times fail to replicate in independent replication attempts (Bem, 2011).  The key reason for the replication crisis is selective publishing of significant results (publication bias). While journals report over 95% significant results (Sterling, 1959; Sterling et al., 1995), a 2015 article estimated that less than 50% of these results can be replicated  (OSC, 2015).

The OSC reproducibility project made an important contribution by demonstrating that published results in psychology have low replicability.  However, the reliance on actual replication studies has a number of limitations.  First, actual replication studies are expensive or impossible (e.g., a longitudinal study spanning 20 years).  Second, studies selected for replication may not be representative because the replication team lacks expertise to replicate some studies. Finally, replication studies take time and replicability of recent studies may not be known for several years. This makes it difficult to rely on actual replication studies to rank journals and to track replicability over time.

Schimmack and Brunner (2016) developed a statistical method (z-curve) that makes it possible to estimate average replicability for a set of published results based on the original results in published articles.  This statistical approach to the estimation of replicability has several advantages over the use of actual replication studies.  Replicability can be assessed in real time, it can be estimated for all published results, and it can be used for expensive studies that are impossible to reproduce.  Finally, actual replication studies can themselves be criticized (Gilbert, King, Pettigrew, & Wilson, 2016). Estimates of replicability based on original studies do not have this problem because they are based on published results in original articles.

Z-curve has been validated with simulation studies and can be used when replicability varies across studies and when there is selection for significance, and is superior to similar statistical methods that correct for publication bias (Brunner & Schimmack, 2016).  I use this method to estimate the average replicability of significant results published in 103 psychology journals. Separate estimates were obtained for the years from 2010, one year before the start of the replication crisis, to 2016 to examine whether replicability increased in response to discussions about replicability.  The OSC estimate of replicability was based on articles published in 2008 and it was limited to three journals.  I posted replicability estimates based on z-curve for the year 2015 (2015 replicability rankings).  There was no evidence that replicability had increased during this time period.

The main empirical question was whether the 2016 rankings show some improvement in replicability and whether some journals or disciplines have responded more strongly to the replication crisis than others.

A second empirical question was whether replicability varies across disciplines.  The OSC project provided first evidence that traditional cognitive psychology is more replicable than social psychology.  Replicability estimates with z-curve confirmed this finding.  In the 2015 rankings, The Journal of Experimental Psychology: Learning, Memory and Cognition ranked 25 with a replicability estimate of 74%, whereas the two social psychology sections of the Journal of Personality and Social Psychology ranked 73 and 99 (68% and 60% replicability estimates).  For this post, I conducted more extensive analyses of disciplines.

Journals

The 103 journals that are included in these rankings were mainly chosen based on impact factors.  The list also includes diverse areas of psychology, including cognitive, developmental, social, personality, clinical, biological, and applied psychology.  The 2015 list included some new journals that started after 2010.  These journals were excluded from the 2016 rankings to avoid missing values in statistical analyses of time trends.  A few journals were added to the list and the results may change when more journals are added to the list.

The journals were classified into 9 categories: social (24), cognitive (12), development (15), clinical/medical (19), biological (8), personality (5), and applied (IO, education) (8).  Two journals were classified as general (Psychological Science, Frontiers in Psychology). The last category included topical, interdisciplinary journals (emotion, positive psychology).

Data 

All PDF versions of published articles were downloaded and converted into text files. The 2015 rankings were based on conversions with the free program pdf2text pilot.  The 2016 rankings used a superior conversion program, pdfzilla.  Text files were searched for reports of statistical results using my own R-code (z-extraction). Only F-tests, t-tests, and z-tests were used for the rankings. t-values that were reported without df were treated as z-values, which leads to a slight inflation in replicability estimates. However, the bulk of test-statistics were F-values and t-values with degrees of freedom.  A comparison of the 2015 rankings using the old method and the new method shows that extraction methods have an influence on replicability estimates (r = .56). One reason for the low correlation is that replicability estimates have a relatively small range (50-80%) and low retest correlations. Thus, even small changes can have notable effects on rankings. For this reason, time trends in replicability have to be examined at the aggregate level of journals or over longer time intervals. The change score of a single journal from 2015 to 2016 is not a reliable measure of improvement.
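To give a rough idea of what searching text files for test statistics involves, here is a minimal Python sketch. It is not the z-extraction R code mentioned above. The regular expressions, the function name, and the sample sentence are all hypothetical and cover only simple APA-style reports. Following the rule described in the text, t-values reported without degrees of freedom are labeled so that they can be treated as z-values.

```python
import re

# Hypothetical patterns for simple APA-style reports (illustration only).
T_TEST = re.compile(r"\bt\s*\(\s*(\d+(?:\.\d+)?)\s*\)\s*=\s*(-?\d*\.?\d+)")
F_TEST = re.compile(r"\bF\s*\(\s*(\d+)\s*,\s*(\d+(?:\.\d+)?)\s*\)\s*=\s*(\d*\.?\d+)")
Z_TEST = re.compile(r"\bz\s*=\s*(-?\d*\.?\d+)")
T_NO_DF = re.compile(r"\bt\s*=\s*(-?\d*\.?\d+)")


def extract_tests(text: str) -> list[dict]:
    """Return the test statistics found in text, grouped by test type."""
    found = []
    for m in T_TEST.finditer(text):
        found.append({"test": "t", "df": (float(m.group(1)),), "value": float(m.group(2))})
    for m in F_TEST.finditer(text):
        df = (float(m.group(1)), float(m.group(2)))
        found.append({"test": "F", "df": df, "value": float(m.group(3))})
    for m in Z_TEST.finditer(text):
        found.append({"test": "z", "df": (), "value": float(m.group(1))})
    for m in T_NO_DF.finditer(text):
        # No degrees of freedom reported: treat the value as a z-value.
        found.append({"test": "z (t without df)", "df": (), "value": float(m.group(1))})
    return found


sample = "The effect was significant, t(24) = 2.50, p = .02; F(1, 38) = 4.12; z = 2.10; t = 2.30."
results = extract_tests(sample)
for row in results:
    print(row)
```

Real articles report statistics in many more formats (line breaks inside a report, italics lost in conversion, chi-square tests, and so on), which is one reason why the choice of PDF conversion program can change the estimates.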

Data Analysis

The data for each year were analyzed using z-curve (Schimmack & Brunner, 2016).  The results of individual analyses are presented in Powergraphs. Powergraphs for each journal and year are provided as links to the journal names in the table with the rankings.  Powergraphs convert test statistics into absolute z-scores as a common metric for the strength of evidence against the null-hypothesis.  Absolute z-scores greater than 1.96 (p < .05, two-tailed) are considered statistically significant. The distribution of z-scores greater than 1.96 is used to estimate the average true power (not observed power) of the set of significant studies. This estimate is an estimate of replicability for a set of exact replication studies because average power determines the percentage of statistically significant results.  Powergraphs provide additional information about replicability for different ranges of z-scores (z-values between 2 and 2.5 are less replicable than those between 4 and 4.5).  However, for the replicability rankings only the replicability estimate is used.
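Two of the building blocks in this paragraph are easy to illustrate with the Python standard library: converting a two-tailed p-value into an absolute z-score, and computing the power of an exact replication for a given true (not observed) mean z. This is only a sketch with hypothetical function names. It does not implement z-curve, which estimates average true power by fitting a model to the distribution of significant z-scores.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # about 1.96 for two-tailed alpha = .05


def p_to_abs_z(p_two_tailed: float) -> float:
    """Absolute z-score that corresponds to a two-tailed p-value."""
    return STD_NORMAL.inv_cdf(1 - p_two_tailed / 2)


def replication_power(true_z: float, z_crit: float = Z_CRIT) -> float:
    """Probability that an exact replication is significant (in either
    direction) when the test statistic is normal with mean true_z and SD 1."""
    upper = 1 - STD_NORMAL.cdf(z_crit - true_z)
    lower = STD_NORMAL.cdf(-z_crit - true_z)
    return upper + lower


print(round(p_to_abs_z(0.05), 2))           # z-score at the significance threshold
print(round(replication_power(Z_CRIT), 2))  # power when the true z sits at the threshold
print(round(replication_power(2.8), 2))     # power for a true z of 2.8
```

A study whose true mean z sits exactly at the significance threshold has only about a 50% chance of producing a significant result in an exact replication, and a true z of about 2.8 is needed for 80% power. Because selection for significance inflates observed z-scores, plugging observed values into this function would overestimate replicability, which is the problem z-curve is designed to correct.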

Results

Table 1 shows the replicability estimates sorted by replicability in 2016.

The data were analyzed with a growth model to examine time trends and variability across journals and disciplines using MPLUS7.4.  I compared three models. Model 1 assumed no mean level changes and variability across journals. Model 2 assumed a linear increase. Model 3 assumed no change from 2010 to 2015 and allowed for an increase in 2016.

Model 1 had acceptable fit (RMSEA = .043, BIC = 5004). Model 2 increased fit (RMSEA = 0.029, BIC = 5005), but BIC slightly favored the more parsimonious Model 1. Model 3 had the best fit (RMSEA = .000, BIC = 5001).  These results reproduce the results of the 2015 analysis that there was no improvement from 2010 to 2015, but there is some evidence that replicability increased in 2016.  Adding a variance component to slope in Model 3 produced an unidentified model. Subsequent analyses show that this is due to insufficient power to detect variation across journals in changes over time.

The standardized loadings of individual years on the latent intercept factor ranged from .49 to .58.  This shows high variability in replicability estimates from year to year. Most of the rank changes can be attributed to random factors.  A better way to compare journals is to average across years.  A moving average of five years will provide reliable information and allow for improvement over time.  The reliability of the 5-year average for the years 2012 to 2016 is 68%.
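As a plausibility check on the reported reliability, one can square the loadings to get a rough single-year reliability and then apply the Spearman-Brown formula for an average of five years. This is an assumption about how the numbers relate, not a description of how the 68% was computed in the growth model, and the function name is made up for illustration.

```python
def spearman_brown(single_year_reliability: float, n_years: int) -> float:
    """Reliability of an average of n_years equally reliable yearly estimates."""
    r = single_year_reliability
    return n_years * r / (1 + (n_years - 1) * r)


loadings = (0.49, 0.58)  # range of standardized loadings reported in the text
single_year = [loading ** 2 for loading in loadings]   # share of stable variance
five_year = [spearman_brown(r, 5) for r in single_year]

print([round(r, 2) for r in single_year])
print([round(r, 2) for r in five_year])
```

Under these assumptions the single-year reliabilities fall roughly between .24 and .34, and the reliability of a 5-year average falls roughly between .61 and .72, a range that contains the reported 68%.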

Figure 1 shows the annual averages with 95%CI relative to the average over the full 7-year period.

rep-by-year

A paired t-test confirmed that average replicability in 2016 was significantly higher (M = 65, SD = 8) than in the previous years (M = 63, SD = 8), t(101) = 2.95, p = .004.  This is the first evidence that psychological scientists are responding to the replicability crisis by publishing slightly more replicable results.  Of course, this positive result has to be tempered by the small effect size.  But if this trend continues or even increases, replicability could reach 80% in 10 years.

The next analysis examined changes in replicability at the level of individual journals. Replicability estimates were regressed on a dummy variable that contrasted 2016 with the previous years.  This analysis produced only 7 significant increases with p < .05 (one-tailed), which is only 2 more significant results than would be expected by chance alone. Thus, the analysis failed to identify particular journals that contribute to the improvement in the average.  Figure 2 compares the observed distribution of t-values to the predicted distribution based on the null-hypothesis (no change).
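The chance expectation in this paragraph follows from simple arithmetic: with 103 journals and a one-tailed alpha of .05, about 5 significant increases are expected even if no journal changed. The sketch below also computes how likely 7 or more significant results would be under that null-hypothesis. It assumes 103 independent tests, which is a simplification, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_journals = 103   # journals in the rankings
alpha = 0.05       # one-tailed significance criterion

expected_by_chance = n_journals * alpha
p_seven_or_more = prob_at_least(7, n_journals, alpha)

print(round(expected_by_chance, 2), round(p_seven_or_more, 2))
```

Under these assumptions, 7 or more significant results occur by chance in roughly a quarter of cases, so the observed count gives no reason to single out particular journals.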

t-value Distribution.png

The blue line shows the observed density distribution, which is slightly moved to the right, but there is no set of journals with notably larger t-values.  A more sustained and larger increase in replicability is needed to detect variability in change scores.

The next analyses examine stable differences between disciplines.  The first analysis compared cognitive journals to social journals.  No statistical tests are needed to see that cognitive journals publish more replicable results than social journals. This finding confirms the results with actual replications of studies published in 2008 (OSC, 2015). The Figure suggests that the improvement in 2016 is driven more by social journals, but only 2017 data can tell whether there is a real improvement in social psychology.

replicability.cog.vs.soc.png

The next Figure shows the results for 5 personality journals.  The large confidence intervals show that there is considerable variability among personality journals. The Figure shows the averages for cognitive and social psychology as horizontal lines. The average for personality is only slightly above the average for social and like social, personality shows an upward trend.  In conclusion, personality and social psychology look very similar.  This may be due to considerable overlap between the two disciplines, which is also reflected in shared journals.  Larger differences may be visible for specialized social journals that focus on experimental social psychology.

replicability-personality

The results for developmental journals show no clear time trend and the average is just about in the middle between cognitive and social psychology.  The wide confidence intervals suggest that there is considerable variability among developmental journals. Table 1 shows Developmental Psychology ranks 14/103 and Infancy ranks 101/103. The low rank for Infancy may be due to the great difficulty of measuring infant behavior.

replicability-developmental

The clinical/medical journals cover a wide range of topics from health psychology to special areas of psychiatry.  There has been some concern about replicability in medical research (Ioannidis, 2005). The results for clinical are similar to those for developmental journals. Replicability is lower than for cognitive psychology and higher than for social psychology.  This may seem surprising because patient populations and samples tend to be smaller. However, a randomized controlled intervention study uses pre-post designs to boost power, whereas social and personality psychologists use comparisons across individuals, which requires large samples to reduce sampling error.

[Figure: replicability of clinical/medical journals]

The set of biological journals is very heterogeneous and small. It includes neuroscience and classic peripheral physiology. Despite wide confidence intervals, replicability for biological journals is significantly lower than replicability for cognitive psychology. There is no notable time trend. The average is slightly above the average for social journals.

[Figure: replicability of biological journals]

The last category consists of applied journals. One journal focuses on education. The other journals focus on industrial and organizational psychology. Confidence intervals are wide, but replicability is generally lower than for cognitive psychology. There is no notable time trend for this set of journals.

[Figure: replicability of applied journals]

Given the stability of replicability, I averaged replicability estimates across years. The last figure shows a comparison of disciplines based on these averages.  The figure shows that social psychology is significantly below average and cognitive psychology is significantly above average with the other disciplines falling in the middle.  All averages are significantly above 50% and below 80%.

Discussion

The most exciting finding is that replicability appears to have increased in 2016. This increase is remarkable because averages in the years before consistently tracked the average of 63. The increase by 2 percentage points in 2016 is not large, but it may represent a first response to the replication crisis.

The increase is particularly remarkable because statisticians have been sounding the alarm bells about low power and publication bias for over 50 years (Cohen, 1962; Sterling, 1959), but these warnings have had no effect on research practices. Sedlmeier and Gigerenzer (1989) noted that studies of statistical power had no effect on the statistical power of studies. The present results provide the first empirical evidence that psychologists are finally starting to change their research practices.

However, the results also suggest that most journals continue to publish articles with low power. The replication crisis has affected social psychology more than other disciplines with fierce debates in journals and on social media (Schimmack, 2016). On the one hand, the comparison of disciplines supports the impression that social psychology has a bigger replicability problem than other disciplines. On the other hand, the differences between disciplines are small. With the exception of cognitive psychology, other disciplines are not a lot more replicable than social psychology. The main reason for the focus on social psychology is probably that these studies are easier to replicate and that there have been more replication studies in social psychology in recent years. The replicability rankings predict that other disciplines would also see a large number of replication failures if they subjected important findings to actual replication attempts. Only empirical data will tell.

Limitations

The main limitation of replicability rankings is that the automatic extraction method does not distinguish between theoretically important hypothesis tests and other statistical tests. Although this is a problem for the interpretation of the absolute estimates, it is less important for the comparison over time. Any changes in research practices that reduce sampling error (e.g., larger samples, more reliable measures) will not only strengthen the evidence for focal hypothesis tests, but also increase the strength of evidence for non-focal hypothesis tests.

Schimmack and Brunner (2016) compared replicability estimates with actual success rates in the OSC (2015) replication studies. They found that the statistical method overestimates replicability by about 20%. Thus, the absolute estimates can be interpreted as very optimistic estimates. There are several reasons for this overestimation. One reason is that the estimation method assumes that all results with a p-value less than .05 are equally likely to be published. If there are further selection mechanisms that favor smaller p-values, the method overestimates replicability. For example, sometimes researchers correct for multiple comparisons and need to meet a more stringent significance criterion. Only careful hand-coding of research articles can provide more accurate estimates of replicability. Schimmack and Brunner (2016) hand-coded the articles that were included in the OSC (2015) article and still found that the method overestimated replicability. Thus, the absolute values need to be interpreted with great caution and success rates of actual replication studies are expected to be at least 10% lower than these estimates.

Implications

Power and replicability have been ignored for over 50 years. A likely reason is that replicability is difficult to measure. A statistical method for the estimation of replicability changes this. Replicability estimates of journals make it possible for editors to compete with other journals in the replicability rankings. Flashy journals with high impact factors may publish eye-catching results, but if these journals have a reputation for publishing results that do not replicate, they are not very likely to have a big impact. Science is built on trust, and trust has to be earned and can be easily lost. Eventually, journals that publish replicable results may also increase their impact because more researchers are going to build on replicable results published in these journals. In this way, replicability rankings can provide a much needed correction to the current incentive structure in science that rewards publishing as many articles as possible without any concerns about the replicability of these results. This reward structure is undermining science. It is time to change it. It is no longer sufficient to publish a significant result if this result cannot be replicated in other labs.

Many scientists feel threatened by changes in the incentive structure and the negative consequences of replication failures for their reputation. However, researchers have control over their reputation.  First, researchers often carry out many conceptually related studies. In the past, it was acceptable to publish only the studies that worked (p < .05). This selection for significance by researchers is the key factor in the replication crisis. The researchers who are conducting the studies are fully aware that it was difficult to get a significant result, but the selective reporting of these successes produces inflated effect size estimates and an illusion of high replicability that inevitably lead to replication failures.  To avoid these embarrassing replication failures researchers need to report results of all studies or conduct fewer studies with high power.  The 2016 rankings suggest that some researchers have started to change, but we will have to wait until 2017 to see whether 2017 can replicate the positive trend in the 2016 rankings.

Replicability Ranking of Psychology Departments

Evaluations of individual researchers, departments, and universities are common and arguably necessary as science is becoming bigger. Existing rankings are based to a large extent on peer-evaluations. A university is ranked highly if peers at other universities perceive it to produce a steady stream of high-quality research. At present the most widely used objective measures rely on the quantity of research output and on the number of citations. These quantitative indicators of research quality are also heavily influenced by peers because peer-review controls what gets published, especially in journals with high rejection rates, and peers decide what research they cite in their own work. The social mechanisms that regulate peer-approval are unavoidable in a collective enterprise like science that does not have a simple objective measure of quality (e.g., customer satisfaction ratings, or accident rates of cars). Unfortunately, it is well known that social judgments are subject to many biases due to conformity pressure, self-serving biases, confirmation bias, motivated biases, etc. Therefore, it is desirable to complement peer-evaluations with objective indicators of research quality.

Some aspects of research quality are easier to measure than others. Replicability rankings focus on one aspect of research quality that can be measured objectively, namely the replicability of a published significant result. In many scientific disciplines such as psychology, a successful study reports a statistically significant result. A statistically significant result is used to minimize the risk of publishing evidence for an effect that does not exist (or even goes in the opposite direction). For example, a psychological study that shows effectiveness of a treatment for depression would have to show that the effect in the study reveals a real effect that can be observed in other studies and in real patients if the treatment is used for the treatment of depression.

In a science that produces thousands of results a year, it is inevitable that some of the published results are fluke findings (even Toyotas break down sometimes). To minimize the risk of false results entering the literature, psychology, like many other sciences, adopted a 5% error rate. By using 5% as the criterion, psychologists ensured that no more than 5% of results are fluke findings. With thousands of results published in each year, this still means that more than 50 false results enter the literature each year. However, this is acceptable because a single study does not have immediate consequences. Only if these results are replicated in other studies do findings become the foundation of theories and influence practical decisions in therapy or in other applications of psychological findings (at work, in schools, or in policy). Thus, to outside observers it may appear safe to trust published results in psychology and to report about these findings in newspaper articles, popular books, or textbooks.

Unfortunately, it would be a mistake to interpret a significant result in a psychology journal as evidence that the result is probably true. The reason is that the published success rate in journals has nothing to do with the actual success rate in psychological laboratories. All insiders know that it is common practice to report only results that support a researcher’s theory. While outsiders may think of scientists as neutral observers (judges), insiders play the game of lobbyists, advertisers, and self-promoters. The game is to advance one’s theory, publish more than others, get more citations than others, and win more grant money than others. Honest reporting of failed studies does not advance this agenda. As a result, the fact that psychological studies report nearly exclusively success stories (Sterling, 1959; Sterling et al., 1995) tells outside observers nothing about the replicability of a published finding and the true rate of fluke findings could be 100%.

This problem has been known for over 50 years (Cohen, 1962; Sterling, 1959). So it would be wrong to call the selective reporting of successful studies an acute crisis. However, what changed is that some psychologists have started to criticize the widely accepted practice of selective reporting of successful studies (Asendorpf et al., 2012; Francis, 2012; Simonsohn et al., 2011; Schimmack, 2012; Wagenmakers et al., 2011). Over the past five years, psychologists, particularly social psychologists, have been engaged in heated arguments over the so-called “replication crisis.”

One group argues that selective publishing of successful studies occurred, but without real consequences on the trustworthiness of published results. The other group argues that published results cannot be trusted unless they have been successfully replicated. The problem is that neither group has objective information about the replicability of published results.  That is, there is no reliable estimate of the percentage of studies that would produce a significant result again, if a representative sample of significant results published in psychology journals were replicated.

Evidently, it is not possible to conduct exact replication studies of all studies that have been published in the past 50 years. Fortunately, it is not necessary to conduct exact replication studies to obtain an objective estimate of replicability. The reason is that replicability of exact replication studies is a function of the statistical power of studies (Sterling et al., 1995). Without selective reporting of results, a 95% success rate is an estimate of the statistical power of the studies that achieved this success rate. Vice versa, a set of studies with average power of 50% is expected to produce a success rate of 50% (Sterling et al., 1995).
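The link between power and success rate can be illustrated with a short simulation. This is only a sketch under simplifying assumptions: every study is a two-tailed z-test with alpha = .05, all studies share the same true noncentrality, and nothing is selected for significance. The function names are hypothetical and are not part of any published method.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # two-tailed alpha = .05, about 1.96


def power_two_tailed(ncp: float) -> float:
    """Probability that a z-test with noncentrality ncp yields |z| > Z_CRIT."""
    return (1 - STD_NORMAL.cdf(Z_CRIT - ncp)) + STD_NORMAL.cdf(-Z_CRIT - ncp)


def simulated_success_rate(ncp: float, n_studies: int, seed: int = 1) -> float:
    """Share of significant results among simulated studies, without selection."""
    rng = random.Random(seed)
    hits = sum(abs(rng.gauss(ncp, 1.0)) > Z_CRIT for _ in range(n_studies))
    return hits / n_studies


# A true effect that sits exactly on the significance criterion has
# (almost exactly) 50% power.
ncp_half_power = Z_CRIT
print("analytic power:        ", round(power_two_tailed(ncp_half_power), 3))
print("simulated success rate:", round(simulated_success_rate(ncp_half_power, 20_000), 3))
```

With all results reported, the simulated success rate stays close to the analytic power. Success rates stop tracking power only once non-significant results are withheld.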

Although selection bias renders success rates uninformative, the actual statistical results provide valuable information that can be used to estimate the unbiased statistical power of published results. Although selection bias inflates effect sizes and power, Brunner and Schimmack (forthcoming) developed and validated a method that can correct for selection bias. This method makes it possible to estimate the replicability of published significant results on the basis of the original reported results. This statistical method was used to estimate the replicability of research published by psychology departments in the years from 2010 to 2015 (see Methodology for details).

The averages for the 2010-2012 period (M = 59) and the 2013-2015 period (M = 61) show only a small difference, indicating that psychologists have not changed their research practices in accordance with recommendations to improve replicability in 2011  (Simonsohn et al., 2011). For most of the departments the confidence intervals for the two periods overlap (see attached powergraphs). Thus, the more reliable average across all years is used for the rankings, but the information for the two time periods is presented as well.

There are no obvious predictors of variability across departments. Private universities are at the top (#1, #2, #8), the middle (#24, #26), and at the bottom (#44, #47). European universities can also be found at the top (#4, #5), middle (#25) and bottom (#46, #51). So are Canadian universities (#9, #15, #16, #18, #19, #50).

There is no consensus on an optimal level of replicability. Cohen recommended that researchers should plan studies with 80% power to detect real effects. If 50% of studies tested real effects with 80% power and the other 50% tested a null-hypothesis (no effect = 2.5% probability to replicate a false result again), the estimated power for significant results would be 78%. The effect on average power is so small because most of the false predictions produce a non-significant result. As a result, only a few studies with low replication probability dilute the average power estimate. Thus, a value greater than 70 can be considered broadly in accordance with Cohen’s recommendations.
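The 78% figure can be reproduced with a few lines of arithmetic. The sketch below assumes that a true null produces a significant result in the predicted direction with probability 2.5%, both in the original study and in the replication. This is one reading of the "2.5%" in the paragraph above, not a detail the text spells out, and the function name is hypothetical.

```python
def mean_power_of_significant(share_true: float,
                              power_true: float,
                              alpha_directional: float) -> float:
    """Average replication probability among significant results.

    share_true        -- proportion of studies that test a real effect
    power_true        -- power of the studies that test a real effect
    alpha_directional -- chance that a true null is significant in the
                         predicted direction (assumed 2.5% here)
    """
    # Proportion of all studies that end up significant, by type.
    sig_true = share_true * power_true
    sig_null = (1 - share_true) * alpha_directional
    # Weight each type's replication probability by its share
    # of the significant results.
    weighted = sig_true * power_true + sig_null * alpha_directional
    return weighted / (sig_true + sig_null)


print(round(mean_power_of_significant(0.5, 0.80, 0.025), 2))  # 0.78
```

Only 1.25% of all studies are significant false positives, compared with 40% that are significant true positives, which is why the false positives barely move the average.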

It is important to point out that the estimates are very optimistic estimates of the success rate in actual replications of theoretically important effects. For a representative set of 100 studies (OSC, Science, 2015), Brunner and Schimmack’s statistical approach predicted a success rate of 54%, but the success rate in actual replication studies was only 37%. One reason for this discrepancy could be that the statistical approach assumes that the replication studies are exact, but actual replications always differ in some ways from the original studies, and this uncontrollable variability in experimental conditions poses another challenge for replicability of psychological results. Until further validation research has been completed, the estimates can only be used as a rough estimate of replicability. However, the absolute accuracy of estimates is not relevant for the relative comparison of psychology departments.

And now, without further ado, the first objective rankings of 51 psychology departments based on the replicability of published significant results. More departments will be added to these rankings as the results become available.

Rank University 2010-2015 2010-2012 2013-2015
1 U Penn 72 69 75
2 Cornell U 70 67 72
3 Purdue U 69 69 69
4 Tilburg U 69 71 66
5 Humboldt U Berlin 67 68 66
6 Carnegie Mellon 67 67 67
7 Princeton U 66 65 67
8 York U 66 63 68
9 Brown U 66 71 60
10 U Geneva 66 71 60
11 Northwestern U 65 66 63
12 U Cambridge 65 66 63
13 U Washington 65 70 59
14 Carleton U 65 68 61
15 Queen’s U 63 57 69
16 U Texas – Austin 63 63 63
17 U Toronto 63 65 61
18 McGill U 63 72 54
19 U Virginia 63 61 64
20 U Queensland 63 66 59
21 Vanderbilt U 63 61 64
22 Michigan State U 62 57 67
23 Harvard U 62 64 60
24 U Amsterdam 62 63 60
25 Stanford U 62 65 58
26 UC Davis 62 57 66
27 UCLA 61 61 61
28 U Michigan 61 63 59
29 Ghent U 61 58 63
30 U Waterloo 61 65 56
31 U Kentucky 59 58 60
32 Penn State U 59 63 55
33 Radboud U 59 60 57
34 U Western Ontario 58 66 50
35 U North Carolina Chapel Hill 58 58 58
36 Boston University 58 66 50
37 U Mass Amherst 58 52 64
38 U British Columbia 57 57 57
39 The University of Hong Kong 57 57 57
40 Arizona State U 57 57 57
41 U Missouri 57 55 59
42 Florida State U 56 63 49
43 New York U 55 55 54
44 Dartmouth College 55 68 41
45 U Heidelberg 54 48 60
46 Yale U 54 54 54
47 Ohio State U 53 58 47
48 Wake Forest U 51 53 49
49 Dalhousie U 50 45 55
50 U Oslo 49 54 44
51 U Kansas 45 45 44

 

Dr. Ulrich Schimmack’s Blog about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion (Schimmack, 2017).

BLOGS BY YEAR:  2019, 2018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2020): Z-Curve.2.0 (with R-package) 

 

TOP TEN BLOGS


  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. Also includes a detailed analysis of time trends in replicability from 2010 to 2018.

2.  Introduction to Z-Curve with R-Code

This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.


3. An Introduction to the R-Index

 

The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.


4.  The Test of Insufficient Variance (TIVA)

 

The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.” The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.

Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.


8.  The Problem with Bayesian Null-Hypothesis Testing

 

Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.

A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published. Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.

In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.
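Two of the tools summarized in the list above, the R-Index (item 3) and TIVA (item 4), are simple enough to sketch in a few lines. The code below is an illustration, not the author's implementation. It assumes that inflation is the success rate minus median observed power, that observed power is computed by treating each observed z-score as the true effect, and that TIVA compares (k - 1) times the sample variance of the z-scores with a chi-square distribution on k - 1 degrees of freedom. The z-scores are made up for the example.

```python
import math
from statistics import NormalDist, median, variance

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # two-tailed alpha = .05


def observed_power(z: float) -> float:
    """Two-tailed power if the observed |z| were the true noncentrality."""
    z = abs(z)
    return (1 - STD_NORMAL.cdf(Z_CRIT - z)) + STD_NORMAL.cdf(-Z_CRIT - z)


def r_index(z_scores: list) -> float:
    """Median observed power minus inflation (success rate - median power)."""
    median_power = median([observed_power(z) for z in z_scores])
    success_rate = sum(abs(z) > Z_CRIT for z in z_scores) / len(z_scores)
    inflation = success_rate - median_power
    return median_power - inflation


def chi2_cdf(x: float, df: int) -> float:
    """Chi-square CDF via the power series of the lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= half_x / (a + n)
        total += term
        n += 1
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))


def tiva(z_scores: list) -> tuple:
    """Return (sample variance, left-tailed p) against an expected variance of 1."""
    k = len(z_scores)
    var = variance(z_scores)
    return var, chi2_cdf(var * (k - 1), k - 1)


# Hypothetical z-scores: all just significant, with very little spread.
HYPOTHETICAL_Z = [2.1, 2.3, 2.0, 2.5, 2.2]
print("R-Index:", round(r_index(HYPOTHETICAL_Z), 2))
var, p = tiva(HYPOTHETICAL_Z)
print("variance:", round(var, 3), "left-tailed p:", round(p, 4))
```

For a set of results like this one, where every study is significant but only barely, both tools point in the same direction: the R-Index is low and the variance of the z-scores is far below 1.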

2015 Replicability Ranking of 100+ Psychology Journals

Replicability rankings of psychology journals differ from traditional rankings based on impact factors (citation rates) and other measures of popularity and prestige. Replicability rankings use the test statistics in the results sections of empirical articles to estimate the average power of statistical tests in a journal. Higher average power means that the results published in a journal have a higher probability of producing a significant result in an exact replication study and a lower probability of being false-positive results.

The rankings are based on statistically significant results only (p < .05, two-tailed) because only statistically significant results can be used to interpret a result as evidence for an effect and against the null-hypothesis.  Published non-significant results are useful for meta-analysis and follow-up studies, but they provide insufficient information to draw statistical inferences.

The average power across the 105 psychology journals used for this ranking is 70%. This means that a representative sample of significant results in exact replication studies is expected to produce 70% significant results. The rankings for 2015 show variability across journals with average power estimates ranging from 84% to 54%.  A factor analysis of annual estimates for 2010-2015 showed that random year-to-year variability accounts for 2/3 of the variance and that 1/3 is explained by stable differences across journals.

The Journal Names are linked to figures that show the powergraphs of a journal for the years 2010-2014 and 2015. The figures provide additional information about the number of tests used, confidence intervals around the average estimate, and power estimates that estimate power including non-significant results even if these are not reported (the file-drawer).

Rank   Journal 2010/14 2015
1   Social Indicators Research   81   84
2   Journal of Happiness Studies   81   83
3   Journal of Comparative Psychology   72   83
4   International Journal of Psychology   80   81
5   Journal of Cross-Cultural Psychology   78   81
6   Child Psychiatry and Human Development   75   81
7   Psychonomic Review and Bulletin   72   80
8   Journal of Personality   72   79
9   Journal of Vocational Behavior   79   78
10   British Journal of Developmental Psychology   75   78
11   Journal of Counseling Psychology   72   78
12   Cognitive Development   69   78
13   JPSP: Personality Processes and Individual Differences   65   78
14   Journal of Research in Personality   75   77
15   Depression & Anxiety   74   77
16   Asian Journal of Social Psychology   73   77
17   Personnel Psychology   78   76
18   Personality and Individual Differences   74   76
19   Personal Relationships   70   76
20   Cognitive Science   77   75
21   Memory and Cognition   73   75
22   Early Human Development   71   75
23   Journal of Sexual Medicine   76   74
24   Journal of Applied Social Psychology   74   74
25   Journal of Experimental Psychology: Learning, Memory & Cognition   74   74
26   Journal of Youth and Adolescence   72   74
27   Social Psychology   71   74
28   Journal of Experimental Psychology: Human Perception and Performance   74   73
29   Cognition and Emotion   72   73
30   Journal of Affective Disorders   71   73
31   Attention, Perception and Psychophysics   71   73
32   Evolution & Human Behavior   68   73
33   Developmental Science   68   73
34   Schizophrenia Research   66   73
35   Archives of Sexual Behavior   76   72
36   Pain   74   72
37    Acta Psychologica   72   72
38   Cognition   72   72
39   Journal of Experimental Child Psychology   72   72
40   Aggressive Behavior   72   72
41   Journal of Social Psychology   72   72
42   Behaviour Research and Therapy   70   72
43   Frontiers in Psychology   70   72
44   Journal of Autism and Developmental Disorders   70   72
45   Child Development   69   72
46   Epilepsy & Behavior   75   71
47   Journal of Child and Family Studies   72   71
48   Psychology of Music   71   71
49   Psychology and Aging   71   71
50   Journal of Memory and Language   69   71
51   Journal of Experimental Psychology: General   69   71
52   Psychotherapy   78   70
53   Developmental Psychology   71   70
54   Behavior Therapy   69   70
55   Judgment and Decision Making   68   70
56   Behavioral Brain Research   68   70
57   Social Psychology and Personality Science   62   70
58   Political Psychology   75   69
59   Cognitive Psychology   74   69
60   Organizational Behavior and Human Decision Processes   69   69
61   Appetite   69   69
62   Motivation and Emotion   69   69
63   Sex Roles   68   69
64   Journal of Experimental Psychology: Applied   68   69
65   Journal of Applied Psychology   67   69
66   Behavioral Neuroscience   67   69
67   Psychological Science   67   68
68   Emotion   67   68
69   Developmental Psychobiology   66   68
70   European Journal of Social Psychology   65   68
71   Biological Psychology   65   68
72   British Journal of Social Psychology   64   68
73   JPSP: Attitudes & Social Cognition   62   68
74   Animal Behavior   69   67
75   Psychophysiology   67   67
76   Journal of Child Psychology and Psychiatry and Allied Disciplines   66   67
77   Journal of Research on Adolescence   75   66
78   Journal of Educational Psychology   74   66
79   Clinical Psychological Science   69   66
80   Consciousness and Cognition   69   66
81   The Journal of Positive Psychology   65   66
82   Hormones & Behavior   64   66
83   Journal of Clinical Child and Adolescent Psychology   62   66
84   Journal of Gerontology: Series B   72   65
85   Psychological Medicine   66   65
86   Personality and Social Psychology Bulletin   64   64
87   Infancy   61   64
88   Memory   75   63
89   Law and Human Behavior   70   63
90   Group Processes & Intergroup Relations   70   63
91   Journal of Social and Personal Relationships   69   63
92   Cortex   67   63
93   Journal of Abnormal Psychology   64   63
94   Journal of Consumer Psychology   60   63
95   Psychology of Violence   71   62
96   Psychoneuroendocrinology   63   62
97   Health Psychology   68   61
98   Journal of Experimental Social Psychology   59   61
99   JPSP: Interpersonal Relationships and Group Processes   60   60
100   Social Cognition   65   59
101   Journal of Consulting and Clinical Psychology   63   58
102   European Journal of Personality   72   57
103   Journal of Family Psychology   60   57
104   Social Development   75   55
105   Annals of Behavioral Medicine   65   54
106   Self and Identity   63   54