Replicability rankings of psychology journals differ from traditional rankings based on impact factors (citation rates) and other measures of popularity and prestige. Replicability rankings use the test statistics reported in the results sections of empirical articles to estimate the average power of the statistical tests in a journal. Higher average power means that the results published in a journal have a higher probability of producing a significant result in an exact replication study and are less likely to be false positives.
The rankings are based on statistically significant results only (p < .05, two-tailed) because only statistically significant results can be interpreted as evidence for an effect and against the null hypothesis. Published non-significant results are useful for meta-analysis and follow-up studies, but they provide insufficient information to draw statistical inferences.
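The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce the rankings: it assumes that each result has already been reduced to a two-tailed p-value, and the function names and example p-values are invented for the sketch. It keeps the results with p < .05 and converts each one to the absolute z-score that has the same two-tailed p-value.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)
ALPHA = 0.05               # two-tailed significance criterion from the text


def p_to_abs_z(p: float) -> float:
    """Convert a two-tailed p-value to the absolute z-score with the same p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return STD_NORMAL.inv_cdf(1.0 - p / 2.0)


def significant_z_scores(p_values):
    """Keep only results with p < .05 (two-tailed) and return their z-scores."""
    return [p_to_abs_z(p) for p in p_values if p < ALPHA]


# Hypothetical p-values, invented for illustration only.
example_p_values = [0.001, 0.020, 0.049, 0.080, 0.300]
example_z = significant_z_scores(example_p_values)
print(example_z)
```

Every z-score that survives the filter is larger than about 1.96, the two-tailed criterion value. Averaging the power implied by these selected z-scores directly would overestimate true power, because selection for significance inflates them; the sketch covers the filtering and conversion step only.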
The average power across the 105 psychology journals used for this ranking is 70%. This means that if a representative sample of the significant results were subjected to exact replication studies, about 70% of the replications would be expected to produce a significant result again. The rankings for 2015 show variability across journals, with average power estimates ranging from 84% to 54%. A factor analysis of annual estimates for 2010-2015 showed that random year-to-year variability accounts for 2/3 of the variance and that 1/3 is explained by stable differences across journals.
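The link between average power and the expected share of significant replications can be made concrete with a small sketch. It rests on assumptions that are not part of the rankings themselves: each study is treated as a two-tailed z-test with a known true noncentrality (the expected z-score), and the noncentrality values below are hypothetical. Under those assumptions, the expected proportion of significant exact replications is simply the mean of the individual power values.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.05 / 2)  # two-tailed criterion, about 1.96


def power_two_tailed(ncp: float) -> float:
    """Probability that a two-tailed z-test with true noncentrality `ncp`
    yields a significant result (in either direction)."""
    upper = 1.0 - STD_NORMAL.cdf(Z_CRIT - ncp)  # significant, positive sign
    lower = STD_NORMAL.cdf(-Z_CRIT - ncp)       # significant, negative sign
    return upper + lower


def expected_replication_rate(ncps) -> float:
    """Expected share of significant exact replications: the mean power."""
    return mean(power_two_tailed(ncp) for ncp in ncps)


# Hypothetical true noncentralities, invented for illustration only.
example_ncps = [1.0, 2.0, 2.8, 3.5]
print(round(expected_replication_rate(example_ncps), 2))
```

With a noncentrality of zero (no true effect) the function returns the false-positive rate of 5%, and with a noncentrality equal to the criterion value it returns just over 50%, which is why a just-significant result has roughly even odds of replicating.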
The journal names are linked to figures that show the powergraphs of a journal for the years 2010-2014 and 2015. The figures provide additional information about the number of tests used, confidence intervals around the average estimate, and power estimates that also take non-significant results into account, even if these are not reported (the file drawer).
Hello! I have a question about this sentence: “Published non-significant results are useful for meta-analysis and follow-up studies, but they provide insufficient information to draw statistical inferences.” Does this mean that such findings provide insufficient information to draw inferences about the power to detect an effect size? Or is it insufficient information to draw ANY inferences? I think you mean the former, because null findings are meaningful for other purposes, but perhaps do not provide information about effect sizes (because the data do not suggest that there is an effect). Please let me know your thoughts on that. Thank you!
Thanks for your question. I mean that we cannot draw inferences from non-significant results. We can’t conclude that there is an effect and we cannot conclude that there is no effect. We should not hide these results for this reason, but unless we collect more data, the result is inconclusive.