Since 2011, it has been an open secret that many results published in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examines the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). In response to scientific criticism of these methods, I have improved how articles are selected for departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using department websites to identify researchers who belong to the psychology department. This eliminates articles that are from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances of getting tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that

(a) only results that were found in an automatic search are included

(b) only results published in 120 journals are included (see list of journals)

(c) published significant results (p < .05) may not be a representative sample of all significant results

(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

## University of British Columbia (Vancouver)

I used the department website to find core members of the psychology department. I counted 34 professors and 7 associate professors. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 22 professors and 4 associate professors who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 26 faculty members. I use this figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,531 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 21% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it provides an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 68% of the published results. This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 68% ODR and a 21% EDR provides an estimate of the extent of selection for significance. The difference of roughly 50 percentage points is large. The upper limit of the 95% confidence interval for the EDR is 29%, which is still well below the ODR. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR is similar (68% vs. 72%), but the EDR is lower (21% vs. 31%). This suggests that the research produced by UBC faculty members is somewhat less replicable than research in general.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that would be expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 67% suggests a fairly high replication rate. The problem is that actual replication rates are lower (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies, and selection for significance leads to regression to the mean when replication studies are not exact. Thus, the ERR represents an unrealistic best-case scenario. In contrast, the EDR represents the worst-case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is no different from the success rate of original studies. The EDR of 21% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For UBC research, the ARP is 44%. This is close to the current best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, UBC results are expected to replicate at the average rate of psychological research.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate (FDR). An EDR of 21% implies that no more than 20% of the significant results are false positives, but the lower limit of the 95% CI of the EDR, 13%, allows for 34% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 7%. Thus, without any further information, readers could use this criterion to interpret results published in articles by researchers in the psychology department of UBC.
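The arithmetic behind points 1, 4, and 5 can be sketched in a few lines of Python. This is only an illustration, not the z-curve software: it does not fit the model, so the EDR and ERR are taken as given. The function names are hypothetical, and the Soric bound is written in its commonly cited form, max FDR = (1/EDR - 1) × alpha/(1 - alpha), which is an assumption about the exact formula used for the numbers reported here.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, sd 1) from the standard library.
_STD_NORMAL = NormalDist()


def p_to_abs_z(p_two_tailed: float) -> float:
    """Convert a two-tailed p-value into an absolute z-score.

    Hypothetical helper: test statistics (t, F, chi-square, ...) would
    first have to be converted into two-tailed p-values elsewhere.
    """
    if not 0.0 < p_two_tailed <= 1.0:
        raise ValueError("p-value must be in the interval (0, 1]")
    # Use the lower tail so that very small p-values keep their precision.
    return abs(_STD_NORMAL.inv_cdf(p_two_tailed / 2.0))


def soric_max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Upper bound on the false discovery rate implied by a discovery rate.

    Assumed form of Soric's (1989) bound:
    (1 / EDR - 1) * alpha / (1 - alpha), capped at 1.
    """
    if not 0.0 < edr <= 1.0:
        raise ValueError("EDR must be in the interval (0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in the interval (0, 1)")
    return min(1.0, (1.0 / edr - 1.0) * alpha / (1.0 - alpha))


def actual_replication_prediction(edr: float, err: float) -> float:
    """ARP as defined in the text: the mean of EDR and ERR."""
    return (edr + err) / 2.0


if __name__ == "__main__":
    # Significance thresholds marked in the z-curve plot.
    print(f"z for p = .05: {p_to_abs_z(0.05):.2f}")  # about 1.96
    print(f"z for p = .10: {p_to_abs_z(0.10):.2f}")  # about 1.64

    # Department-level estimates reported in the text.
    edr, err = 0.21, 0.67
    print(f"ARP: {actual_replication_prediction(edr, err):.0%}")
    print(f"Maximum FDR: {soric_max_fdr(edr):.0%}")
```

With the department-level estimates (EDR = 21%, ERR = 67%), the sketch reproduces the ARP of 44% and a maximum false discovery rate of about 20%. The bound at alpha = .005 cannot be obtained by changing `alpha` alone, because the EDR itself has to be re-estimated for the stricter criterion.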

The next analyses examine area as a potential moderator. Actual replication studies suggest that social psychology has a lower replication rate than cognitive psychology, whereas the replicability of other areas is currently unknown (OSC, 2015). UBC has a large group of social psychologists with enough data to conduct a z-curve analysis (k = 9). Figure 3 shows the z-curve for the pooled data. The results show no notable difference from the z-curve for the department in general.

The only other area with at least five members that provided data to the overall z-curve was developmental psychology. The results are similar, although the EDR is a bit higher.

The last analysis examined whether research practices changed in response to the credibility crisis and evidence of low replication rates (OSC, 2015). For this purpose, I limited the analysis to articles published in the past 5 years. The EDR increased, but only slightly (29% vs. 21%) and not significantly. This suggests that research practices have not changed notably.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 26 faculty members who provided results for the departmental z-curve. You can see the z-curve for individual faculty members by clicking on their names.

EDR = expected discovery rate (mean power before selection for significance)

ERR = expected replication rate (mean power after selection for significance)

FDR = false positive risk (maximum false positive rate, Soric, 1989)

ARP = actual replication prediction (mean of EDR and ERR)

| Rank | Name | ARP (%) | ERR (%) | EDR (%) | FDR (%) |
|---|---|---|---|---|---|
| 1 | Darko Odic | 67 | 73 | 62 | 3 |
| 2 | Steven J. Heine | 64 | 80 | 47 | 6 |
| 3 | J. Kiley Hamlin | 60 | 62 | 58 | 4 |
| 4 | Lynn E. Alden | 57 | 71 | 43 | 7 |
| 5 | Azim F. Shariff | 55 | 69 | 40 | 8 |
| 6 | Andrew Scott Baron | 54 | 72 | 35 | 10 |
| 7 | James T. Enns | 53 | 75 | 31 | 12 |
| 8 | Catharine A. Winstanley | 53 | 53 | 53 | 5 |
| 9 | D. Geoffrey Hall | 49 | 76 | 22 | 19 |
| 10 | Elizabeth W. Dunn | 48 | 58 | 37 | 9 |
| 11 | Alan Kingstone | 48 | 74 | 23 | 18 |
| 12 | Jessica L. Tracy | 47 | 67 | 28 | 14 |
| 13 | Sheila R. Woody | 46 | 63 | 30 | 13 |
| 14 | Jeremy C. Biesanz | 45 | 61 | 28 | 13 |
| 15 | Kristin Laurin | 43 | 59 | 26 | 15 |
| 16 | Luke Clark | 42 | 63 | 21 | 20 |
| 17 | Frances S. Chen | 41 | 64 | 17 | 25 |
| 18 | Mark Schaller | 41 | 57 | 24 | 16 |
| 19 | Kalina Christoff | 40 | 63 | 16 | 27 |
| 20 | E. David Klonsky | 39 | 50 | 27 | 14 |
| 21 | Ara Norenzayan | 38 | 62 | 15 | 30 |
| 22 | Toni Schmader | 37 | 56 | 18 | 25 |
| 23 | Liisa A. M. Galea | 36 | 59 | 13 | 36 |
| 24 | Janet F. Werker | 35 | 49 | 20 | 21 |
| 25 | Todd C. Handy | 30 | 47 | 12 | 39 |
| 26 | Stan B. Floresco | 26 | 43 | 9 | 55 |

Hi Ulrich, I love the emphasis on replicability. Be careful about using non-representative sampling to make substantive inferences, e.g., I haven’t gone through everyone in the department, but in my case your computation of my replicability indices seems to be based on just 14 of my 100ish empirical papers. All the best to you. – David Klonsky

Dear David,

thank you for your comment. All methods have limitations and one limitation of my method is that I don’t have access to all journals (paywall problems). The journals that are covered focus on core psychology disciplines and may not be representative of your main line of work. I am happy to do a complete analysis of your work, if you can share a folder with pdf files of your articles.

Best, Ulrich