## Introduction

Since 2011, it has been an open secret that many results published in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using department websites to identify researchers who belong to the psychology department. This eliminates articles that are from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that

(a) only results that were found in an automatic search are included;

(b) only results published in 120 journals are included (see list of journals);

(c) published significant results (p < .05) may not be a representative sample of all significant results; and

(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

## University of Michigan

I used the department website to find core members of the psychology department. I counted 55 professors and 11 associate professors, which makes it one of the largest departments in North America. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 29 professors and 5 associate professors who had at least 100 test statistics.

Figure 1 shows the z-curve for all 12,365 test statistics in articles published by these faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,781 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (dashed blue/red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red/white line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The full grey curve is not shown to present a clear picture of the observed distribution. The statistically significant results (including z > 6) make up 41% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it provides an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 72% of the published results. This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 72% ODR and a 41% EDR provides an estimate of the extent of selection for significance. The difference of about 30 percentage points is fairly large, but other departments have even bigger discrepancies. The upper limit of the 95% confidence interval for the EDR is 57%, which is still well below the ODR of 72%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR is similar (both 72%), but the EDR is higher (41% vs. 28%).

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate (FDR). An EDR of 41% implies that no more than 8% of the significant results are false positives. However, the lower limit of the 95% CI of the EDR, 28%, allows for 13% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .01 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 5%. Thus, without any further information, readers could use this criterion to interpret results published in articles by U Michigan faculty members.
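The quantities in the steps above can be illustrated with a minimal sketch. It is not the z-curve model itself, which estimates the EDR by fitting a mixture model to the significant z-scores. The function names are hypothetical, and the formula for Soric's bound, (1/EDR - 1) * alpha / (1 - alpha), is stated here as an assumption because the post cites it without writing it out.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_to_abs_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value (0 < p < 1) into an absolute z-score."""
    # The lower tail is used because it is numerically stable for tiny p-values.
    return -_STD_NORMAL.inv_cdf(p_two_sided / 2.0)


def observed_discovery_rate(abs_z_scores, alpha: float = 0.05) -> float:
    """Share of reported tests that are significant at the given alpha."""
    critical_z = p_to_abs_z(alpha)
    significant = sum(1 for z in abs_z_scores if z > critical_z)
    return significant / len(abs_z_scores)


def soric_max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Assumed form of Soric's (1989) upper bound on the false discovery rate."""
    return (1.0 / edr - 1.0) * alpha / (1.0 - alpha)


if __name__ == "__main__":
    print(round(p_to_abs_z(0.05), 2))     # critical z for p = .05, two-tailed
    print(round(soric_max_fdr(0.41), 3))  # EDR point estimate
    print(round(soric_max_fdr(0.28), 3))  # lower limit of the EDR's 95% CI
```

With the EDR values reported above, the bound comes out at roughly 0.076 and 0.135, close to the 8% and 13% in the text given that the published EDR values are rounded. The alpha = .01 results in Figure 2 cannot be obtained by changing only `alpha` here, because the EDR itself has to be re-estimated for the stricter criterion.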

Replicability varies across disciplines. Before 2015, replicability was particularly low in social psychology (Schimmack, 2021). This difference is also visible in separate analyses of researchers from different fields at the University of Michigan. For the 9 social psychologists with sufficient data, the EDR is only 27%, which also implies a higher false positive risk of 14%, but these point estimates have wide confidence intervals.

There were only six cognitive researchers with usable data. Their z-curve shows less selection bias and a higher EDR estimate than the z-curve for social psychologists. The difference between social and cognitive psychologists is significant (i.e., the 95% CIs do not overlap).

However, it is notable that the z-curve overestimates the number of z-scores that are just significant (z = 2 to 2.2), while it underestimates the percentage of z-scores between 2.2 and 2.4. This may suggest that the selection model is wrong and that just significant p-values are sometimes not published. A sensitivity (or multiverse) analysis can use different selection models. Using only z-scores above 2.2 (the vertical blue dotted line in the figure below) does not change the estimate of the expected replication rate (ERR) much, but the EDR estimate is considerably lower, 43%, and the 95% CI goes as low as 18%. This leads to higher false discovery risks, with an upper limit of the 95% CI of 24%. It is therefore prudent to treat p-values greater than .01 with caution.

University of Michigan has a large group of developmental psychologists. The 9 faculty with usable data provided 4,382 test statistics. The results are better than those for social psychology, but not as good as those for cognitive psychology when the standard selection criterion is used. These results are consistent with typical differences between these disciplines that are reflected in analyses of 120 psychology journals (Schimmack, 2022).

Most of the faculty at U Michigan are full professors. Only 5 associate professors provided sufficient data for a z-curve analysis. The total number of test-statistics was 1,100.

The z-curve shows no evidence that research practices of associate professors are different from those of full professors.

Another way to look at research practices is to limit the analysis to articles published since 2016, which is the first year in which some journals show an increase in replicability (Schimmack, 2022). However, there is no notable difference from the z-curve for all years. This is in part due to the relatively (not absolutely) good performance of the University of Michigan. Other departments have a lot more room for improvement.

The table below shows the meta-statistics of the individual faculty members included in the analysis. You can see the z-curve for each faculty member by clicking on their name.

| Rank | Name | ARP | ERR | EDR | FDR |
|---|---|---|---|---|---|
| 1 | Patricia A. Reuter-Lorenz | 78 | 81 | 76 | 2 |
| 2 | Daniel H. Weissman | 74 | 79 | 69 | 2 |
| 3 | John Jonides | 73 | 76 | 70 | 2 |
| 4 | William J. Gehring | 72 | 75 | 68 | 2 |
| 5 | Cindy Lustig | 69 | 73 | 66 | 3 |
| 6 | Henry M. Wellman | 69 | 73 | 65 | 3 |
| 7 | Terri D. Conley | 68 | 71 | 64 | 3 |
| 8 | Allison Earl | 65 | 72 | 57 | 4 |
| 9 | Terry E. Robinson | 65 | 69 | 61 | 3 |
| 10 | Arnold K. Ho | 65 | 72 | 57 | 4 |
| 11 | Felix Warneken | 63 | 65 | 60 | 4 |
| 12 | Stephanie A. Fryberg | 62 | 65 | 60 | 4 |
| 13 | Twila Tardif | 62 | 65 | 58 | 4 |
| 14 | Martin F. Sarter | 60 | 64 | 56 | 4 |
| 15 | Ashley N. Gearhardt | 57 | 74 | 41 | 8 |
| 16 | Thad A. Polk | 54 | 76 | 32 | 11 |
| 17 | Ethan Kross | 52 | 64 | 40 | 8 |
| 18 | Susan A. Gelman | 52 | 78 | 26 | 15 |
| 19 | Julie E. Boland | 51 | 63 | 39 | 8 |
| 20 | Shinobu Kitayama | 51 | 71 | 32 | 11 |
| 21 | David Dunning | 50 | 68 | 32 | 11 |
| 22 | Christopher S. Monk | 47 | 74 | 20 | 21 |
| 23 | Priti Shah | 44 | 70 | 18 | 24 |
| 24 | Patricia J. Deldin | 43 | 52 | 34 | 10 |
| 25 | Robert M. Sellers | 42 | 63 | 21 | 20 |
| 26 | Brenda L. Volling | 40 | 48 | 33 | 11 |
| 27 | Joshua M. Ackerman | 40 | 62 | 17 | 26 |
| 28 | Abigail J. Stewart | 37 | 49 | 26 | 15 |
| 29 | Nestor L. Lopez-Duran | 33 | 52 | 14 | 34 |
| 30 | Kent C. Berridge | 32 | 50 | 14 | 33 |
| 31 | Sheryl L. Olson | 32 | 51 | 12 | 37 |
| 32 | Denise Sekaquaptewa | 31 | 52 | 10 | 46 |
| 33 | Fiona Lee | 25 | 42 | 9 | 56 |
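How the columns relate to each other can be checked on a few rows. The sketch below rests on two assumptions inferred from the numbers rather than stated in the post: that ARP is the mean of the two rate columns, and that FDR is Soric's bound at alpha = .05 computed from the column immediately to the left of FDR. The function names and the one-point tolerance (to absorb rounding in the published values) are hypothetical.

```python
def soric_max_fdr(rate: float, alpha: float = 0.05) -> float:
    """Assumed form of Soric's (1989) bound on the false discovery rate."""
    return (1.0 / rate - 1.0) * alpha / (1.0 - alpha)


# (rank, ARP, left rate column, right rate column, FDR), copied from the table.
ROWS = [
    (1, 78, 81, 76, 2),
    (16, 54, 76, 32, 11),
    (18, 52, 78, 26, 15),
    (22, 47, 74, 20, 21),
    (24, 43, 52, 34, 10),
]


def check_row(row, tolerance: float = 1.0) -> bool:
    """Return True if ARP and FDR match the assumed relationships."""
    _rank, arp, rate_left, rate_right, fdr = row
    mean_of_rates = (rate_left + rate_right) / 2.0
    # Table values are percentages; the bound works on proportions.
    implied_fdr = 100.0 * soric_max_fdr(rate_right / 100.0)
    return abs(mean_of_rates - arp) <= tolerance and abs(implied_fdr - fdr) <= tolerance


if __name__ == "__main__":
    for row in ROWS:
        print(row[0], check_row(row))
```

Rows with very low values in the column next to FDR can deviate by a few points, because the bound is steep in that range and the published percentages are rounded.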