2021 Replicability Report for the Psychology Department at Stanford 

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examines the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticism of these methods, I have improved the selection process for the articles used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by authors in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively affect their chances of getting tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Stanford University

I used the department website to find core members of the psychology department. I counted 19 professors and 6 associate professors. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 13 professors and 3 associate professors who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 16 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,344 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 22% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it provides an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 67% of the published results. This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 67% ODR and a 22% EDR provides an estimate of the extent of selection for significance. The difference of ~45 percentage points is fairly large. The upper limit of the 95% confidence interval for the EDR is 30%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (67% vs. 72%) and the EDR (22% vs. 28%) are somewhat lower, suggesting that statistical power is lower in studies from Stanford. The conversion of test statistics into z-scores and the computation of these discovery rates are illustrated in a code sketch following point 5.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 60% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance leads to regression to the mean when replication studies are not exact. Thus, the ERR represents an unrealistic best-case scenario. In contrast, the EDR represents the worst-case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 22% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For Stanford, the ARP is (60 + 22)/2 = 41%. This is close to the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, Stanford results are expected to replicate at the average rate of psychological research.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 22% implies that no more than 18% of the significant results are false positives, but the lower limit of the 95%CI of the EDR, 18%, allows for 31% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 4% with an upper limit of the 95% confidence interval of 8%. Thus, without any further information readers could use this criterion to interpret results published in articles by researchers in the psychology department of Stanford University.
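
For readers who want to check these false-positive risk estimates, Soric's bound can be written out explicitly. The formula below is the standard maximum false discovery rate implied by a discovery rate and a significance criterion; plugging in the point estimates reported above reproduces the reported values up to rounding of the EDR.

```latex
\mathrm{FDR}_{\max} \;=\; \left(\frac{1}{\mathrm{EDR}} - 1\right)\cdot\frac{\alpha}{1-\alpha},
\qquad \text{e.g.}\quad \left(\frac{1}{.22} - 1\right)\cdot\frac{.05}{.95} \;\approx\; .19 .
```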
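To make points 1 and 3 above concrete, here is a minimal R sketch of the two computational steps: converting reported test statistics into absolute z-scores, and computing the observed discovery rate, with the EDR and ERR coming from a fitted z-curve model. This is an illustration under my own assumptions, not the actual analysis pipeline: the example statistics and the simulated z-scores are hypothetical, and the call to the CRAN zcurve package uses its default interface, which may differ from the version used for these reports.

```r
# Illustrative sketch (not the actual pipeline).
# Step 1: convert a reported test statistic to an absolute z-score via its
# two-tailed p-value; shown here for one t-test and one F-test.
p_t <- 2 * pt(abs(2.50), df = 48, lower.tail = FALSE)       # t(48) = 2.50
p_F <- pf(9.80, df1 = 1, df2 = 120, lower.tail = FALSE)     # F(1, 120) = 9.80
z_t <- qnorm(p_t / 2, lower.tail = FALSE)                   # |z| for the t-test
z_F <- qnorm(p_F / 2, lower.tail = FALSE)                   # |z| for the F-test

# Step 2: with a full vector of absolute z-scores (simulated here for
# illustration), the ODR is simple arithmetic, while the EDR and ERR come
# from the z-curve model, which is fitted to the significant z-scores only.
set.seed(1)
z <- abs(rnorm(5000, mean = 1, sd = 1.5))   # stand-in for the extracted z-scores
odr <- mean(z > qnorm(.975))                # observed discovery rate

# Assumes the CRAN 'zcurve' package; the interface may differ across versions.
library(zcurve)
fit <- zcurve(z)                            # model fitted to z > 1.96
summary(fit)                                # reports ERR and EDR with confidence intervals
```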

Comparisons of research areas typically show lower replicability for social psychology (OSC, 2015), and Stanford has a large group of social psychologists (k = 10). However, the results for social psychologists at Stanford are comparable to the results for the entire faculty. Thus, the relatively low replicability of research from Stanford compared to other departments cannot be attributed to the large contingent of social psychologists.

Some researchers have changed research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years. The results show a marked improvement. The expected discovery rate more than doubled from 22% to 50%, and this increase is statistically significant. (So far, I have analyzed only seven departments, but this is the only one with a significant increase.) The high EDR reduces the false positive risk to a point estimate of 5% and an upper limit of the 95% confidence interval of 9%. Thus, for newer research, most of the results that are statistically significant with the conventional significance criterion of .05 are likely to be true effects. However, effect sizes are still going to be inflated because selection for significance with modest power produces regression to the mean. Nevertheless, these results provide the first evidence of positive change at the level of departments. It would be interesting to examine whether these changes are due to individual efforts of researchers or reflect systemic changes that have been instituted at Stanford.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 16 faculty members who provided results for the departmental z-curve. You can see the z-curve for an individual faculty member by clicking on their name.

Rank  Name                     ARP  ERR  EDR  FDR
1     Jamil Zaki                69   73   65    3
2     James J. Gross            63   69   56    4
3     Jennifer L. Eberhardt     54   66   41    7
4     Jeanne L. Tsai            51   66   36    9
5     Hyowon Gweon              50   62   39    8
6     Michael C. Frank          47   70   23   17
7     Hazel Rose Markus         47   65   29   13
8     Noah D. Goodman           46   72   20   22
9     Ian H. Gotlib             45   65   26   15
10    Ellen M. Markman          43   62   25   16
11    Carol S. Dweck            41   58   24   17
12    Claude M. Steele          37   52   21   20
13    Laura L. Carstensen       35   57   13   37
14    Benoit Monin              33   53   13   36
15    Geoffrey L. Cohen         29   46   13   37
16    Gregory M. Walton         29   45   14   33

2021 Replicability Report for the Psychology Department at UBC (Vancouver) 

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examines the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticism of these methods, I have improved the selection process for the articles used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by authors in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively affect their chances of getting tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

University of British Columbia (Vancouver)

I used the department website to find core members of the psychology department. I counted 34 professors and 7 associate professors. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 22 professors and 4 associate professors who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 26 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,531 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 21% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it provides an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 68% of the published results. This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 68% ODR and a 21% EDR provides an estimate of the extent of selection for significance. The difference of ~50 percentage points is large. The upper limit of the 95% confidence interval for the EDR is 29%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (68% vs. 72%) is similar, but the EDR is lower (21% vs. 31%). This suggests that the research produced by UBC faculty members is somewhat less replicable than research in general.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 67% suggests a fairly high replication rate. The problem is that actual replication rates are lower (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance leads to regression to the mean when replication studies are not exact. Thus, the ERR represents an unrealistic best-case scenario. In contrast, the EDR represents the worst-case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 21% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For UBC research, the ARP is 44%. This is close to the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, UBC results are expected to replicate at the average rate of psychological research.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 21% implies that no more than 20% of the significant results are false positives, but the lower limit of the 95%CI of the EDR, 13%, allows for 34% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 7%. Thus, without any further information readers could use this criterion to interpret results published in articles by researchers in the psychology department of UBC.

The next analyses examine area as a potential moderator. Actual replication studies suggest that social psychology has a lower replication rate than cognitive psychology, whereas the replicability of other areas is currently unknown (OSC, 2015). UBC has a large group of social psychologists with enough data to conduct a z-curve analysis (k = 9). Figure 3 shows the z-curve for the pooled data. The results show no notable difference from the z-curve for the department in general.

The only other area with at least five members that provided data to the overall z-curve was developmental psychology. The results are similar, although the EDR is a bit higher.

The last analysis examined whether research practices changed in response to the credibility crisis and evidence of low replication rates (OSC, 2015). For this purpose, I limited the analysis to articles published in the past 5 years. The EDR increased, but only slightly (29% vs. 21%) and not significantly. This suggests that research practices have not changed notably.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 26 faculty members who provided results for the departmental z-curve. You can see the z-curve for an individual faculty member by clicking on their name.

EDR = expected discovery rate (mean power before selection for significance)
ERR = expected replication rate (mean power after selection for significance)
FDR = false positive risk (maximum false positive rate, Soric, 1989)
ARP = actual replication prediction (mean of EDR and ERR)

Rank  Name                       ARP  ERR  EDR  FDR
1     Darko Odic                  67   73   62    3
2     Steven J. Heine             64   80   47    6
3     J. Kiley Hamlin             60   62   58    4
4     Lynn E. Alden               57   71   43    7
5     Azim F. Shariff             55   69   40    8
6     Andrew Scott Baron          54   72   35   10
7     James T. Enns               53   75   31   12
8     Catharine A. Winstanley     53   53   53    5
9     D. Geoffrey Hall            49   76   22   19
10    Elizabeth W. Dunn           48   58   37    9
11    Alan Kingstone              48   74   23   18
12    Jessica L. Tracy            47   67   28   14
13    Sheila R. Woody             46   63   30   13
14    Jeremy C. Biesanz           45   61   28   13
15    Kristin Laurin              43   59   26   15
16    Luke Clark                  42   63   21   20
17    Frances S. Chen             41   64   17   25
18    Mark Schaller               41   57   24   16
19    Kalina Christoff            40   63   16   27
20    E. David Klonsky            39   50   27   14
21    Ara Norenzayan              38   62   15   30
22    Toni Schmader               37   56   18   25
23    Liisa A. M. Galea           36   59   13   36
24    Janet F. Werker             35   49   20   21
25    Todd C. Handy               30   47   12   39
26    Stan B. Floresco            26   43    9   55

2021 Replicability Report for the Psychology Department at Yale

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examines the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticism of these methods, I have improved the selection process for the articles used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by authors in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively affect their chances of getting tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Yale

I used the department website to find core members of the psychology department. I counted 13 professors and 4 associate professors, which makes it one of the smaller departments in North America. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 12 professors and 1 associate professor who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 13 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,178 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 31% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it provides an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 69% of the published results. This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 69% ODR and a 31% EDR provides an estimate of the extent of selection for significance. The difference of ~40 percentage points is fairly large. The upper limit of the 95% confidence interval for the EDR is 42%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (69% vs. 72%) and the EDR (31% vs. 28%) are similar. This suggests that the research produced by Yale faculty members is neither more nor less replicable than research produced at other universities.

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 31% implies that no more than 12% of the significant results are false positives, but the lower limit of the 95%CI of the EDR, 18%, allows for 24% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 5%. Thus, without any further information, readers could use this criterion to interpret results published in articles by researchers in the psychology department of Yale University.

Given the small size of the department, it is not very meaningful to conduct separate analyses by area. However, I did conduct a z-curve analysis of articles published since 2016 to examine whether research at Yale has changed in response to the call for improvements in research practices. The results show an increase in the expected discovery rate from 31% to 40%, although the confidence intervals still overlap. Thus, it is not possible to conclude at this moment that this is a real improvement (i.e., it could just be sampling error). The expected replication rate also increased slightly from 65% to 70%. Thus, there are some positive trends, but there is still evidence of selection for significance (ODR 71% vs. EDR = 40%).

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 13 faculty members who provided results for the departmental z-curve. You can see the z-curve for an individual faculty member by clicking on their name.

Rank  Name                      ARP  ERR  EDR  FDR
1     Tyrone D. Cannon           69   73   64    3
2     Frank C. Keil              67   77   56    4
3     Yarrow Dunham              53   71   35   10
4     Woo-Kyoung Ahn             52   73   31   12
5     B. J. Casey                52   64   39    8
6     Nicholas B. Turk-Browne    50   66   34   10
7     Jutta Joorman              49   64   35   10
8     Brian J. Scholl            46   69   23   18
9     Laurie R. Santos           42   66   17   25
10    Melissa J. Ferguson        38   62   14   34
11    Jennifer A. Richeson       37   49   26   15
12    Peter Salovey              36   57   15   30
13    John A. Bargh              35   56   15   31

2021 Replicability Report for the Psychology Department at Harvard

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examines the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticism of these methods, I have improved the selection process for the articles used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by authors in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively affect their chances of getting tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Harvard

I used the department website to find core members of the psychology department. I counted 23 professors and 1 associate professor, which makes it one of the smaller departments in North America. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 16 professors and 1 associate professor who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 17 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,465 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The full grey curve is not shown in order to keep the observed distribution clearly visible. The statistically significant results (including z > 6) make up 27% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it provides an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 69% of the published results. This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 69% ODR and a 27% EDR provides an estimate of the extent of selection for significance. The difference of ~40 percentage points is fairly large. The upper limit of the 95% confidence interval for the EDR is 38%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (69% vs. 72%) and the EDR (27% vs. 28%) are similar. This suggests that the research produced by Harvard faculty members is neither more nor less replicable than research produced at other universities.

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 27% implies that no more than 14% of the significant results are false positives, but the lower limit of the 95%CI of the EDR, 18%, allows for 24% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 5%. Thus, without any further information, readers could use this criterion to interpret results published in articles by researchers in the psychology department of Harvard University.

Most of the faculty are cognitive psychologists (k = 7) or clinical psychologists (k = 5). The z-curve for clinical research shows a lower EDR and ERR, but the confidence intervals are wide and the difference may just reflect sampling error.

Consistent with other comparisons of disciplines, cognitive results have a higher EDR and ERR, but the confidence intervals are too wide to conclude that this difference is statistically significant at Harvard. Thus, the overall results hold largely across areas.

The next analysis examines whether research practices changed in response to the credibility crisis in psychology. I selected articles published since 2016 for this purpose.

The EDR for these newer articles is higher than the EDR for all years (42% vs. 27%), but the ODR (64% vs. 67%) and the ERR (68% vs. 69%) remained largely unchanged. Thus, selection for significance decreased, but it is still present in more recent articles.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 17 faculty members who provided results for the departmental z-curve. You can see the z-curve for an individual faculty member by clicking on their name.

Rank  Name                      ARP  ERR  EDR  FDR
1     Mina Cikara                69   73   65    3
2     Samuel J. Gershman         69   77   60    4
3     George A. Alvarez          62   83   41    8
4     Daniel L. Schacter         57   70   44    7
5     Alfonso Caramazza          56   71   41    8
6     Mahzarin R. Banaji         56   75   37    9
7     Jason P. Mitchell          55   60   50    5
8     Katie A. McLaughlin        51   72   30   12
9     Fiery Cushman              50   77   22   19
10    Elizabeth S. Spelke        49   64   33   11
11    Elizabeth A. Phelps        46   69   22   18
12    Susan E. Carey             45   61   28   13
13    Jesse Snedeker             45   67   22   18
14    Matthew K. Nock            43   57   29   13
15    Jill M. Hooley             43   59   27   14
16    Daniel T. Gilbert          41   64   17   26
17    John R. Weisz              29   43   15   31

Personality and Subjective Well-Being

Over 40 years ago, Costa and McCrae (1980) proposed that subjective well-being is influenced by two personality traits, namely extraversion and neuroticism. They even presented their theory in the form of a causal model.

Forty years later, this model still dominates personality theories of subjective well-being (Anglim et al., 2020). The main revision has been the addition of agreeableness and conscientiousness as additional personality factors that influence well-being (Heller et al., 2004; McCrae & Costa, 1991).

Although it seems natural to test the proposed causal model using structural equation modeling (SEM), personality researchers have resisted the use of causal modeling as a statistical tool. One reason has been that SEM models often do not fit personality data (McCrae et al., 1996). This is hardly a convincing reason to avoid using SEM in personality research. Astronomers did not ban new telescopes that showed more moons around Jupiter than Galileo discovered. They simply revised the number of moons.

It is therefore urgent to test Costa and McCrae’s theory with a method that can falsify it. Even if the data do not fit Costa and McCrae’s original theory, this does not take away from their important contribution 40 years ago. Nobody is arguing that Galileo was a bad astronomer because he only discovered four moons.

Structural equation modeling has two benefits for theory development. First, it can be used to test causal theories. For example, the model in Figure 1 predicts that the effect of extraversion on life-satisfaction is mediated by positive affect, whereas the effect of neuroticism is mediated by negative affect. Finding additional mediators would falsify the model and lead to a revision of the theory. The second benefit is that SEM makes it possible to fit measurement models to the data. This is the aim of the present blog post.

Measurement of Personality

In the 1980s, personality psychologists developed the Big Five model as a general framework to describe individual differences in personality traits. This unified framework has led to the development of Big Five questionnaires. The development of relatively short questionnaires enabled researchers to include personality measures in studies that had a different research focus. As a result, many studies reported correlations between Big Five measures and life-satisfaction. These correlations have been summarized in several meta-analyses. The results were summarized by Anglim et al. (2020), who conducted the latest meta-analysis (current (core)).

Assuming that the latest meta-analysis provides the best estimate of the average correlations, neuroticism shows the strongest relationship with r = -.4, followed by extraversion and conscientiousness, r = .3, and agreeableness, r = .2. The correlation with openness is consistently the weakest, r = .1.

The main problem with these correlations is that they cannot be interpreted as effect sizes of the individual personality traits on life-satisfaction, even if we are willing to assume a causal relationship between personality traits and life-satisfaction. The reason is that scores on the Big Five scales are not independent. As a result, the unique relationship between one Big Five scale and life-satisfaction ratings is smaller (Kim et al., 2018).

It also matters what causes the correlations among Big Five scales. While some theories postulate that these correlations reflect common personality factors, evidence suggests that most of the correlations reflect two response styles, evaluative bias (halo) and acquiescence bias (Anusic et al., 2009). Evaluative bias is particularly problematic because it also influences life-satisfaction ratings (Kim et al., 2012; Schimmack, Schupp, & Wagner, 2008). This shared method variance inflates correlations between personality ratings and life-satisfaction ratings. Thus, the simple correlations in meta-analyses provide inflated effect size estimates. However, it is unclear how much halo bias contributes to variation in life-satisfaction ratings and how much the Big Five contribute to well-being when halo variance is statistically controlled. To answer this question, I conducted a new meta-analysis.

I started with Anglim et al.’s (2020) list of studies to search for reasonably large datasets (N > 700) that reported correlations among all Big Five scales and a life-satisfaction measure. I then added datasets that I knew were not included in the meta-analysis. The focus on large datasets was motivated by two considerations. First, studies with small N may not meet the requirements for structural equation modeling and are likely to produce unreliable estimates. Second, large samples are weighted more heavily in meta-analyses, so small datasets would often only add sampling error without altering the actual results. To ensure that the selection of studies did not influence the results, I compared the simple correlations to the results reported by Anglim et al. (2020).

I included 32 datasets in the analysis with a total sample size of N = 154,223. The correlations tended to be a little bit weaker, but the differences are rather small.

Measurement Models

Typical studies that correlate Big Five scales with life-satisfaction ratings do not have a formal measurement model. Personality is operationalized in terms of the sum score (or mean) on several personality items. This “measurement model” is illustrated in Figure 1. To simplify the presentation, the model includes only two items per Big Five dimension, but the model can be generalized to longer scales.

The key feature of this model is the set of arrows from the items (n1 … c2) to the scale scores (N … C). The correlations among the scale scores are determined by the inter-item correlations. The scale scores are then correlated with the life-satisfaction scores.
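
As a minimal sketch of this scale-score approach (with hypothetical data and item names, not code from the original analyses), the scales are simply item means that are then correlated with the life-satisfaction score:

```r
# Conventional approach without a measurement model: scale scores are item
# means, which are then correlated with life-satisfaction.
# 'mydata' and its item names (n1, n2, e1, e2, ls) are hypothetical; the
# remaining Big Five scales would be computed the same way.
scales <- data.frame(
  N  = rowMeans(mydata[, c("n1", "n2")]),
  E  = rowMeans(mydata[, c("e1", "e2")]),
  LS = mydata$ls
)
round(cor(scales), 2)   # inter-scale correlations and correlations with LS
```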

In this model the assignment of items to scales is arbitrary, but Big Five scales are not. They are based on a theory that scale scores reflect factors that produce a systematic pattern of correlations. The simple assumption is that items were selected to reflect a particular trait (e.g., n1 & n2 reflect the factor Neuroticism). This assumption is illustrated in Figure 2.

In Figure 2, the squares with capital letters in italics (N, E, O, A, C) represent factors. Factors are unobserved variables that cause variation in items. In contrast, the squares with capital letters represent observed variables that are created by averaging observed scores on items. Figure 2 is a measurement model because it makes predictions about the covariation among the items. The causal arrows from the factors to items imply that two items of the same factor are correlated because they are influenced by a common factor (e.g., n1 & n2 are correlated because they are influenced by neuroticism). The simple model in Figure 2 implies that the correlations between items of different factors are zero. It is well-known that this simple measurement model does not fit actual data because there are non-zero correlations between items from different factors. However, this lack of fit is often ignored when researchers simply use scale scores (N to C) as if they were perfect indicators of the factors (N to C). With structural equation modeling it is possible to fit measurement models that actually fit the data and examine how the Big Five factors are related to life-satisfaction scores. This is what I did for each of the 32 datasets. The basic measurement model is shown in Figure 3.

The model in Figure 3 represents the covariation among Big Five items as the function of the Big Five factors (N = Neuroticism/Negative Emotionality, E = Extraversion, O = Openness, A = Agreeableness, & C = Conscientiousness) and a sixth factor H = Halo. The halo factor produces evaluatively consistent correlations among all items. Typically positively coded items of extraversion, openness, agreeableness, and conscientiousness and reverse coded items of neuroticism have positive loadings on this factor because these items tend to be more desirable. However, the loading depends on the desirability of each item. The model no longer contains sum scores (N, E, O, A, C). The reason is that sum scores are suboptimal predictors of a particular criterion. This model maximizes the variance predicted in life-satisfaction scores. However, the model can distinguish between variance that is explained by the six factors and variance that is explained by residual variance in specific items. Using the model indirect function, we get standardized estimates of the contribution of the six factors to the life-satisfaction scores.
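
To make the measurement model in Figure 3 concrete, here is a minimal sketch in lavaan syntax. It is an illustration under my own assumptions rather than the code used for the reported analyses (which relied on a model indirect function in a different SEM program): item names are hypothetical, three items per factor are used so that the halo factor is identified, and halo is specified as orthogonal to the content factors.

```r
# Minimal sketch of a Big-Five-plus-halo measurement model with the
# life-satisfaction score regressed on all six factors.
# Item names (n1 ... c3) and 'ls' are hypothetical placeholders.
library(lavaan)

model <- '
  # substantive Big Five factors
  N =~ n1 + n2 + n3
  E =~ e1 + e2 + e3
  O =~ o1 + o2 + o3
  A =~ a1 + a2 + a3
  C =~ c1 + c2 + c3
  # halo factor loading on all items (loadings depend on item desirability)
  H =~ n1 + n2 + n3 + e1 + e2 + e3 + o1 + o2 + o3 + a1 + a2 + a3 + c1 + c2 + c3
  # halo modeled as independent of the content factors
  H ~~ 0*N
  H ~~ 0*E
  H ~~ 0*O
  H ~~ 0*A
  H ~~ 0*C
  # life-satisfaction regressed on the six factors
  ls ~ N + E + O + A + C + H
'
fit <- sem(model, data = mydata, std.lv = TRUE)
standardizedSolution(fit)   # standardized effects of the six factors on ls
```

The standardized coefficients for the six factors from such a model correspond to the effect sizes summarized in the next section.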

Average Effect Sizes

Figure 4 shows the standardized effect sizes. All effect sizes are smaller than the meta-analytic correlations. Neuroticism remains the strongest Big Five predictor, b = .25, but halo is an equally strong predictor, b = .26. Consistent with Costa and McCrae’s model, extraversion is a significant predictor, but the effect size is small, b = .14. Consistent with McCrae and Costa (1991), agreeableness and conscientiousness are additional predictors, but the effect sizes are even smaller, agreeableness b = .04 and conscientiousness b = .09. Together, the Big Five factors explain 10% of the variance in life-satisfaction and halo explains an additional 6%.

The following figures show the results for each Big Five factor in more detail. Neuroticism effect sizes show a normal distribution. The 95%CI based on tau ranges from b = .09 to b = .44. This variability is not sampling error, which is much smaller in these large samples. Rather, it reflects heterogeneity in effect sizes due to differences in populations, types of measurement, and a host of other factors that vary across studies. Thus, it is possible to conclude that neuroticism explains between 1% and 19% of the variance in well-being. This is a wide confidence interval, and future research is needed to obtain more precise estimates and to find moderators of this relationship. The figure also shows that the mean effect size for representative samples is a bit smaller than the one for the average sample. As the average is based on an arbitrary sample of studies, the average for representative samples may be considered a better estimate of the typical effect size, b = .20. Based on this finding, neuroticism may explain only 4% of the variance in life-satisfaction ratings.
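
For readers who want to verify these percentages: the explained variance is obtained by squaring the standardized coefficients (an approximation that treats the predictors as independent),

```latex
.09^2 \approx .01, \qquad .44^2 \approx .19, \qquad .20^2 = .04 .
```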

Extraversion is also a significant predictor, with a point estimate of b = .14 for all samples and b = .08 for representative samples, suggesting that extraversion explains about 1 to 2 percent of the variance in life-satisfaction judgments. The 95%CI ranges from b = .01 to b = .27, which at the upper end corresponds to an estimate of about 8% explained variance. This confirms the widely held assumption that extraverts are happier than introverts, but the effect size is smaller than some reviews of the literature suggest.

The results for openness are very clear. Openness has no direct relationship with life-satisfaction. The point estimate is close to zero and the 95%CI ranges from b = -.08 to b = .05. Thus, there is insufficient heterogeneity to make it worthwhile to examine moderators. Of course, the relationship is unlikely to be exactly zero, but very small effect sizes are impossible to study reliably given the current levels of measurement error in personality measures.

McCrae and Costa (1991) provided some evidence that agreeableness and conscientiousness also predict life-satisfaction. Meta-analyses supported this conclusion, but effect size estimates were inflated by shared method variance. The following results show that the contribution of these two personality factors to life-satisfaction is small.

The point estimate for agreeableness is b = .04 for all samples and b = .05 for representative samples. The 95%CI ranges from b = .03 to b = .11. Thus, the upper limit of the confidence interval corresponds to an estimate that agreeableness explains only 1% of the variance in life-satisfaction judgments. This is a small effect size. This finding has theoretical implications for theories that try to link pro-social traits like empathy or gratitude to life-satisfaction. The present results suggest that another way to achieve well-being is to focus on one’s own well-being. Of course, selfish pursuits of happiness may have other negative consequences that make pro-social pursuits of happiness more desirable, but the present results do not suggest that a prosocial orientation in itself ensures higher levels of life-satisfaction to any substantial degree. In short, assholes can be happy assholes.

The effect size for conscientiousness is a bit stronger, with a point estimate of b = .09 for all samples and b = .08 for representative samples. The 95%CI is relatively wide and ranges from b = .01 to b = .18, which covers a range of effect sizes that could be considered too small to matter to effect sizes that are substantial with up to 3% explained variance. Thus, future research needs to explore moderators of this relationship.

The most overlooked predictor of life-satisfaction judgments is the shared variance among Big Five ratings that can be attributed to evaluative or halo bias. This factor has been studied separately in a large literature on positive illusions and self-enhancement. The present meta-analysis shows point estimates for the halo factor that match those for neuroticism, with b = .26 for all samples and b = .22 for representative samples. The 95%CI ranges from b = .12 to b = .40, which means the effect size is in the small to moderate range. The biggest question is how this finding should be interpreted. One interpretation is that positive illusions contribute to higher life-satisfaction (Dufner et al., 2019; Taylor & Brown, 1988). The alternative interpretation is that halo variance merely reflects shared method variance that produces a spurious correlation between self-ratings of personality and life-satisfaction (Schimmack, Schupp, & Wagner, 2008; Schimmack & Kim, 2020). Only multi-method studies that measure well-being with methods other than self-ratings can answer this question, but the present meta-analysis shows that up to 16% of the variance in self-ratings can be attributed to halo bias. Past studies often failed to distinguish between Big Five factors and the halo factor, leading to inflated effect size estimates for neuroticism, extraversion, and conscientiousness. Future studies need to control for evaluative biases in studies that correlate self-ratings of personality with self-ratings of outcome measures.

In conclusion, the results do confirm Costa and McCrae’s (1980) prediction that neuroticism and extraversion contribute to life-satisfaction. In addition, they confirm McCrae and Costa’s (1991) prediction that conscientiousness is also a positive predictor of life-satisfaction. While the effect for agreeableness is statistically significant, the effect size is too small to be theoretically meaningful. In addition to the predicted effects, evaluative bias in personality ratings contributes to life-satisfaction judgments and the effect is as strong as the effect of neuroticism.

Moderator Analysis

I used the metafor R package to conduct moderator analyses. Potential moderators were type of measure (NEO vs. other, BFI vs. other), number of personality items per factor, number of life-satisfaction items (one item vs. scale), type of data (correlation matrix vs. raw data), and culture (Anglo vs. other). I ran the analysis for each of the six factors. With 7 moderators and 6 variables, there were 42 statistical tests. Thus, chance alone is expected to produce about two significant results with alpha = .05, but less than one significant result with alpha = .01. I therefore used alpha = .01 to discuss moderator results.
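
A minimal sketch of one such moderator analysis with metafor's rma() function is shown below. The data frame and its column names (yi for the standardized effect of one factor in a given dataset, vi for its sampling variance, and the moderator codes) are hypothetical; the actual coding of the moderators may differ.

```r
# Random-effects meta-regression for one factor (e.g., neuroticism).
# 'dat_neuroticism' and its column names are hypothetical placeholders.
library(metafor)

res <- rma(yi, vi,
           mods = ~ neo + bfi + n_pers_items + n_ls_items + raw_data + anglo,
           data = dat_neuroticism)
summary(res)   # random-effects estimate, tau^2, and tests of the moderators
```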

There were no significant moderators for neuroticism. For extraversion, the number of personality items was a significant predictor, p < .001. However, the effect is weak and suggests that the correlation would increase from .12 for 2 items to .15 for 10 items. For openness, a culture effect emerged, p = .015, which just misses the .01 criterion. Openness had a small negative effect in Anglo cultures, b = -.04, and no effect in non-Anglo cultures, b = .01. However, the effect size is too small to be meaningful. There were no significant results for agreeableness. For conscientiousness, an effect of questionnaire was significant, p = .008. The NEO showed somewhat smaller effects, but this effect was no longer significant in a univariate analysis, and the effect size was small, NEO b = .06 vs. other b = .10. For halo, a significant effect of the number of life-satisfaction items emerged, p = .005. The effect was stronger in studies with multiple-item measures of life-satisfaction (b = .28) than in studies with a single item (b = .20). The reason could be that aggregation of life-satisfaction items increases reliable variance, including response styles. In sum, the moderator analysis suggests that the results are fairly robust across studies with different measures and across different Western cultures.

Discussion

Quantifying effect sizes is necessary to build quantitative theories of personality and well-being. The present results show that three of the Big Five traits are reliable predictors of life-satisfaction ratings that jointly explain a substantial amount of variance in life-satisfaction ratings. Neuroticism is the strongest predictor, but the amount of explained variance is unclear. The 95%CI ranges from 1% to 19%, with a point estimate of 7%. Extraversion is the second strongest predictor, with a 95%CI ranging from 0% to 8% explained variance and a point estimate of 2%. The 95%CI for conscientiousness also ranges from 0% to 3% of the variance, with a point estimate of 1%. Combined, these results suggest that the Big Five personality traits explain between 1% and 20% of the variance in life-satisfaction, with a point estimate of 10% explained variance. Worded differently, the Big Five traits explain 10 +/- 10% of the variance in life-satisfaction judgments. Another 7 +/- 6 percent of the variance is explained by halo bias. In the following sections, I discuss the implications of these findings for future research on personality and well-being.

Improvement in Measurement

Future research needs to improve the construct validity of Big Five measures. Existing measures are ad-hoc scales that lack a clear theoretical foundation, with no clear rationale for the selection of items. New measures like the Big Five Inventory 2 are an improvement, but effect size estimates with this measure that control for halo variance are lacking. Even the BFI-2 has limitations. It measures the higher-order Big Five factors with three facet measures, but more facet measures would be better to obtain stable estimates of the factor loadings of facets on the Big Five factors.

Longitudinal Evidence

The Big Five personality factors and their facets are conceptualized as stable personality dispositions. In support of this view, longitudinal studies show that the majority of the variance in Big Five measures is stable (Anusic & Schimmack, 2016). There is also evidence that up to 50% of the variance in well-being measures is stable (Anusic & Schimmack, 2016), but that there is more state variance in well-being that changes in response to life circumstances (Schimmack & Lucas, 2010). Taken together, these results suggest that personality traits account for a larger portion of the stable variance in well-being. Future studies need to test this prediction with longitudinal studies of personality and well-being.

Mediating Processes

Most of the research on personality and well-being has been limited to correlational studies. Therefore, theories that explain how personality traits influence well-being are rare. One theory postulates that extraversion and neuroticism are affective dispositions that produce individual differences in affect (mood) independent of situational factors, and that mood colors life-evaluations (Schimmack, Diener, & Oishi, 2002). Alternative theories suggest that personality traits influence actual life-circumstances or interact with environmental factors to produce individual differences in well-being. To test these theories, it is important to include measures of environmental factors in studies of personality and well-being. Moreover, sample sizes have to be large to detect interaction effects.

Halo and Well-Being

The presence of halo variance in personality ratings has been known for over 100 years (Thorndike, 1920). Over the years, this variance has been attributed to mere response styles or considered evidence of a motivation to boost self-esteem. There have been few attempts to test empirically whether halo variance is merely a rating bias (other-deception) or reflects positive illusions about the self. This makes it difficult to interpret the contribution of halo variance to variance in self-ratings of well-being. Does this finding merely reflect shared method variance among self-ratings, or do positive illusions about the self increase positive affect, life-satisfaction, and well-being? Studies with informant ratings of well-being are rare, but they tend to show no relationship between halo bias and informant ratings of well-being (Kim, Schimmack, & Oishi, 2012; Schimmack & Kim, 2020). This suggests that halo variance is merely rating bias, but it remains possible that positive illusions increase well-being in ways that are not reflected in informant ratings of well-being.

Normative Theories of Personality Change

Some personality psychologists have proposed a normative theory of personality. According to this theory, the end goal of personality development is to become low in neuroticism and high in agreeableness and conscientiousness, and this personality type is considered more mature. However, a justification for this normative model of personality is lacking. The main objective justification for normative theories of personality is optimal functioning, because functions have clear, external standards of evaluation. For example, within the context of work psychology, conscientiousness can be evaluated positively because highly conscientious workers are more productive. However, there are no objective criteria to evaluate people’s lives. Thus, the only justification for normative theories of personality would be evidence that some personality traits make it more difficult for individuals to achieve high well-being. The present results suggest that neuroticism is the key personality trait that impedes well-being. However, the results do not support the notion that high agreeableness or high conscientiousness is normatively better than low agreeableness or low conscientiousness, because these traits can vary without notable effects on well-being.

Conclusion

Correlations between personality and well-being measures have been reported since Hartmann’s (1936) seminal study of neuroticism and well-being. The literature has grown, and it has been meta-analyzed repeatedly. The results consistently show that neuroticism is the strongest predictor of life-satisfaction ratings, but extraversion, agreeableness, and conscientiousness also show notable simple correlations with life-satisfaction. The present meta-analysis went beyond simple correlations and separated content variance from evaluative variance (halo) in self-ratings of personality. The results showed that halo variance predicts life-satisfaction ratings and accounts for most of the correlations of agreeableness and conscientiousness with well-being. Future studies of personality and well-being need to separate substantive and evaluative variance to take the evaluative nature of personality ratings into account.

Unconscious Thought Theory in Decline

In the late 1980s, experimental social psychology rediscovered the unconscious. One reason might be that psychological laboratories started to use personal computers to conduct studies, which made it possible to present subliminal stimuli and to measure reaction times cheaply. Another reason might have been that conscious social-cognitive processes are relatively boring and easily accessible to introspection, which made it difficult to find novel and eye-catching results with self-reports. The so-called implicit revolution (Greenwald & Banaji, 2017) is still going strong, but first signs of problems are visible everywhere. An article by Ap Dijksterhuis (2004) proposed that unconscious processes are better than conscious deliberation at making complex choices. This article stimulated research on unconscious thought theory.

Figure 1 shows publication and citation rates in Web of Science. Notably, the publication rate increased steeply until 2011, the year the replication crisis in social psychology started. Afterwards, publications show a slowly decreasing trend. However, citations continue to increase, suggesting that concerns about the robustness of published results have not reduced trust in this literature.

A meta-analysis and a failed replication study raised concerns that many findings in this literature may be false positive results (Nieuwenstein et al., 2017). To further examine the credibility of this literature, I subjected the 220 articles in the Web of Science topic search to a z-curve analysis. I first looked for matching articles in a database of articles from 121 psychology journals that includes all major social psychology journals (Schimmack, 2022). This search retrieved 44 articles. An automatic search of these 44 articles produced 534 test statistics. A z-curve analysis of these test statistics showed 64% significant results (not counting marginally significant results, z > 1.65), but the z-curve estimate of the expected discovery rate (the average power of all conducted studies) was only 30%. The 95% confidence interval ranges from 10% to 50% and does not include the observed discovery rate of 64%. Thus, there is clear evidence that the published rate of significant results is inflated by unscientific research practices.
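
The conversion step is simple: each test statistic is mapped to its two-tailed p-value and then to the absolute z-score with the same p-value. The sketch below is my own minimal illustration of this idea for t-tests, not the actual extraction code used for the analysis.

```python
from scipy import stats

def t_to_abs_z(t_value, df):
    """Convert a t-statistic into an absolute z-score via its two-tailed p-value."""
    p = 2 * stats.t.sf(abs(t_value), df)   # two-tailed p-value of the t-test
    return stats.norm.isf(p / 2)           # absolute z-score with the same two-tailed p

# Example: t(40) = 2.50 has p = .017, which corresponds to |z| of roughly 2.4
print(round(t_to_abs_z(2.50, 40), 2))
```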

An EDR of 30% implies that up to 12% of significant results could be false positive results (Soric, 1989). However, due to uncertainty in the estimate of the EDR, the false positive risk could be as high as 49%. The main problem is that it is unclear which of the published results are false positives and which ones are real effects. Another problem is that selection for significance inflates effect size estimates, so even the real effects are likely to be smaller than published effect size estimates.
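
For readers who want to verify these numbers, Soric’s bound can be written directly in terms of the EDR and the significance criterion. Plugging in the point estimate reproduces the 12% figure, and plugging in the lower bound of the EDR’s confidence interval (about 10%) gives a risk of roughly 47%, in line with the upper limit reported above.

\[
\text{FDR}_{\max} \;=\; \left(\frac{1}{\text{EDR}}-1\right)\cdot\frac{\alpha}{1-\alpha}
\;=\; \left(\frac{1}{.30}-1\right)\cdot\frac{.05}{.95} \;\approx\; .12
\]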

One solution to this problem is to focus on results with stronger evidence against the null-hypothesis by lowering the criterion for statistical significance. Some researchers have proposed setting alpha to .005 (Benjamin et al., 2017). Figure 3 shows the implications of this more stringent criterion.

The observed discovery rate is now only 34% because many results that were significant with alpha = .05 are no longer significant with alpha = .005. The expected discovery rate also decreases, but the more stringent criterion for significance lowers the false discovery risk to 3% and even the upper limit of the 95% confidence interval is only 20%. This suggests that most of the results with p-values below .005 reported a real effect. However, automatic extraction of test statistics does not distinguish between focal tests of unconscious thought theory and incidental tests of other hypotheses. Thus, it is unclear how many and which of these 184 significant results provide support for unconscious thought theory. The failed replication study by Nieuwenstein et al. (2017) suggests that it is not easy to find conditions under which unconscious thought is superior. In conclusion, there is presently little to no empirical support for unconscious thought theory, but research articles and literature reviews often cite the existing literature as if these studies can be trusted. The decrease in new studies suggests that it is difficult to find credible evidence.
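
The recount behind these numbers is mechanical: each result is simply re-classified against the stricter threshold of z > 2.81 (p = .005, two-tailed) instead of z > 1.96. A minimal sketch, with made-up z-scores purely for illustration:

```python
from scipy import stats

def discovery_rate(z_scores, alpha):
    """Share of results that are significant at a given two-tailed alpha level."""
    z_crit = stats.norm.isf(alpha / 2)   # 1.96 for alpha = .05, 2.81 for alpha = .005
    return sum(z > z_crit for z in z_scores) / len(z_scores)

# Hypothetical z-scores, for illustration only
zs = [0.8, 1.7, 2.0, 2.3, 2.6, 3.1, 3.9]
print(discovery_rate(zs, 0.05))    # share of results with z > 1.96
print(discovery_rate(zs, 0.005))   # share of results with z > 2.81
```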

2021 Replicability Report for the Psychology Department at the University of Toronto


University of Toronto

I used the department website to find core members of the psychology department. I counted 27 professors and 25 associate professors, which makes it one of the larger departments in North America. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 19 professors and 13 associate professors who had at least 100 test statistics.

Figure 1 shows the z-curve for all 13,462 test statistics in articles published by these 32 faculty members. I use the Figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,743 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (dashed blue/red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red/white line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The full grey curve is not shown to present a clear picture of the observed distribution. The statistically significant results (including z > 6) make up 41% of the total area under the grey curve. This is called the expected discovery rate because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 69% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 69% ODR and a 41% EDR provides an estimate of the extent of selection for significance. The difference of ~30 percentage points is fairly large, but other departments have even bigger discrepancies. The upper limit of the 95% confidence interval for the EDR is 50%, which is still well below the ODR. Thus, the discrepancy is not just sampling error. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR is similar (69% vs. 72%), but the EDR is higher (41% vs. 28%).

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 41% implies that no more than 8% of the significant results are false positives. However, the lower limit of the 95%CI of the EDR, 33%, allows for up to 11% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .01 reduces the point estimate of the FDR to 2% with an upper limit of the 95% confidence interval of 4%. Thus, without any further information, readers could use this criterion to interpret results published in articles by UofT faculty members.
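
For readers who want to check these numbers, here is a small helper that applies Soric’s formula; it is my own sketch, not part of the z-curve software.

```python
def soric_max_fdr(edr, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate for a given EDR."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_max_fdr(0.41), 2))   # 0.08 at the EDR point estimate of 41%
print(round(soric_max_fdr(0.33), 2))   # 0.11 at the lower limit of the EDR's 95%CI
```

The same helper also illustrates why a stricter criterion shrinks the risk: moving from alpha = .05 to alpha = .01 reduces the alpha/(1 - alpha) factor by a factor of about five.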

The University of Toronto has three distinct campuses with a joint graduate program. Faculty members are appointed to one of the campuses, and hiring and promotion decisions are made autonomously at each campus. The three campuses also have different specializations. For example, clinical psychology is concentrated at the Scarborough (UTSC) campus. It is therefore interesting to examine whether results differ across the three campuses. The next figure shows the results for the University of Toronto – Mississauga (UTM) campus, home of the R-Index.

The observed discovery rate and the expected replication rate are very similar to those for the department as a whole, but the point estimate of the EDR for the UTM campus is lower than for UofT in general (29% vs. 41%). The confidence intervals do overlap, so it is not clear whether this is a systematic difference or just sampling error.

The results for the Scarborough campus also show a similar ODR and ERR. The point estimate of the expected discovery rate is a bit higher than for UTM and lower than for the combined analysis, but the confidence intervals overlap.

The results for the St. George campus are mostly in line with the overall results. This is partially due to the fact that researchers on this campus contributed a large number of test results. Overall, these results show that the three campuses are more similar than different from each other.

Another potential moderator is the area of research. Social psychology has been shown to be less replicable than cognitive psychology (OSC, 2015). UofT has a fairly large number of social psychologists who contributed to the z-curve (k = 13), especially on the St. George campus (k = 8). The z-curve for social psychologists at UofT is not different from the overall z-curve and the EDR is higher than for social psychologists at other universities.

The results for the other areas are based on smaller numbers of faculty members. Developmental psychology has a slightly lower EDR but the confidence interval is very wide.

There were only 4 associate or full professors in cognitive psychology with sufficient z-scores (many cognitive researchers publish in neuropsychology journals that are not yet covered). The results are similar to the overall z-curve. Thus, UofT research does not show the difference between social and cognitive psychology that is observed in general or at other universities (Schimmack, 2022).

Another possible moderator is time. Before 2011, researchers were often not aware that searching for significant p-values with many analyses inflates the risk of false positive results considerably. After 2011, some researchers have changed their research practices to increase replicability and reduce the risk of false positive results. As change takes time, I looked for articles published after 2015 to see whether UofT faculty shows signs of improved research practices. Unfortunately, this is not the case. The z-curve is similar to the z-curve for all tests.

The table below shows the meta-statistics of all 32 faculty members that provided results for the departmental z-curve. You can see the z-curve for individual faculty members by clicking on their names.

[Table omitted here: Rank, Name, ARP, EDR, ERR, and FDR for each of the 32 faculty members.]

Publication Bias in the Stereotype Threat Literature

Two recent meta-analyses of stereotype threat studies found evidence of publication bias (Flore & Wicherts, 2014; Shewach et al., 2019). This blog post adds to this evidence by using a new method, called z-curve, that also quantifies the amount of publication bias. The data are based on a search for studies in Web of Science that include “stereotype threat” in the Title or Abstract. This search found 1,077 articles. Figure 1 shows that publications and citations are still increasing.

I then searched for matching articles in a database with 121 psychology journals that includes all major social psychology journals. This search yielded 256 matching articles. I then searched these 256 articles for the results of hypothesis tests. This search produced 3,872 test results that were converted into absolute z-scores as a measure of the strength of evidence against the null-hypothesis. Figure 2 shows a histogram of these z-scores, which is called a z-curve plot.

Visual inspection of the plot shows a clear drop in reported results at z = 1.96. This value corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This finding confirms the meta-analytic evidence of publication bias. To quantify the amount of publication bias, we can compare the observed discovery rate to the expected discovery rate. The observed discovery rate is simply the percentage of statistically significant results, 2,841/3,872 = 73%. The expected discovery rate is based on fitting a model to the distribution of statistically significant results and extrapolating from these results to the expected distribution of non-significant results (i.e., the grey curve in Figure 2). The full grey curve is not shown because the mode of the density distribution exceeds the maximum value on the y-axis. The significant results make up only 16% of the area under the grey curve. This suggests that actual tests of stereotype threat effects produce significant results in only 16% of all attempts.

The expected discovery rate can also be used to compute the maximum percentage of significant results that are false positives; that is, significant results obtained without a real effect. An expected discovery rate of 16% implies a false positive risk of 27%. Thus, about a quarter of the published results could be false positives. The problem is that we do not know which of the published results are false positives and which ones are not. Another problem is that selection for significance also inflates effect size estimates. Thus, even real effects may be so small that they have no practical significance.

Is Terror Management Theory Immortal?

The basic idea of terror management theory is that humans are aware of their own mortality and that thoughts about one’s own death elicit fear. To cope with this fear or anxiety, humans engage in various behaviors that reduce death anxiety. To study these effects, participants in experimental studies are either asked to think about death or some other unpleasant event (e.g., dental visits). Numerous studies show statistically significant effects of these manipulations on a variety of measures.

Figure 1 shows that terror management research has grown exponentially. Although the rate of publications is leveling off, citations are still increasing exponentially.

While the growth of terror management research suggests that the theory rests on a large body of evidence, it is not clear whether this evidence is trustworthy. The reason is that psychologists have used a biased procedure to test their theories. When a statistically significant result is obtained, the results are written up and submitted for publication. However, when the results are not significant and do not support a prediction, they typically remain unpublished. It was pointed out long ago that this bias can produce entire literatures with significant results in the absence of real effects (Rosenthal, 1979).

Recent advances in statistical methods make it possible to examine the strength of evidence for a theory after taking publication bias into account. To use this method, I searched Web of Science for articles with the topic “terror management”. This search retrieved 2,394 articles. I then searched for matching articles in a database of 121 psychology journals that includes all major social psychology journals (Schimmack, 2022). This search produced a list of 259 articles. I then searched these 259 articles for statistical tests and converted the results of these tests into absolute z-scores as a measure of the strength of evidence against the null-hypothesis. Figure 2 shows the z-curve plot of the 4,014 results.

The z-curve shows a peak at the criterion for statistical significance (z = 1.96, which equals p = .05, two-tailed). This peak is not a natural phenomenon. Rather, it reflects the selective reporting of supporting evidence. Whereas the published results show 72% significant results, the z-curve model that is fitted to the distribution of significant z-scores estimates that studies had only 14% power to produce significant results. This difference between the observed discovery rate of 72% and the expected discovery rate of 14% shows that unscientific practices dramatically inflate the evidence in favor of terror management theory. It also means that reported effect sizes are inflated. Moreover, an expected discovery rate of 14% implies that up to 32% of the significant results could be false positives that were obtained without any real effect. The upper limit of the 95% confidence interval even allows for 71% false positive results. The problem is that it is unclear which published results reflect real effects that could be replicated. Thus, it is currently unclear how reminders of death influence human behavior.

One limitation of the method used to generate Figure 2 is the automatic extraction of all test results from articles. A better method uses hand-coding of focal hypothesis tests of terror management theory. Fortunately, an independent team of researchers conducted a hand-coding analysis of terror management studies (Chen, Benjamin, Lai, & Heine, 2022).

The results largely confirm the results of the automated analysis. The key difference is that selection for significance is even more evident because researchers hardly ever report non-significant results for focal hypothesis tests. The observed discovery rate of 95% is consistent with analyses by Sterling over 50 years ago (Sterling, 1959). Moreover, most of the non-significant results fall in the range between z = 1.65 (p = .10) and z = 1.96 (p = .05), which is often treated as marginally significant evidence against the null-hypothesis. While the observed discovery rate in published articles is nearly 100%, the expected discovery rate is only 9%, and the 95%CI includes 5%, the rate that is expected by chance alone. Thus, the data provide no credible evidence for any terror management effects, and it is possible that 100% of the significant results are false positives without any real effect.