All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Replicability Rankings of Psychology Departments

Introduction

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles from researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Department Rankings

The main results of the replicability analysis are included in this table. Detailed analyses of departments and faculty members can be found by clicking on the hyperlink of a university.

The table is sorted by the all-time actual replication prediction (ARP). It is easy to sort the table by other meta-statistics.

The ERR is the expected replication rate that is estimated based on the average power of studies with significant results (p < .05).

The EDR is the expected discovery rate that is estimated based on the average power of studies before selection for significance. It is estimated using the distribution of significant p-values converted into z-scores.
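To illustrate this conversion, here is a minimal sketch in R (the p-values are made up for illustration):

# Convert two-sided p-values into absolute z-scores, the metric used by z-curve.
p <- c(0.049, 0.01, 0.001)          # illustrative p-values
z <- qnorm(1 - p / 2)               # p = .05 corresponds to z = 1.96
round(z, 2)                         # 1.97 2.58 3.29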

Bias is the discrepancy between the observed discovery rate (i.e., the percentage of significant results in publications) and the expected discovery rate. Bias reflects the selective reporting of significant results.

The FDR is the false discovery risk. It is estimated using Soric’s formula, which converts the expected discovery rate into an estimate of the maximum percentage of false positive results under the assumption that true hypotheses are tested with 100% power.
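Soric’s bound can be written as a one-line function. The sketch below is a minimal R illustration; the example EDR values are chosen only to show how the bound behaves.

# Soric's (1989) maximum false discovery rate, assuming true hypotheses are
# tested with 100% power: FDR_max = (1/EDR - 1) * alpha / (1 - alpha)
soric_fdr <- function(edr, alpha = .05) (1 / edr - 1) * alpha / (1 - alpha)
soric_fdr(0.40)   # an EDR of 40% caps the FDR at about 8%
soric_fdr(0.20)   # an EDR of 20% caps the FDR at about 21%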

For more information about these statistics, please look for tutorials or articles on z-curve on this blog.
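Readers who want to estimate these statistics for their own sets of test statistics can use the zcurve R package. The sketch below shows the general workflow with simulated z-scores as a placeholder; exact argument names and defaults may differ across package versions, so consult the package documentation.

# install.packages("zcurve")
library(zcurve)
z_values <- abs(rnorm(1000, mean = 2.5, sd = 1))   # placeholder z-scores
fit <- zcurve(z = z_values)   # the model is fitted to significant results (z > 1.96)
summary(fit)                  # reports ERR and EDR with confidence intervals
plot(fit)                     # draws the z-curve plot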

University | ARP-All | ERR-All | EDR-All | Bias-All | FDR-All | ARP-5Y | ERR-5Y | EDR-5Y | Bias-5Y | FDR-5Y
University of Michigan | 55 | 69 | 41 | 31 | 8 | 58.5 | 72 | 45 | 27 | 6
Western University | 54.5 | 70 | 39 | 29 | 8 | 73.5 | 77 | 70 | 1 | 2
University of Toronto | 54 | 67 | 41 | 28 | 8 | 56 | 69 | 43 | 24 | 7
Princeton University | 52.5 | 65 | 40 | 30 | 8 | 67.5 | 74 | 61 | 8 | 3
University of Amsterdam | 50.5 | 66 | 35 | 35 | 10 | 47 | 69 | 25 | 41 | 15
Harvard University | 48 | 69 | 27 | 40 | 14 | 55 | 68 | 42 | 22 | 7
Yale University | 48 | 65 | 31 | 38 | 12 | 55 | 70 | 40 | 31 | 8
University Texas - Austin | 46.5 | 66 | 27 | 44 | 14 | 55.5 | 70 | 41 | 24 | 8
University of British Columbia | 44 | 67 | 21 | 47 | 20 | 47 | 65 | 29 | 34 | 13
McGill University | 43.5 | 66 | 21 | 57 | 20 | 43.5 | 69 | 18 | 57 | 23
Columbia University | 41.5 | 62 | 21 | 49 | 19 | 39 | 61 | 17 | 50 | 26
New York University | 41 | 62 | 20 | 50 | 20 | 48 | 70 | 26 | 43 | 15
Stanford University | 41 | 60 | 22 | 45 | 18 | 58 | 66 | 50 | 15 | 5

2021 Replicability Report for the Psychology Department at Columbia University

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles from researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Columbia University

A research assistant, Dellania Segreti, used the department website to find core members of the psychology department. She found 11 professors and 2 associate professors. This makes Columbia U one of the smaller psychology departments. She used Web of Science to download references related to the authors’ names and initials. An R-script searched for related publications in the database of publications in 120 psychology journals.

Not all researchers conduct quantitative research and report test statistics in their result sections. Therefore, the analysis is limited to 10 faculty members who had at least 100 significant test statistics. This criterion eliminated many faculty members who publish predominantly in neuroscience journals.
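The inclusion rule can be summarized in a few lines of R. This is a hypothetical sketch, not the actual script; "results" stands for a data frame with one row per extracted test statistic, and the column names are illustrative.

sig   <- results[results$p_value < .05, ]   # keep significant results only
n_sig <- table(sig$faculty_member)          # significant results per faculty member
included <- names(n_sig[n_sig >= 100])      # inclusion requires at least 100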

Figure 1 shows the z-curve for all 7,776 test statistics. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 934 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the computation of the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 21% of the total area under the grey curve. This percentage is called the expected discovery rate because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 70% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 70% ODR and a 21% EDR provides an estimate of the extent of selection for significance. The difference of ~50 percentage points is large, and among the largest differences of the psychology departments analyzed so far. The upper limit of the 95% confidence interval for the EDR is 31%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (70% vs. 72%) is similar, but the EDR (21% vs. 28%) is lower, although the difference is not statistically significant and could just be sampling error.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 62% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance leads to regression to the mean when replication studies are not exact. Thus, the ERR represents a best-case scenario that is unrealistic. In contrast, the EDR represents the worst-case scenario, in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 21% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For Columbia, the ARP is (62 + 21)/2 = 42%. This is close to the current best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, research from Columbia University is expected to replicate at the average rate of actual replication studies. The short code sketch after point 5 reproduces the arithmetic behind points 3 to 5.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 21% implies that no more than 19% of the significant results are false positives, but the lower limit of the 95% CI of the EDR, 13%, allows for 38% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 4% with an upper limit of the 95% confidence interval of 9%. Thus, without any further information, readers could use this criterion to interpret results published in articles by psychology researchers at Columbia. Of course, this criterion will be inappropriate for some researchers, but the present results show that the traditional alpha criterion of .05 is also inappropriate to maintain a reasonably low probability of false positive results.
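Here is a short R sketch that reproduces the arithmetic behind points 3 to 5, using the point estimates reported above. Note that the FDR at alpha = .005 in Figure 2 is based on a discovery rate re-estimated for the stricter threshold, so the sketch only indicates how the Soric bound scales with alpha.

odr <- 0.70   # observed discovery rate
edr <- 0.21   # expected discovery rate
err <- 0.62   # expected replication rate
odr - edr                     # ~0.49: the ~50 percentage point gap (selection bias)
(err + edr) / 2               # ~0.42: actual replication prediction (ARP)
soric_fdr <- function(dr, alpha) (1 / dr - 1) * alpha / (1 - alpha)
soric_fdr(edr, alpha = .05)   # ~0.20: maximum false discovery risk at alpha = .05
# At alpha = .005, the alpha / (1 - alpha) factor alone is about ten times smaller,
# and z-curve re-estimates the discovery rate for the stricter criterion.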

Some researchers have changed their research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years (2016-2021).

The results are disappointing. The point estimate is even lower than for all years, although the difference could just be sampling error. Mostly, these results suggest that the psychology department at Columbia University has not responded to the replication crisis in psychology, despite a low replication rate that leaves ample room for improvement. The ARP of 39% for research published since 2016 places Columbia University at the bottom of the universities analyzed so far.

Only one area had enough researchers to conduct an area-specific analysis. The social area had 6 members with useable data. The z-curve shows a slightly lower EDR than the z-curve for all 10 faculty members, although the difference is not statistically significant. The low EDR for the department is partially due to the high percentage of social faculty members with useable data.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all faculty members who provided results for the departmental z-curve. You can see the z-curve for individual faculty members by clicking on their names.

Rank | Name | ARP | ERR | EDR | FDR
1 | Jonathan B. Freeman | 84 | 86 | 82 | 1
2 | Janet Metcalfe | 77 | 80 | 73 | 2
3 | Kevin N. Ochsner | 50 | 66 | 33 | 11
4 | Lila Davachi | 48 | 75 | 21 | 20
5 | Dima Amso | 44 | 63 | 25 | 16
6 | Niall Bolger | 43 | 57 | 30 | 12
7 | Geraldine A. Downey | 37 | 58 | 17 | 26
8 | E. Tory Higgins | 34 | 53 | 15 | 29
9 | Nim Tottenham | 34 | 50 | 17 | 25
10 | Valerie Purdie | 34 | 40 | 27 | 14

2021 Replicability Report for the Psychology Department at U Texas – Austin

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles from researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

University of Texas – Austin

I used the department website to find core members of the psychology department. I counted 35 professors and 6 associate professors. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 20 professors and 3 associate professors who had at least 100 significant test statistics. As noted above, this eliminated many faculty members who publish predominantly in neuroscience journals.

Figure 1 shows the z-curve for all 10,679 test statistics in articles published by 23 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,559 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 27% of the total area under the grey curve. This percentage is called the expected discovery rate because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 71% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 71% ODR and a 27% EDR provides an estimate of the extent of selection for significance. The difference of ~45 percentage points is fairly large. The upper limit of the 95% confidence interval for the EDR is 39%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (71% vs. 72%) and the EDR (27% vs. 28%) are very similar to the average for 120 psychology journals.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 66% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance leads to regression to the mean when replication studies are not exact. Thus, the ERR represents a best-case scenario that is unrealistic. In contrast, the EDR represents the worst-case scenario, in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 27% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For UT Austin, the ARP is (66 + 27)/2 = 47%. This is just a bit above the current best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, UT Austin results are expected to replicate at the average rate of psychological research.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 27% implies that no more than 14% of the significant results are false positives, but the lower limit of the 95% CI of the EDR allows for a considerably higher percentage of false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 2% with an upper limit of the 95% confidence interval of 5%. Thus, without any further information, readers could use this criterion to interpret results published in articles by psychology researchers at UT Austin. Of course, this criterion will be inappropriate for some researchers, but the present results show that the traditional alpha criterion of .05 is also inappropriate to maintain a reasonably low probability of false positive results.

Some researchers have changed their research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years (2016-2021).

The results show an improvement. The EDR increased from 27% to 41%, but the confidence intervals are too wide to infer that this is a systematic change. The false discovery risk dropped to 8%, but due to the smaller sample size the upper limit of the 95% confidence interval is still 19%. Thus, it would be premature to lower the significance level at this point. The muted response to the replication crisis is by no means an exception. Rather, the exception so far is Stanford University, which has shown the only significant increase in the EDR.

Only one area had enough researchers to conduct an area-specific analysis. The social area had 8 members with useable data. The z-curve is similar to the overall z-curve. Thus, there is no evidence that social psychology at UT Austin has lower replicability than other areas.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all faculty members who provided results for the departmental z-curve. You can see the z-curve for individual faculty members by clicking on their names.

Rank | Name | ARP | ERR | EDR | FDR
1 | Chen Yu | 74 | 77 | 71 | 2
2 | Yvon Delville | 71 | 76 | 66 | 3
3 | K. Paige Harden | 62 | 69 | 54 | 4
4 | Cristine H. Legare | 58 | 73 | 42 | 7
5 | James W. Pennebaker | 55 | 63 | 47 | 6
6 | William B. Swann | 53 | 75 | 32 | 11
7 | Bertram Gawronski | 52 | 74 | 29 | 13
8 | Jessica A. Church | 51 | 77 | 24 | 17
9 | David M. Buss | 48 | 76 | 20 | 21
10 | Jasper A. J. Smits | 45 | 57 | 33 | 11
11 | Michael J. Telch | 45 | 67 | 23 | 17
12 | Hongjoo J. Lee | 44 | 70 | 18 | 24
13 | Cindy M. Meston | 44 | 55 | 34 | 10
14 | Jacqueline D. Woolley | 42 | 66 | 18 | 24
15 | Christopher G. Beevers | 41 | 62 | 20 | 21
16 | Marie H. Monfils | 41 | 67 | 16 | 27
17 | Samuel D. Gosling | 38 | 53 | 22 | 18
18 | Arthur B. Markman | 38 | 59 | 18 | 24
19 | David S. Yeager | 38 | 48 | 28 | 14
20 | Robert A. Josephs | 37 | 46 | 28 | 14
21 | Jennifer S. Beer | 36 | 51 | 20 | 21
22 | Frances A. Champagne | 34 | 55 | 13 | 34
23 | Marlone D. Henderson | 27 | 34 | 19 | 22

2021 Replicability Report for the Psychology Department at Stanford 

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles from researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Stanford University

I used the department website to find core members of the psychology department. I counted 19 professors and 6 associate professors. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 13 professors and 3 associate professors who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by 16 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,344 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 22% of the total area under the grey curve. This percentage is called the expected discovery rate because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 67% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 67% ODR and a 22% EDR provides an estimate of the extent of selection for significance. The difference of ~45 percentage points is fairly large. The upper limit of the 95% confidence interval for the EDR is 30%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (67% vs. 72%) and the EDR (22% vs. 28%) are somewhat lower, suggesting that statistical power is lower in studies from Stanford.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 60% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance leads to regression to the mean when replication studies are not exact. Thus, the ERR represents a best-case scenario that is unrealistic. In contrast, the EDR represents the worst-case scenario, in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 22% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For Stanford, the ARP is (60 + 22)/2 = 41%. This is close to the current best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, Stanford results are expected to replicate at the average rate of psychological research.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 22% implies that no more than 18% of the significant results are false positives, but the lower limit of the 95% CI of the EDR, 18%, allows for 31% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 4% with an upper limit of the 95% confidence interval of 8%. Thus, without any further information, readers could use this criterion to interpret results published in articles by researchers in the psychology department of Stanford University.

Comparisons of research areas typically show lower replicability for social psychology (OSC, 2015), and Stanford has a large group of social psychologists (k = 10). However, the results for social psychologists at Stanford are comparable to the results for the entire faculty. Thus, the relatively low replicability of research from Stanford compared to other departments cannot be attributed to the large contingent of social psychologists.

Some researchers have changed their research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years. The results show a marked improvement. The expected discovery rate more than doubled from 22% to 50%, and this increase is statistically significant. (So far, I have analyzed only seven departments, and this is the only one with a significant increase.) The high EDR reduces the false positive risk to a point estimate of 5% and an upper limit of the 95% confidence interval of 9%. Thus, for newer research, most of the results that are statistically significant with the conventional significance criterion of .05 are likely to be true effects. However, effect sizes are still going to be inflated because selection for significance with modest power results in regression to the mean. Nevertheless, these results provide the first evidence of positive change at the level of departments. It would be interesting to examine whether these changes are due to the individual efforts of researchers or reflect systemic changes that have been instituted at Stanford.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 16 faculty members who provided results for the departmental z-curve. You can see the z-curve for individual faculty members by clicking on their names.

Rank | Name | ARP | ERR | EDR | FDR
1 | Jamil Zaki | 69 | 73 | 65 | 3
2 | James J. Gross | 63 | 69 | 56 | 4
3 | Jennifer L. Eberhardt | 54 | 66 | 41 | 7
4 | Jeanne L. Tsai | 51 | 66 | 36 | 9
5 | Hyowon Gweon | 50 | 62 | 39 | 8
6 | Michael C. Frank | 47 | 70 | 23 | 17
7 | Hazel Rose Markus | 47 | 65 | 29 | 13
8 | Noah D. Goodman | 46 | 72 | 20 | 22
9 | Ian H. Gotlib | 45 | 65 | 26 | 15
10 | Ellen M. Markman | 43 | 62 | 25 | 16
11 | Carol S. Dweck | 41 | 58 | 24 | 17
12 | Claude M. Steele | 37 | 52 | 21 | 20
13 | Laura L. Carstensen | 35 | 57 | 13 | 37
14 | Benoit Monin | 33 | 53 | 13 | 36
15 | Geoffrey L. Cohen | 29 | 46 | 13 | 37
16 | Gregory M. Walton | 29 | 45 | 14 | 33

2021 Replicability Report for the Psychology Department at UBC (Vancouver) 

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles from researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

University of British Columbia (Vancouver)

I used the department website to find core members of the psychology department. I counted 34 professors and 7 associate professors. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 22 professors and 4 associate professors who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 26 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,531 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 21% of the total area under the grey curve. This percentage is called the expected discovery rate because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 68% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 68% ODR and a 21% EDR provides an estimate of the extent of selection for significance. The difference of ~50 percentage points is large. The upper limit of the 95% confidence interval for the EDR is 29%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (68% vs. 72%) is similar, but the EDR is lower (21% vs. 28%). This suggests that the research produced by UBC faculty members is somewhat less replicable than research in general.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 67% suggests a fairly high replication rate. The problem is that actual replication rates are lower (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance leads to regression to the mean when replication studies are not exact. Thus, the ERR represents a best-case scenario that is unrealistic. In contrast, the EDR represents the worst-case scenario, in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 21% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For UBC research, the ARP is (67 + 21)/2 = 44%. This is close to the current best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, UBC results are expected to replicate at the average rate of psychological research.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 21% implies that no more than 20% of the significant results are false positives, but the lower limit of the 95% CI of the EDR, 13%, allows for 34% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 7%. Thus, without any further information, readers could use this criterion to interpret results published in articles by researchers in the psychology department of UBC.

The next analyses examine area as a potential moderator. Actual replication studies suggest that social psychology has a lower replication rate than cognitive psychology, whereas the replicability of other areas is currently unknown (OSC, 2015). UBC has a large group of social psychologists with enough data to conduct a z-curve analysis (k = 9). Figure 3 shows the z-curve for the pooled data. The results show no notable difference from the z-curve for the department in general.

The only other area with at least five members that provided data to the overall z-curve was developmental psychology. The results are similar, although the EDR is a bit higher.

The last analysis examined whether research practices changed in response to the credibility crisis and evidence of low replication rates (OSC, 2015). For this purpose, I limited the analysis to articles published in the past 5 years. The EDR increased, but only slightly (29% vs. 21%) and not significantly. This suggests that research practices have not changed notably.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 26 faculty members who provided results for the departmental z-curve. You can see the z-curve for individual faculty members by clicking on their names.

EDR = expected discovery rate (mean power before selection for significance)
ERR = expected replication rate (mean power after selection for significance)
FDR = false discovery risk (maximum false discovery rate; Soric, 1989)
ARP = actual replication prediction (mean of EDR and ERR)

Rank | Name | ARP | ERR | EDR | FDR
1 | Darko Odic | 67 | 73 | 62 | 3
2 | Steven J. Heine | 64 | 80 | 47 | 6
3 | J. Kiley Hamlin | 60 | 62 | 58 | 4
4 | Lynn E. Alden | 57 | 71 | 43 | 7
5 | Azim F. Shariff | 55 | 69 | 40 | 8
6 | Andrew Scott Baron | 54 | 72 | 35 | 10
7 | James T. Enns | 53 | 75 | 31 | 12
8 | Catharine A. Winstanley | 53 | 53 | 53 | 5
9 | D. Geoffrey Hall | 49 | 76 | 22 | 19
10 | Elizabeth W. Dunn | 48 | 58 | 37 | 9
11 | Alan Kingstone | 48 | 74 | 23 | 18
12 | Jessica L. Tracy | 47 | 67 | 28 | 14
13 | Sheila R. Woody | 46 | 63 | 30 | 13
14 | Jeremy C. Biesanz | 45 | 61 | 28 | 13
15 | Kristin Laurin | 43 | 59 | 26 | 15
16 | Luke Clark | 42 | 63 | 21 | 20
17 | Frances S. Chen | 41 | 64 | 17 | 25
18 | Mark Schaller | 41 | 57 | 24 | 16
19 | Kalina Christoff | 40 | 63 | 16 | 27
20 | E. David Klonsky | 39 | 50 | 27 | 14
21 | Ara Norenzayan | 38 | 62 | 15 | 30
22 | Toni Schmader | 37 | 56 | 18 | 25
23 | Liisa A. M. Galea | 36 | 59 | 13 | 36
24 | Janet F. Werker | 35 | 49 | 20 | 21
25 | Todd C. Handy | 30 | 47 | 12 | 39
26 | Stan B. Floresco | 26 | 43 | 9 | 55

2021 Replicability Report for the Psychology Department at Yale

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles from researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Yale

I used the department website to find core members of the psychology department. I counted 13 professors and 4 associate professors, which makes it one of the smaller departments in North America. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 12 professors and 1 associate professor who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 13 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,178 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 31% of the total area under the grey curve. This percentage is called the expected discovery rate because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 69% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 69% ODR and a 31% EDR provides an estimate of the extent of selection for significance. The difference of ~40 percentage points is fairly large. The upper limit of the 95% confidence interval for the EDR is 42%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (69% vs. 72%) and the EDR (31% vs. 28%) are similar. This suggests that the research produced by Yale faculty members is neither more nor less replicable than research produced at other universities.

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 31% implies that no more than 12% of the significant results are false positives, but the lower limit of the 95% CI of the EDR, 18%, allows for 24% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 5%. Thus, without any further information, readers could use this criterion to interpret results published in articles by researchers in the psychology department of Yale University.

Given the small size of the department, it is not very meaningful to conduct separate analyses by area. However, I did conduct a z-curve analysis of articles published since 2016 to examine whether research at Yale has changed in response to the call for improvements in research practices. The results show an increase in the expected discovery rate from 31% to 40%, although the confidence intervals still overlap. Thus, it is not possible to conclude at this moment that this is a real improvement (i.e., it could just be sampling error). The expected replication rate also increased slightly from 65% to 70%. Thus, there are some positive trends, but there is still evidence of selection for significance (ODR 71% vs. EDR = 40%).

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 13 faculty members who provided results for the departmental z-curve. You can see the z-curve for individual faculty members by clicking on their names.

Rank | Name | ARP | ERR | EDR | FDR
1 | Tyrone D. Cannon | 69 | 73 | 64 | 3
2 | Frank C. Keil | 67 | 77 | 56 | 4
3 | Yarrow Dunham | 53 | 71 | 35 | 10
4 | Woo-Kyoung Ahn | 52 | 73 | 31 | 12
5 | B. J. Casey | 52 | 64 | 39 | 8
6 | Nicholas B. Turk-Browne | 50 | 66 | 34 | 10
7 | Jutta Joorman | 49 | 64 | 35 | 10
8 | Brian J. Scholl | 46 | 69 | 23 | 18
9 | Laurie R. Santos | 42 | 66 | 17 | 25
10 | Melissa J. Ferguson | 38 | 62 | 14 | 34
11 | Jennifer A. Richeson | 37 | 49 | 26 | 15
12 | Peter Salovey | 36 | 57 | 15 | 30
13 | John A. Bargh | 35 | 56 | 15 | 31

2021 Replicability Report for the Psychology Department at Harvard

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using department’s websites to identify researchers that belong to the psychology department. This eliminates articles that are from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large difference in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Harvard

I used the department website to find core members of the psychology department. I counted 23 professors and 1 associate professor, which makes it one of the smaller departments in North America. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 16 professors and 1 associate professor who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 17 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,465 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.
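As a rough illustration of this conversion step (this is not the code used for the reported analyses, and the helper names are made up for this example), common test statistics can be turned into two-sided p-values and then into absolute z-scores:

# Illustrative conversion of test statistics to absolute z-scores
# via two-sided p-values (helper names are hypothetical).
to_z <- function(p) qnorm(1 - p / 2)                     # two-sided p -> |z|

z_from_t <- function(t, df) to_z(2 * pt(abs(t), df, lower.tail = FALSE))
z_from_F <- function(Fstat, df1, df2) to_z(pf(Fstat, df1, df2, lower.tail = FALSE))
z_from_r <- function(r, n) z_from_t(r * sqrt((n - 2) / (1 - r^2)), n - 2)

z_from_t(2.5, 48)   # t(48) = 2.5 corresponds to |z| of roughly 2.4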

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The full grey curve is not shown to present a clear picture of the observed distribution. The statistically significant results (including z > 6) make up 27% of the total area under the grey curve. This is called the expected discovery rate because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 69% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 69% ODR and a 27% EDR provides an estimate of the extent of selection for significance. The difference of about 40 percentage points is fairly large. Moreover, the upper limit of the 95% confidence interval for the EDR is 38%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (69% vs. 72%) and the EDR (27% vs. 28%) are similar. This suggests that the research produced by Harvard faculty members is neither more nor less replicable than research produced at other universities.
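Readers who want to run a z-curve analysis on their own set of test statistics can do so with the zcurve R package (Bartoš & Schimmack). The snippet below is only a minimal sketch: z stands for a vector of absolute z-scores like the ones described above (here replaced by simulated placeholder values), and the exact arguments and output labels may differ across package versions.

# Minimal z-curve sketch; assumes the CRAN package 'zcurve' is installed.
library(zcurve)

set.seed(123)
z <- abs(rnorm(1000, mean = 2, sd = 1))   # placeholder data, not the actual test statistics

fit <- zcurve(z)   # fits the model to the significant z-scores (z > 1.96)
summary(fit)       # reports the ERR and EDR with confidence intervals
plot(fit)          # histogram of z-scores with the fitted curve

# Observed discovery rate: share of reported tests that are significant at .05
mean(z > qnorm(1 - .05 / 2))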

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 27% implies that no more than 14% of the significant results are false positives, but the lower limit of the 95%CI of the EDR, 18%, allows for 24% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 5%. Thus, without any further information, readers could use this criterion to interpret results published in articles by researchers in the psychology department of Harvard University.
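This false-positive risk calculation can also be scripted. The small helper below (a name I made up for this example) reproduces the estimates above from the EDR and the lower limit of its confidence interval:

# Soric's (1989) maximum false discovery rate, computed from the EDR.
soric_fdr <- function(edr, alpha = .05) (1 / edr - 1) * alpha / (1 - alpha)

soric_fdr(.27)   # ~ .14, based on the point estimate of the EDR
soric_fdr(.18)   # ~ .24, based on the lower limit of the EDR's 95%CI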

Most of the faculty are cognitive psychologists (k = 7) or clinical psychologists (k = 5). The z-curve for clinical research shows a lower EDR and ERR, but the confidence intervals are wide and the difference may just reflect sampling error.

Consistent with other comparisons of disciplines, cognitive results have a higher EDR and ERR, but the confidence intervals are too wide to conclude that this difference is statistically significant at Harvard. Thus, the overall results hold largely across areas.

The next analysis examines whether research practices changed in response to the credibility crisis in psychology. I selected articles published since 2016 for this purpose.

The EDR for these more recent articles is higher than the EDR for all years (42% vs. 27%), whereas the ODR (64% vs. 67%) and the ERR (68% vs. 69%) remained largely unchanged. Thus, selection for significance decreased, but it is still present in more recent articles.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 17 faculty members who provided results for the departmental z-curve. You can see the z-curve for individual faculty members by clicking on their names.

Rank | Name | ARP | ERR | EDR | FDR
1 | Mina Cikara | 69 | 73 | 65 | 3
2 | Samuel J. Gershman | 69 | 77 | 60 | 4
3 | George A. Alvarez | 62 | 83 | 41 | 8
4 | Daniel L. Schacter | 57 | 70 | 44 | 7
5 | Alfonso Caramazza | 56 | 71 | 41 | 8
6 | Mahzarin R. Banaji | 56 | 75 | 37 | 9
7 | Jason P. Mitchell | 55 | 60 | 50 | 5
8 | Katie A. McLaughlin | 51 | 72 | 30 | 12
9 | Fiery Cushman | 50 | 77 | 22 | 19
10 | Elizabeth S. Spelke | 49 | 64 | 33 | 11
11 | Elizabeth A. Phelps | 46 | 69 | 22 | 18
12 | Susan E. Carey | 45 | 61 | 28 | 13
13 | Jesse Snedeker | 45 | 67 | 22 | 18
14 | Matthew K. Nock | 43 | 57 | 29 | 13
15 | Jill M. Hooley | 43 | 59 | 27 | 14
16 | Daniel T. Gilbert | 41 | 64 | 17 | 26
17 | John R. Weisz | 29 | 43 | 15 | 31

Personality and Subjective Well-Being

Over 40 years ago, Costa and McCrae (1980) proposed that subjective well-being is influenced by two personality traits, namely extraversion and neuroticism. They even presented their theory in the form of a causal model.

Forty years later, this model still dominates personality theories of subjective well-being (Anglim et al., 2020). The main revision has been the addition of agreeableness and conscientiousness as additional personality factors that influence well-being (Heller et al., 2004; McCrae & Costa, 1991).

Although it seems natural to test the proposed causal model using structural equation modeling (SEM), personality researchers have resisted the use of causal modeling as a statistical tool. One reason has been that SEM models often do not fit personality data (McCrae et al., 1996). This is hardly a convincing reason to avoid using SEM in personality research. Astronomers did not ban new telescopes that showed more moons around Jupiter than Galileo discovered. They simply revised the number of moons.

There is therefore an urgent need to test Costa and McCrae’s theory with a method that can falsify it. Even if the data do not fit Costa and McCrae’s original theory, this does not take away from their important contribution 40 years ago. Nobody is arguing that Galileo was a bad astronomer because he only discovered four moons.

Structural equation modeling has two benefits for theory development. First, it can be used to test causal theories. For example, the model in Figure 1 predicts that the effect of extraversion on life-satisfaction is mediated by positive affect, whereas the effect of neuroticism is mediated by negative affect. Finding additional mediators would falsify the model and lead to a revision of the theory. The second benefit is that SEM makes it possible to fit measurement models to the data. This is the aim of the present blog post.
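To illustrate how such a causal model can be specified and tested, here is a minimal sketch of the mediation structure described above in lavaan syntax (extraversion → positive affect → life-satisfaction; neuroticism → negative affect → life-satisfaction). The variable names and the simulated data are placeholders, not the model fitted to any actual dataset.

# Hypothetical mediation model in lavaan; variable names and data are placeholders.
library(lavaan)

set.seed(1)
n <- 500
Extraversion <- rnorm(n)
Neuroticism  <- rnorm(n)
PosAff  <- .5 * Extraversion + rnorm(n)
NegAff  <- .5 * Neuroticism  + rnorm(n)
LifeSat <- .4 * PosAff - .4 * NegAff + rnorm(n)
sim_dat <- data.frame(Extraversion, Neuroticism, PosAff, NegAff, LifeSat)

model <- '
  PosAff  ~ a1 * Extraversion          # E -> positive affect
  NegAff  ~ a2 * Neuroticism           # N -> negative affect
  LifeSat ~ b1 * PosAff + b2 * NegAff  # affect -> life-satisfaction
  ind_E := a1 * b1                     # indirect effect of extraversion
  ind_N := a2 * b2                     # indirect effect of neuroticism
'

fit <- sem(model, data = sim_dat)
summary(fit, standardized = TRUE)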

Measurement of Personality

In the 1980s, personality psychologists developed the Big Five model as a general framework to describe individual differences in personality traits. This unified framework has led to the development of Big Five questionnaires. The development of relatively short questionnaires enabled researchers to include personality measures in studies that had a different research focus. As a result, many studies reported correlations between Big Five measures and life-satisfaction. These correlations have been meta-analyzed several times. The results were summarized by Anglim et al. (2020), who conducted the latest meta-analysis.

Assuming that the latest meta-analysis provides the best estimate of the average correlations, neuroticism shows the strongest relationship with r = -.4, followed by extraversion and conscientiousness, r = .3, and agreeableness, r = .2. The correlation with openness is consistently the weakest, r = .1.

The main problem with these correlations is that they cannot be interpreted as effect sizes of the individual personality traits on life-satisfaction, even if we are willing to assume a causal relationship between personality traits and life-satisfaction. The reason is that scores on the Big Five scales are not independent. As a result, the unique relationship between one Big Five scale and life-satisfaction ratings is smaller than the simple correlation (Kim et al., 2018).
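A small simulation illustrates the point (the numbers are made up for illustration and are not estimates from the meta-analysis): when two predictors share variance, their unique standardized regression weights are smaller than their zero-order correlations with the outcome.

# Illustration: correlated predictors have smaller unique effects than
# their zero-order correlations suggest. All numbers are made up.
set.seed(42)
n <- 10000
shared <- rnorm(n)                 # shared variance (e.g., halo)
E  <- .6 * shared + rnorm(n)       # two correlated trait scores
C  <- .6 * shared + rnorm(n)
LS <- .3 * E + .3 * C + rnorm(n)   # life-satisfaction

cor(LS, E)                                     # zero-order correlation, ~ .39
coef(lm(scale(LS) ~ scale(E) + scale(C)))[2]   # unique standardized effect, ~ .31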

What causes the correlations among Big Five scales also matters. While some theories postulate that these correlations reflect common personality factors, evidence suggests that most of the correlations reflect two response styles, evaluative bias (halo) and acquiescence bias (Anusic et al., 2009). Evaluative bias is particularly problematic because it also influences life-satisfaction ratings (Kim et al., 2012; Schimmack, Schupp, & Wagner, 2008). This shared method variance inflates correlations between personality ratings and life-satisfaction ratings. Thus, the simple correlations in meta-analyses provide inflated effect size estimates. However, it is unclear how much halo bias contributes to variation in life-satisfaction ratings and how much the Big Five contribute to well-being when halo variance is statistically controlled. To answer this question, I conducted a new meta-analysis.

I started with Anglim et al.’s (2020) list of studies to search for reasonably large datasets (N > 700) that reported correlations among all Big Five scales and a life-satisfaction measure. I then added datasets that I knew were not included in the meta-analysis. The focus on large datasets was motivated by two considerations. First, studies with small N may not meet the requirements for structural equation modeling and are likely to produce unreliable estimates. Second, large samples are weighted more heavily in meta-analyses. Thus, small datasets will often only increase sampling error without altering the actual results. To ensure that the selection of studies did not influence the results, I compared the simple correlations to the results reported by Anglim et al. (2020).

I included 32 datasets in the analysis with a total sample size of N = 154,223. The correlations tended to be a little bit weaker, but the differences are rather small.

Measurement Models

Typical studies that correlate Big Five scales with life-satisfaction ratings do not have a formal measurement model. Personality is operationalized in terms of the sum score (or mean) on several personality items. This “measurement model” is illustrated in Figure 1. To simplify the presentation, the model includes only two items per Big Five dimension, but the model can be generalized to longer scales.

The key part of this model is the set of arrows from the items (n1 … c2) to the scale scores (N … C). The correlations among scale scores are determined by the inter-item correlations. The scale scores are then correlated with the life-satisfaction scores.

In this model the assignment of items to scales is arbitrary, but Big Five scales are not. They are based on a theory that scale scores reflect factors that produce a systematic pattern of correlations. The simple assumption is that items were selected to reflect a particular trait (e.g., n1 & n2 reflect the factor Neuroticism). This assumption is illustrated in Figure 2.

In Figure 2, the squares with capital letters in italics (N, E, O, A, C) represent factors. Factors are unobserved variables that cause variation in items. In contrast, the squares with capital letters represent observed variables that are created by averaging observed scores on items. Figure 2 is a measurement model because it makes predictions about the covariation among the items. The causal arrows from the factors to items imply that two items of the same factor are correlated because they are influenced by a common factor (e.g., n1 & n2 are correlated because they are influenced by neuroticism). The simple model in Figure 2 implies that the correlations between items of different factors are zero. It is well-known that this simple measurement model does not fit actual data because there are non-zero correlations between items from different factors. However, this lack of fit is often ignored when researchers simply use scale scores (N to C) as if they were perfect indicators of the factors (N to C). With structural equation modeling it is possible to fit measurement models that actually fit the data and examine how the Big Five factors are related to life-satisfaction scores. This is what I did for each of the 32 datasets. The basic measurement model is shown in Figure 3.

The model in Figure 3 represents the covariation among Big Five items as a function of the Big Five factors (N = Neuroticism/Negative Emotionality, E = Extraversion, O = Openness, A = Agreeableness, & C = Conscientiousness) and a sixth factor, H = Halo. The halo factor produces evaluatively consistent correlations among all items. Typically, positively coded items of extraversion, openness, agreeableness, and conscientiousness and reverse-coded items of neuroticism have positive loadings on this factor because these items tend to be more desirable. However, the loading depends on the desirability of each item. The model no longer contains sum scores (N, E, O, A, C). The reason is that sum scores are suboptimal predictors of a particular criterion. This model maximizes the variance predicted in life-satisfaction scores. Moreover, the model can distinguish between variance that is explained by the six factors and variance that is explained by residual variance in specific items. Using the model indirect function, we get standardized estimates of the contribution of the six factors to the life-satisfaction scores.
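For readers who want to see what such a model looks like in code, here is a minimal sketch of a Big Five plus halo measurement model in lavaan syntax, assuming just two items per trait as in the simplified figures. The item names, the dataset, and the specification details are placeholders; the models fitted to the 32 datasets used more items and additional constraints.

# Minimal sketch of a Big Five + halo measurement model (lavaan syntax).
# Item names (n1 ... c2, ls) and the dataset are placeholders.
library(lavaan)

model <- '
  # content factors, two indicators each
  N =~ n1 + n2
  E =~ e1 + e2
  O =~ o1 + o2
  A =~ a1 + a2
  C =~ c1 + c2

  # halo factor loads on all items (reverse-coded items would load negatively)
  H =~ n1 + n2 + e1 + e2 + o1 + o2 + a1 + a2 + c1 + c2

  # halo is modeled as independent of the content factors
  H ~~ 0*N + 0*E + 0*O + 0*A + 0*C

  # life-satisfaction regressed on all six factors
  ls ~ N + E + O + A + C + H
'

# fit <- sem(model, data = big5_data, std.lv = TRUE)   # big5_data is a placeholder
# standardizedSolution(fit)   # standardized effects of the six factors on ls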

Average Effect Sizes

Figure 4 shows the standardized effect sizes. All effect sizes are smaller than the meta-analytic correlations. Neuroticism remains the strongest Big Five predictor, b = .25, but halo is an equally strong predictor, b = .26. Consistent with Costa and McCrae’s model, extraversion is a significant predictor, but the effect size is small, b = .14. Consistent with McCrae and Costa (1991), agreeableness and conscientiousness are additional predictors, but the effect sizes are even smaller, agreeableness b = .04 and conscientiousness b = .09. Together, the Big Five factors explain 10% of the variance in life-satisfaction and halo explains an additional 6%.

The following figures show the results for each Big Five factor in more detail. Neuroticism effect sizes show a normal distribution. The 95%CI based on tau ranges from b = .09 to b = .44. This variability is not sampling error, which is much smaller in these large samples. Rather, it reflects heterogeneity in effect sizes due to differences in populations, types of measurement, and a host of other factors that vary across studies. Thus, it is possible to conclude that neuroticism explains between 1% and 19% of the variance in well-being. This is a wide interval, and future research is needed to obtain more precise estimates and to find moderators of this relationship. The figure also shows that the mean effect size for representative samples is a bit smaller than the one for the average sample. As the average is based on an arbitrary sample of studies, the average for representative samples may be considered a better estimate of the typical effect size, b = .20. Based on this finding, neuroticism may explain only 4% of the variance in life-satisfaction ratings.
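The 95%CI based on tau is the credibility (prediction) interval from a random-effects meta-analysis. With the metafor package that is also used for the moderator analyses below, it can be obtained roughly as follows; yi and vi stand for the study-level effect sizes and their sampling variances, and the data here are simulated placeholders rather than the actual study estimates.

# Random-effects meta-analysis with a tau-based credibility/prediction interval.
library(metafor)

set.seed(7)
dat <- data.frame(yi = rnorm(32, .25, .10),      # placeholder effect sizes
                  vi = runif(32, .0005, .002))   # placeholder sampling variances

res <- rma(yi, vi, data = dat, method = "REML")
summary(res)    # average effect size and tau^2 (between-study heterogeneity)
predict(res)    # pi.lb and pi.ub give the tau-based 95% interval for true effects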

Extraversion is also a significant predictor, with a point estimate of b = .14 for all samples and b = .08 for representative samples, suggesting that extraversion explains about 1 to 2 percent of the variance in life-satisfaction judgments. The 95%CI ranges from b = .01 to b = .27, which corresponds to an estimate of 8% explained variance. This confirms the widely held assumption that extraverts are happier than introverts, but the effect size is smaller than some reviews of the literature suggest.

The results for openness are very clear. Openness has no direct relationship with life-satisfaction. The point estimate is close to zero and the 95%CI ranges from b = -.08 to b = .05. Thus, there is insufficient heterogeneity to make it worthwhile to examine moderators. Of course, the relationship is unlikely to be exactly zero, but very small effect sizes are impossible to study reliably given the current levels of measurement error in personality measures.

McCrae and Costa (1991) provided some evidence that agreeableness and conscientiousness also predict life-satisfaction. Meta-analyses supported this conclusion, but effect size estimates were inflated by shared method variance. The following results show that the contribution of these two personality factors to life-satisfaction is small.

The point estimate for agreeableness is b = .04 for all samples and b = .05 for representative samples. The 95%CI ranges from b = .03 to b = .11. Thus, the upper limit of the confidence interval corresponds to an estimate that agreeableness explains only 1% of the variance in life-satisfaction judgments. This is a small effect size. This finding has theoretical implications for theories that try to link pro-social traits like empathy or gratitude to life-satisfaction. The present results suggest that another way to achieve well-being is to focus on one’s own well-being. Of course, selfish pursuits of happiness may have other negative consequences that make pro-social pursuits of happiness more desirable, but the present results do not suggest that a prosocial orientation in itself ensures higher levels of life-satisfaction to any substantial degree. In short, assholes can be happy assholes.

The effect size for conscientiousness is a bit stronger, with a point estimate of b = .09 for all samples and b = .08 for representative samples. The 95%CI is relatively wide and ranges from b = .01 to b = .18, which covers a range of effect sizes that could be considered too small to matter to effect sizes that are substantial with up to 3% explained variance. Thus, future research needs to explore moderators of this relationship.

The most overlooked predictor of life-satisfaction judgments is the shared variance among Big Five ratings that can be attributed to evaluative or halo bias. This factor has been studied separately in a large literature on positive illusions and self-enhancement. The present meta-analysis shows point estimates for the halo factor that match those for neuroticism, with b = .26 for all samples and b = .22 for representative samples. The 95%CI ranges from b = .12 to b = .40, which means the effect size is in the small to moderate range. The biggest question is how this finding should be interpreted. One interpretation is that positive illusions contribute to higher life-satisfaction (Dufner et al., 2019; Taylor & Brown, 1988). The alternative interpretation is that halo variance merely reflects shared method variance that produces a spurious correlation between self-ratings of personality and life-satisfaction (Schimmack, Schupp, & Wagner, 2008; Schimmack & Kim, 2020). Only multi-method studies that measure well-being with methods other than self-ratings can answer this question, but the present meta-analysis shows that up to 16% of the variance in life-satisfaction self-ratings can be attributed to halo bias. Past studies often failed to distinguish between the Big Five factors and the halo factor, leading to inflated effect size estimates for neuroticism, extraversion, and conscientiousness. Future studies need to control for evaluative biases in studies that correlate self-ratings of personality with self-ratings of outcome measures.

In conclusion, the results do confirm Costa and McCrae’s (1980) prediction that neuroticism and extraversion contribute to life-satisfaction. In addition, they confirm McCrae and Costa’s (1991) prediction that conscientiousness is also a positive predictor of life-satisfaction. While the effect for agreeableness is statistically significant, the effect size is too small to be theoretically meaningful. In addition to the predicted effects, evaluative bias in personality ratings contributes to life-satisfaction judgments and the effect is as strong as the effect of neuroticism.

Moderator Analysis

I used the metafor R package to conduct moderator analyses. Potential moderators were type of measure (NEO vs. other, BFI vs. other), number of personality items per factor, number of life-satisfaction items (one item vs. scale), the type of data (correlation matrix vs. raw data), and culture (Anglo vs. other). I ran the analysis for each of the six factors. With 7 moderators and 6 variables, there were 42 statistical tests. Thus, chance alone is expected to produce two significant results with alpha = .05, but no significant result with alpha = .01. I therefore used alpha = .01 to discuss moderator results.
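The following snippet shows roughly how such a moderator (meta-regression) analysis is specified in metafor. The moderator names, their coding, and the data are placeholders standing in for the variables described above, not the actual analysis script.

# Sketch of a moderator (meta-regression) analysis with metafor.
# Moderator names, coding, and data are placeholders.
library(metafor)

set.seed(7)
dat <- data.frame(
  yi       = rnorm(32, .25, .10),               # effect size for one factor
  vi       = runif(32, .0005, .002),            # sampling variance
  neo      = rbinom(32, 1, .3),                 # NEO vs. other questionnaire
  bfi      = rbinom(32, 1, .3),                 # BFI vs. other questionnaire
  n_items  = sample(2:12, 32, replace = TRUE),  # personality items per factor
  ls_scale = rbinom(32, 1, .5),                 # multi-item life-satisfaction measure
  raw_data = rbinom(32, 1, .5),                 # raw data vs. correlation matrix
  anglo    = rbinom(32, 1, .5)                  # Anglo vs. other culture
)

res <- rma(yi, vi,
           mods = ~ neo + bfi + n_items + ls_scale + raw_data + anglo,
           data = dat, method = "REML")
summary(res)   # tests of the moderators (alpha = .01 in the text)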

There were no significant moderators for neuroticism. For extraversion, the number of personality items was a significant predictor, p < .001. However, the effect size is weak and suggests that the correlation would increase from .12 for 2 items to .15 for 10 items. For openness, a significant culture effect emerged, p = .015. Openness had a small negative effect in Anglo cultures, b = -.04, and no effect in non-Anglo cultures, b = .01. However, the effect size is too small to be meaningful. There were no significant results for agreeableness. For conscientiousness, an effect of questionnaire was significant, p = .008. The NEO showed somewhat smaller effects, but this effect was no longer significant in a univariate analysis, and the effect size was small, NEO b = .06 vs. other b = .10. For halo, a significant effect of the number of life-satisfaction items emerged, p = .005. The effect was stronger in studies with multiple-item measures of life-satisfaction, b = .28, than in studies with a single item, b = .20. The reason could be that aggregation of life-satisfaction items increases reliable variance, including response styles. In sum, the moderator analysis suggests that results are fairly robust across studies with different measures and across different Western cultures.

Discussion

Quantifying effect sizes is necessary to build quantitative theories of personality and well-being. The present results show that three of the Big Five traits are reliable predictors of life-satisfaction ratings that jointly explain a substantial amount of variance in life-satisfaction ratings. Neuroticism is the strongest predictor, but the amount of explained variance is unclear. The 95%CI ranges from 1% to 19%, with a point estimate of 7%. Extraversion is the second strongest predictor, with a 95%CI ranging from 0 to 8% of explained variance and a point estimate of 2%. The 95%CI for conscientiousness also ranges from 0 to 3% of explained variance, with a point estimate of 1%. Combined, these results suggest that the Big Five personality traits explain between 1% and 20% of the variance in life-satisfaction, with a point estimate of 10% explained variance. Worded differently, the Big Five traits explain 10 +/- 10% of the variance in life-satisfaction judgments. Another 7 +/- 6 percent of the variance is explained by halo bias. In the following sections, I discuss the implications of these findings for future research on personality and well-being.

Improvement in Measurement

Future research needs to improve the construct validity of Big Five measures. Existing measures are ad-hoc scales that lack a clear theoretical foundation and a clear rationale for the selection of items. New measures like the Big Five Inventory 2 are an improvement, but effect size estimates with this measure that control for halo variance are lacking. Even the BFI-2 has limitations. It measures the higher-order Big Five factors with three facet measures each, but more facet measures would be better to obtain stable estimates of the factor loadings of facets on the Big Five factors.

Longitudinal Evidence

The Big Five personality factors and their facets are conceptualized as stable personality dispositions. In support of this view, longitudinal studies show that the majority of the variance in Big Five measures is stable (Anusic & Schimmack, 2016). There is also evidence that up to 50% of the variance in well-being measures is stable (Anusic & Schimmack, 2016), but that there is more state variance in well-being that changes in response to life circumstances (Schimmack & Lucas, 2010). Taken together, these results suggest that personality traits should account for a larger portion of the stable variance in well-being than of the state variance. Future studies need to test this prediction with longitudinal studies of personality and well-being.

Mediating Processes

Most of the research on personality and well-being has been limited to correlational studies. Therefore, theories that explain how personality traits influence well-being are rare. One theory postulates that extraversion and neuroticism are affective dispositions that produce individual differences in affect independent of situational factors (mood), and that this mood colors life-evaluations (Schimmack, Diener, & Oishi, 2002). Alternative theories suggest that personality traits influence actual life circumstances or interact with environmental factors to produce individual differences in well-being. To test these theories, it is important to include measures of environmental factors in studies of personality and well-being. Moreover, sample sizes have to be large to detect interaction effects.

Halo and Well-Being

The presence of halo variance in personality ratings has been known for over 100 years (Thorndike, 1920). Over the years, this variance has been attributed to mere response styles or considered evidence of a motivation to boost self-esteem. There have been few attempts to test empirically whether halo variance is merely a rating bias (other-deception) or reflects positive illusions about the self. This makes it difficult to interpret the contribution of halo variance to variance in self-ratings of well-being. Does this finding merely show shared method variance among self-ratings, or do positive illusions about the self increase positive affect, life-satisfaction, and well-being? Studies with informant ratings of well-being are rare, but they tend to show no relationship between halo bias and informant ratings of well-being (Kim, Schimmack, & Oishi, 2012; Schimmack & Kim, 2020). This suggests that halo variance is merely rating bias, but it remains possible that positive illusions increase well-being in ways that are not reflected in informant ratings of well-being.

Normative Theories of Personality Change

Some personality psychologists have proposed a normative theory of personality. Accordingly, the end goal of personality development is to become low in neuroticism and high in agreeableness and conscientiousness. This personality type is considered more mature. However, a justification for this normative model of personality is lacking. The main objective justification for normative theories of personality is optimal functioning because functions have clear, external standards of evaluation. For example, within the context of work psychology, conscientiousness can be evaluated positively because highly conscientious workers are more productive. However, there are no objective criteria to evaluate people’s lives. Thus, the only justification for normative theories of personality would be evidence that some personality traits make it more difficult for individuals to achieve high well-being. The present results suggest that neuroticism is the key personality trait that impedes well-being. However, the results do not support the notion that high agreeableness or high conscientiousness is normatively better than low agreeableness or low conscientiousness because these traits can vary without notable effects on well-being.

Conclusion

Correlations between personality and well-being measures have been reported since Hartmann’s (1936) seminal study of neuroticism and well-being. The literature has grown and has been meta-analyzed repeatedly. The results consistently show that neuroticism is the strongest predictor of life-satisfaction ratings, but that extraversion, agreeableness, and conscientiousness also show notable simple correlations with life-satisfaction. The present meta-analysis went beyond simple correlations and separated content variance from evaluative variance (halo) in self-ratings of personality. The results showed that halo predicts a substantial amount of variance in well-being ratings and accounts for most of the correlations of agreeableness and conscientiousness with well-being. Future studies of personality and well-being need to separate substantive and evaluative variance to take the evaluative nature of personality ratings into account.

Unconscious Thought Theory in Decline

In the late 1980s, experimental social psychology rediscovered the unconscious. One reason might be that psychological laboratories started to use personal computers to conduct studies. This made it possible to present subliminal stimuli or measure reaction times cheaply. Another reason might have been that conscious social cognitive processes are relatively boring and easily accessible to introspection. It was difficult to find novel and eye-catching results with self-reports. The so-called implicit revolution (Greenwald & Banaji, 2017) is still going strong, but first signs of problems are visible everywhere. An article by Ap Dijksterhuis (2004) proposed that unconscious processes are better than conscious deliberation at making complex choices. This article stimulated research on unconscious thought theory.

Figure 1 shows publication and citation rates in Web of Science. Notably, the publication rate increased steeply until 2011, the year the replication crisis in social psychology started. Afterwards, publications show a slowly decreasing trend. However, citations continue to increase, suggesting that concerns about the robustness of published results have not reduced trust in this literature.

A meta-analysis and failed replication study raised concerns that many findings in this literature may be false positive results (Nieuwenstein et al., 2017). To further examine the credibility of this literature, I subjected the 220 articles in the Web of Science topic search to a z-curve analysis. I first looked for matching articles in a database of articles from 121 psychology journals that includes all major social psychology journals (Schimmack, 2022). This search retrieved 44 articles. An automatic search of these 44 articles produced 534 test statistics. A z-curve analysis of these test statistics showed 64% significant results (not counting marginally significant results, z > 1.65), but the z-curve estimate of power was only 30%. The 95% confidence interval ranges from 10% to 50% and does not include the observed discovery rate of 64%. Thus, there is clear evidence that the published rate of significant results is inflated by unscientific research practices.

An EDR of 30% implies that up to 12% of significant results could be false positive results (Soric, 1989). However, due to uncertainty in the estimate of the EDR, the upper limit of false positive results could be as high as 49%. The main problem is that it is unclear which of the published results are false positives and which ones are real effects. Another problem is that selection for significance inflates effect size estimates and that actual effects are likely to be smaller than published effect size estimates.

One solution to this problem is to focus on results with stronger evidence against the null-hypothesis by lowering the criterion for statistical significance. Some researchers have proposed setting alpha to .005. Figure 3 shows the implications of this criterion value.

The observed discovery rate is now only 34% because many results that were significant with alpha = .05 are no longer significant with alpha = .005. The expected discovery rate also decreases, but the more stringent criterion for significance lowers the false discovery risk to 3% and even the upper limit of the 95% confidence interval is only 20%. This suggests that most of the results with p-values below .005 reported a real effect. However, automatic extraction of test statistics does not distinguish between focal tests of unconscious thought theory and incidental tests of other hypotheses. Thus, it is unclear how many and which of these 184 significant results provide support for unconscious thought theory. The failed replication study by Nieuwenstein et al. (2017) suggests that it is not easy to find conditions under which unconscious thought is superior. In conclusion, there is presently little to no empirical support for unconscious thought theory, but research articles and literature reviews often cite the existing literature as if these studies can be trusted. The decrease in new studies suggests that it is difficult to find credible evidence.
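To see how the stricter criterion changes the observed discovery rate, one only needs to count the test statistics that exceed the z-value corresponding to p = .005 (two-tailed). A sketch, assuming z is a vector of the 534 absolute z-scores from the automatic search (replaced here by placeholder values):

# Observed discovery rates under the conventional and the stricter alpha.
z <- abs(rnorm(534, mean = 1.8, sd = 1))   # placeholder, not the actual data

mean(z > qnorm(1 - .05 / 2))    # ODR with alpha = .05  (z > 1.96)
mean(z > qnorm(1 - .005 / 2))   # ODR with alpha = .005 (z > 2.81)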