
Replicability Rankings of Psychological Science 2021

A major replication project of 100 studies published in 2008 showed that only 37% of published significant results could be replicated (i.e., the replication study also produced a significant result). This finding has raised concerns about the replicability of psychological science (OSC, 2015).

Conducting actual replication studies to examine the credibility of published results is effortful and costly. Therefore, my colleagues and I developed a cheaper and faster method to estimate the replicability of published results on the basis of published test statistics (Schimmack, 2022). The method produces two estimates of replicability that represent the best-case and the worst-case scenario. The expected replication rate (ERR) assumes that it is possible to replicate studies exactly. When this is not the case, selection for significance and regression to the mean lead to lower success rates in actual replication studies. The expected discovery rate (EDR) estimates the percentage of significant results among all statistical tests that researchers conduct in their laboratories, including those that remain unpublished. If selection for significance is ineffective in reducing the risk of false positive results or in selecting studies with more power, replication studies are expected to be no more successful than original studies (Brunner & Schimmack, 2020). In the absence of further information, I use the average of the EDR and ERR as the best prediction of the outcome of actual replication studies. I call this index the Actual Replicability Prediction (ARP). Whereas previous rankings relied exclusively on the ERR, the 2021 rankings start using the ARP to rank the replicability of journals.
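In code, the ARP is simply the midpoint of the two bounds; a minimal sketch (my own illustration, not the authors' implementation):

```python
def actual_replicability_prediction(edr: float, err: float) -> float:
    """Average the worst-case (EDR) and best-case (ERR) replicability estimates."""
    if not (0.0 <= edr <= err <= 1.0):
        raise ValueError("expected 0 <= EDR <= ERR <= 1")
    return (edr + err) / 2

# With an EDR of .26 and an ERR of .64 (the 2008 estimates discussed below),
# the predicted replication rate is the midpoint, .45.
print(round(actual_replicability_prediction(0.26, 0.64), 2))  # 0.45
```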

Figure 1 shows the average ARP, ERR, and EDR for a broad range of psychology journals. Given the large number of test statistics in each year (k > 100,000), the estimates are very precise. The dashed lines show the 95%CI around the linear trend line. The results show that the ARP has increased from around 50% to slightly under 60%. This finding shows that results published in psychological journals have become a bit more replicable, although this prediction needs to be verified with actual replication studies.

However, the increase is not uniform across journals. Whereas some journals in social psychology show big increases, others show no change. The big increases in social psychology partly reflect the very low replication rates in this field before 2015 (OSC, 2015). For readers of journals, changes over time are less important than current replication rates. Table 1 shows the rankings of journals. Predicted replication rates range from an astonishing 97% for the Journal of Individual Differences to a disappointing 37% for Annals of Behavioral Medicine. Of course, results for 2021 are influenced by sampling error. More detailed information about previous years and trends can be found by clicking on the journal name.

For now, you can compare these results to previous results using prior rankings from 2020, 2019, or 2018 (these posts only report the ERR).

Rank | Journal | ARP 2021 | EDR 2021 | ERR 2021
1 | Journal of Individual Differences | 97 | 96 | 97
2 | Journal of Occupational Health Psychology | 91 | 91 | 91
3 | JPSP-Personality Processes and Individual Differences | 81 | 76 | 86
4 | Archives of Sexual Behavior | 78 | 74 | 81
5 | JPSP-Interpersonal Relationships and Group Processes | 78 | 75 | 82
6 | Sex Roles | 76 | 73 | 79
7 | Aggressive Behaviours | 74 | 66 | 82
8 | Evolutionary Psychology | 74 | 70 | 78
9 | Social Psychology | 74 | 69 | 79
10 | Journal of Memory and Language | 73 | 67 | 79
12 | Self and Identity | 73 | 69 | 77
13 | Journal of Organizational Psychology | 71 | 56 | 86
14 | Attention, Perception and Psychophysics | 70 | 62 | 77
15 | European Journal of Developmental Psychology | 70 | 67 | 74
16 | Journal of Child Psychology and Psychiatry and Allied Disciplines | 70 | 58 | 82
17 | Law and Human Behavior | 70 | 62 | 78
18 | Psychology of Religion and Spirituality | 70 | 67 | 73
20 | Judgment and Decision Making | 69 | 56 | 83
21 | Political Psychology | 69 | 58 | 80
22 | Journal of Family Psychology | 68 | 63 | 72
23 | Journal of Abnormal Psychology | 66 | 58 | 75
24 | J. of Exp. Psychology – Learning, Memory & Cognition | 66 | 57 | 74
25 | J. of Exp. Psychology – Human Perception and Performance | 65 | 53 | 77
26 | Journal of Comparative Psychology | 64 | 44 | 85
27 | Behavior Therapy | 63 | 50 | 77
28 | Journal of Research on Adolescence | 63 | 56 | 69
29 | Quarterly Journal of Experimental Psychology | 63 | 51 | 76
30 | British Journal of Developmental Psychology | 62 | 48 | 76
31 | Personality and Individual Differences | 62 | 46 | 78
32 | Psychology and Aging | 62 | 49 | 75
33 | Psychonomic Bulletin and Review | 62 | 49 | 75
34 | Psychological Science | 62 | 44 | 79
35 | Acta Psychologica | 61 | 46 | 76
36 | Cognition and Emotion | 61 | 44 | 78
37 | Journal of Sex Research | 61 | 38 | 84
38 | Child Development | 60 | 46 | 75
39 | Cognitive Development | 60 | 48 | 71
40 | European Journal of Social Psychology | 60 | 45 | 76
41 | Evolution & Human Behavior | 59 | 36 | 81
42 | Journal of Experimental Psychology – Applied | 59 | 44 | 73
43 | Journal of Experimental Child Psychology | 59 | 46 | 71
44 | JPSP-Attitudes & Social Cognition | 59 | 43 | 75
45 | Memory and Cognition | 59 | 38 | 80
46 | Cognitive Therapy and Research | 58 | 47 | 69
47 | Journal of Experimental Psychology – General | 58 | 40 | 76
48 | Journal of Experimental Social Psychology | 58 | 39 | 78
49 | Journal of Health Psychology | 58 | 42 | 74
50 | Personality and Social Psychology Bulletin | 58 | 38 | 78
51 | Social Development | 58 | 46 | 70
52 | Journal of Nonverbal Behavior | 57 | 36 | 78
53 | Motivation and Emotion | 57 | 44 | 69
55 | Social Psychological and Personality Science | 57 | 36 | 78
56 | Cognitive Behaviour Therapy | 56 | 40 | 72
57 | Developmental Psychology | 56 | 43 | 69
58 | Frontiers in Psychology | 56 | 42 | 70
59 | Consciousness and Cognition | 55 | 37 | 73
60 | Journal of Applied Psychology | 55 | 39 | 70
61 | Journal of Behavioral Decision Making | 55 | 30 | 79
62 | Journal of Cross-Cultural Psychology | 55 | 39 | 70
63 | Journal of Cognition and Development | 55 | 36 | 73
65 | Addictive Behaviors | 54 | 33 | 76
66 | Asian Journal of Social Psychology | 54 | 36 | 72
67 | International Journal of Psychophysiology | 54 | 40 | 68
68 | British Journal of Psychology | 53 | 38 | 68
70 | Frontiers in Behavioral Neuroscience | 53 | 43 | 63
71 | Journal of Affective Disorders | 53 | 31 | 75
72 | Journal of Child and Family Studies | 53 | 39 | 67
73 | Journal of Cognitive Psychology | 53 | 31 | 75
74 | Journal of Research in Personality | 53 | 31 | 74
76 | Psychology of Men and Masculinity | 53 | 34 | 72
77 | Psychology of Music | 53 | 41 | 66
79 | British Journal of Social Psychology | 52 | 36 | 67
80 | Journal of Applied Social Psychology | 52 | 39 | 66
81 | Journal of Business and Psychology | 52 | 32 | 71
82 | Organizational Behavior and Human Decision Processes | 52 | 25 | 80
83 | Psychology and Marketing | 52 | 36 | 67
84 | Psychological Medicine | 52 | 42 | 63
85 | Animal Behavior | 51 | 28 | 75
86 | Behavioural Brain Research | 51 | 30 | 73
87 | Canadian Journal of Experimental Psychology | 51 | 25 | 77
88 | Journal of Personality | 51 | 23 | 80
89 | Journal of Religion and Health | 51 | 34 | 67
91 | Group Processes & Intergroup Relations | 50 | 27 | 72
92 | Journal of Social and Personal Relationships | 50 | 27 | 73
93 | Biological Psychology | 49 | 32 | 67
94 | Depression & Anxiety | 49 | 33 | 65
95 | Experimental Psychology | 49 | 26 | 72
96 | Journal of Consumer Research | 49 | 37 | 61
97 | Journal of Educational Psychology | 49 | 23 | 76
98 | Journal of Youth and Adolescence | 49 | 27 | 72
99 | Behaviour Research and Therapy | 48 | 31 | 65
101 | Journal of Consumer Behaviour | 48 | 22 | 74
102 | Developmental Psychobiology | 47 | 34 | 59
103 | Frontiers in Human Neuroscience | 47 | 30 | 64
104 | Journal of Consulting and Clinical Psychology | 47 | 33 | 61
105 | Hormones and Behavior | 46 | 28 | 63
106 | Journal of Anxiety Disorders | 46 | 20 | 73
107 | Journal of Positive Psychology | 46 | 24 | 67
108 | Cognitive Psychology | 45 | 22 | 67
109 | European Journal of Personality | 45 | 32 | 59
110 | Psychology Crime and Law | 45 | 19 | 71
111 | Developmental Science | 44 | 25 | 63
112 | Journal of Consumer Psychology | 44 | 18 | 70
113 | Journal of Social Psychology | 43 | 17 | 70
114 | Behavioral Neuroscience | 42 | 16 | 67
115 | Journal of Happiness Studies | 42 | 16 | 67
116 | Journal of Occupational and Organizational Psychology | 42 | 23 | 61
117 | Personal Relationships | 42 | 17 | 68
118 | Journal of Vocational Behavior | 41 | 17 | 65
119 | Health Psychology | 39 | 19 | 59
120 | Journal of Counseling Psychology | 38 | 16 | 61
121 | Annals of Behavioral Medicine | 37 | 19 | 54

If you liked this post, you might also be interested in “Estimating the False Discovery Risk in Psychological Science.”

More Madness than Genius: Meta-Traits of Personality

Digman (1997) published an article that aimed to explain the correlations among self-rating scales of the Big Five personality traits in terms of two orthogonal higher-order factors. One factor related Extraversion and Openness; the other related Emotional Stability (the opposite of Neuroticism), Agreeableness, and Conscientiousness.

This model has had relatively little influence on personality psychology, except for work by Colin DeYoung. The first article on the higher-order factors was published when he was a graduate student with his supervisor Jordan B. Peterson (DeYoung, Peterson, & Higgins, 2002).

In this article, the authors relabeled Digman’s factors as Stability (Emotional Stability, Agreeableness, and Conscientiousness) and Plasticity (Extraversion and Openness). They suggested that Stability is related to serotonin and Plasticity to dopamine.

“We present a biologically predicated model of these two personality factors, relating them to serotonergic and dopaminergic function, and we label them Stability (Emotional Stability, Agreeableness, and Conscientiousness) and Plasticity (Extraversion and Openness)” (p. 533).

The article, however, does not test relationships between biological markers of these neurotransmitter systems and variation in personality. In this regard, the article merely introduces a hypothesis, but does not provide empirical support for or against it. The only empirical evidence in support of the hypothesis would be that Big Five factors are actually related to each other in the way Digman proposed. Evidence to the contrary would falsify a biological model that predicts these relationships.

The main empirical prediction of the model is that Stability and Plasticity predict variation in self-ratings of conformity.

“Based on this model, we hypothesize that Stability will positively predict conformity (as indicated by socially desirable responding) and that Plasticity will negatively predict conformity” (p. 533).

The authors claim to have found support for this prediction.

“A structural equation model indicates that conformity is indeed positively related to Stability (university sample: b = 0.98; community sample: b = 0.69; P < 0.01 for both) and negatively related to Plasticity (university sample: b = -0.48, P < 0.07; community sample: b = -0.42, P < 0.05).”

Readers familiar with structural equation modeling may be surprised by the strong relationship between Stability and Conformity, especially in the student sample. A standardized parameter of .98 implies that these constructs are nearly perfectly correlated. Relationships of this magnitude are usually not a cause for celebration. They imply either a lack of discriminant validity (i.e., two measures actually measure the same construct) or model misspecification.

To understand what is going on in this study, it is helpful to inspect the actual pattern in the data. Fortunately, it was a common practice in personality psychology to share this information in the form of the raw correlation matrices even before open science became the norm in other fields of psychology. We can therefore inspect the published correlation matrix.

First, the two conformity measures (1 = Impression Management, 2 = Lie Scale) show a moderate correlation, r = .53, indicating that they measure a common construct.

Second, both conformity measures show sizeable correlations with the Stability traits: Emotional Stability/Neuroticism, r1 = -.37, .36, r2 = .24, -.31; Agreeableness, r1 = .33, .42, r2 = .36, .31; and Conscientiousness, r1 = .33, .38, r2 = .33, .39. In contrast, the conformity measures are unrelated to the Plasticity traits: Extraversion/Surgency, r1 = -.05, -.05, r2 = .03, .04, and Openness/Intellect, r1 = .01, -.10, r2 = .04, -.13. The latter finding raises concerns about the negative relationship between the Plasticity factor and the Conformity factor in DeYoung and Peterson’s model, and by extension about the theory that predicted this negative relationship.

Third, we can examine the correlations among the Big Five measures. According to Digman’s model, Stability and Plasticity are expected to be independent. Accordingly, cross-meta-factor correlations (e.g., Extraversion and Agreeableness, or Emotional Stability and Openness) should be close to zero. Inspection of Table 1 shows that this is not the case. For example, TDA Surgency correlates r = .23 with TDA Agreeableness, r = .19 with TDA Conscientiousness, r = .16 with NEO Conscientiousness, and r = -.39 with NEO Neuroticism. These correlations need to be modeled to obtain a good-fitting model.

Fourth, we can examine whether the pattern of correlations confirms the key prediction of Digman’s model: Stability traits should be more strongly correlated with each other than with Plasticity traits, and vice versa. The comparison of these correlations follows Campbell and Fiske’s (1959) approach to examining convergent and discriminant validity. It is easy to see that the pattern of correlations does not fully support the predicted structure. For example, the Plasticity correlations of TDA Surgency with TDA Intellect, r = .21, and NEO Openness, r = .23, are weaker than its correlations with TDA Emotional Stability, r = .27, and NEO Neuroticism, r = -.39. Results like these raise concerns that the published model misrepresents the actual pattern in the data.

The published model is shown in Figure 2. As noted before, the high relationship between the Stability factor and the Conformity factor is a concern. A similar concern arises from the high loading of Extraversion on the Plasticity factor, b = .95. Accordingly, Plasticity is nearly identical with Extraversion.

It is well known that even well-fitting models do not prove that the proposed model generated the observed pattern of correlations. It is good practice to compare preferred models to plausible alternative models. Model comparison can be used to weed out bad models, but the winner may still not be the right model. That is, we can falsify false models, but we cannot verify the right model.

I first fitted a measurement model to the correlations among the Big Five indicators in Table 1. It is noteworthy that the authors were unable to fit a model to the data in Table 1.

“While it would have been an attractive possibility to use the two measures of each Big Five trait for Sample 1 in order to create a hierarchical factor model, with latent variables for Stability and Plasticity derived from latent variables for each of the Big Five, the many intercorrelations among the 10 Big Five scales rendered such a model impractical” (p. 542).

Their justification makes no sense to anybody familiar with structural equation modeling, and there are published models that use 2, 3, or 4 indicators per factor to create a measurement model of the Big Five (Anusic et al., 2009). To achieve satisfactory fit, it is necessary to allow for some secondary loadings and correlated residuals. These parameters reflect the fact that Big Five scales are impure indicators of the Big Five factors that are contaminated with specific item content. Purists may object to this exploratory approach, but they would have to terminate modeling because a simple-structure model does not have satisfactory fit. Thus, the only way to proceed and to test the model is to modify the model until it has adequate fit and to conduct further tests with better data in the future.

Modification of the measurement model was terminated when no major modification indices were present, chi2 < 10. Final model fit was acceptable, CFI = .989, RMSEA = .055.

All primary loadings were high, b > .7. All secondary loadings were below .3. Notable correlated residuals were present for TDA Conscientiousness (con) and TDA Agreeableness (agr), and for NEO Conscientiousness (neoc) and NEO Neuroticism (neon). Neuroticism was reverse-scored so that higher scores reflect Emotional Stability.

The correlations among the Big Five factors show generally positive correlations, which is a typical finding. There is some evidence for convergent and discriminant validity of the meta-traits. The highest correlations are for Agreeableness and Emotional Stability, r = .406 (Stability), Conscientiousness and Emotional Stability r = .379 (Stability), and Openness and Extraversion, r = .351 (Plasticity), and Agreeableness and Conscientiousness, r = .323 (Stability).

However, a model that tried to model the Big Five correlations with two independent meta-traits reduced model fit, CFI = .972, RMSEA = .074. As can be seen in Figure 1, DeYoung and Peterson solved this problem by letting the Stability and Plasticity factor correlate without providing a theoretical explanation for this correlation. Adding this correlation to the model improved model fit.

It is now possible to add conformity to the model to reproduce the published results. Model fit remained acceptable, but the standardized effect of Stability on Conformity exceeded 1, b = 1.30. This problem could be solved by relaxing the equality constraint for the loadings of Extraversion and Openness on Plasticity, which was needed in the model without a criterion. However, even this model had the problem that the residual variance in conformity was negative. The reason is that the model is misspecified.

The key problem with this model is the ad hoc, atheoretical correlation between the two higher-order factors. With the benefit of hindsight, we know from multi-trait multi-method studies that correlations among all Big Five traits are an artifact of response styles (Biesanz & West, 2004). One of these studies was even published by DeYoung (2006), so there should be no disagreement with him. Anusic et al. (2009) showed that we do not need multi-method data to control for these rating biases; instead, a method factor can be added to the model. I have improved on Anusic et al.’s approach and started to model this method factor as a factor that directly influences the indicators. As a result, the Big Five factors are independent of method variance. In this model, Stability and Plasticity are independent if they are identified.

Figure 2 shows the results.

In this model, Plasticity was no longer a significant predictor of conformity, b = -.064, but the small sample size does not provide precise effect size estimates, 95%CI = -.389 to .261. The standardized coefficient for Stability remained greater than 1, b = 1.124, but the 95%CI included 1, 95%CI = .905 to 1.343.

This pointed towards another crucial problem with DeYoung and Peterson’s model. Their model assumes that the unique variance of Neuroticism, Agreeableness, and Conscientiousness is unrelated to conformity. This assumption might be false. An alternative model would still assume that Stability is related to Conformity, but that this relationship is indirect; that is, it is mediated by the Big Five factors. This model fitted the data slightly better, but fit cannot distinguish between these two models, CFI = .975, RMSEA = .059.

More importantly, in this model the residual variance in conformity was positive, suggesting that conformity is not fully explained by the Big Five factors. About one-quarter of the variance in conformity was unexplained, uniqueness = 28%. The total indirect effect of Stability on Conformity was b = .61, implying that .61^2 = 37% of the variance in Conformity was explained by Stability. This implies that the remaining (1 - .28) - .37 = 35% of the variance in Conformity is explained by unique variance in the Big Five Stability factors (Neuroticism, Agreeableness, & Conscientiousness).

The new analyses of the results suggest that the published model is misleading in several ways.

1. Plasticity is not a negative predictor of Conformity

2. Stability explains only about a third of the variance in Conformity, not 100%.

3. The correlations of Agreeableness, Conscientiousness, and Neuroticism with Conformity are not spurious (i.e., Stability is a third variable). Instead, Agreeableness, Conscientiousness, and Neuroticism mediate the relationship of Stability with Conformity.

4. The published model overestimates the amount of shared variance among Big Five factors because it does not control for response biases and made false assumptions about causality.

Does it matter?

The discussion section of the article used the model to make wide-reaching claims about personality, drawing explicitly on the finding that plasticity is a negative predictor of conformity.

As shown here, these conclusions are based on a false model. At best, we can conclude from this article that (a) the meta-traits were still identified even after response styles were controlled and (b) conformity measures appear to be related to the Stability factors and not the Plasticity factors. However, since the publication of this article, better studies with multi-method data have examined how Big Five factors are correlated (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). These studies show mixed results and are still limited by the use of scale scores as indicators of the Big Five factors. Thus, it remains unclear whether meta-traits really exist and how much variance in the Big Five traits they explain.

The existence of meta-traits is also not very important for studies that try to predict criterion variables like conformity from the Big Five. There is no theoretical justification to assume that the unique variance components of the Big Five are unrelated to the criterion. As a result, the Big Five can be used as predictors and any effect of the meta-traits would show up as an indirect effect that is mediated by the Big Five.

This model was also used by another Jordan Peterson student, in a study that predicted environmental concerns from the Big Five (Hirsch, 2010).

The most notable finding is that neuroticism, agreeableness, and conscientiousness are all positive predictors of environmental concerns. This is a problem for a model that assumes Stability is a positive predictor, because neuroticism loads negatively on Stability (negative loading on Stability x positive effect of Stability on environmental concerns = negative correlation between neuroticism and environmental concerns). Once more, we see that it is unreasonable to assume that the unique variances of the Big Five are unrelated to criterion variables. Criterion variables cannot be used to validate the meta-traits. What would be needed are causal factors that produce the shared variance among Big Five traits. However, it has been difficult to find specific causes of personality variation. Thus, the only evidence for these factors is limited to patterns of correlations among Big Five measures. Even if these correlations are real, they do not imply that the unique variances in the Big Five are irrelevant. Thus, from a practical point of view, it is irrelevant whether the Big Five are modeled as correlated factors or with meta-traits that explain these correlations in terms of some hypothetical common causes.

Estimating The Reproducibility of Psychological Science in 2021

Psychology is the science of human affect, behavior, and cognition. Since the beginning of psychological science, psychologists have debated fundamental questions about their science. Recently, these self-reflections of a science about itself have been called meta-science. Many meta-psychological discussions are theoretical. However, some meta-psychological articles rely on empirical data. For example, Cohen’s (1962) seminal investigation of statistical power in social and clinical psychology provided an empirical estimate of statistical power to detect small, moderate, or large effect sizes.

Another empirical meta-psychological contribution was the reproducibility project by the Open Science Collaboration (2015). The project reported the outcome of 100 replication attempts of studies that were published in 2008 in three journals. The key finding was that out of 97 original studies that reported a statistically significant result, only 36 replication studies reproduced a statistically significant result; a success rate of 37%.

The low success rate has been widely cited as evidence that psychological science has a replication crisis. It also justifies concerns about the credibility of other significant results that may have an equally low probability of a successful replication.

Optimists point out that some psychology journals have implemented reforms that may have raised the replicability of published findings, such as requiring (a) a priori sample size justifications with power analysis and (b) pre-registration of data analysis plans. It is also possible that psychologists voluntarily changed their research practices after they became aware of the low replicability of their findings. Finally, some psychologists may have lowered the significance criterion to reduce the risk of false positive results that do not replicate. Yet, as of today, no empirical evidence exists that these reforms have made a notable, practically significant contribution to the replication rate in psychological science.

In this blog post, I provide a new empirical estimate of the replicability of psychological science that relies on published statistical results. It is possible to predict the outcome of replication studies based on published statistical results because both the results of the original study and the outcome of the replication study are a function of statistical power and sampling error (Brunner & Schimmack, 2020). Studies with higher statistical power are more likely to produce smaller p-values and are more likely to replicate. As a result, smaller p-values imply higher replicability. The statistical challenge is only to find a model that can use published p-values to make predictions about the success rate of replication studies. Brunner and Schimmack (2020) developed and validated z-curve as a method that can estimate the expected replication rate (ERR) based on the distribution of significant p-values after converting them into z-scores. Although the method performs well in simulation studies, it is also necessary to validate the method against the outcome of actual replication studies. The results from the OSC reproducibility project provide an opportunity to do so.

For any empirical study, it is necessary to clearly define populations and to ensure adequate sampling from them. In the present context, the populations consist of quantitative results of statistical hypothesis tests such as F-tests, t-tests, and z-tests. Most articles report several statistical tests for each sample, and these tests differ in importance. The reproducibility project focused on one statistical result to evaluate whether a replication study was successful. This result was typically chosen because it was deemed the most important, or at least one of the most important, results; such results are often called focal or critical. Ideally, statistical prediction models that aim to predict the replicability of focal tests would rely on coding of focal hypothesis tests. The main problem with this approach is that coding focal hypothesis tests requires trained coders and is time-consuming.

An alternative approach uses automatic extraction of test statistics from published articles. The advantage of this approach is that it is quick and produces large, representative samples of results published in psychology journals. The key disadvantage is that it samples from the population of all test statistics that are detected by the extraction method, and this population differs from the population of focal hypothesis tests. Therefore, the predictions of the statistical model can be biased to the extent that the two populations have different replication rates. This does not mean that these estimates are useless. Rather, a comparison with actual replication rates can be used to correct for this bias and make more accurate predictions about the replication rate of focal hypothesis tests. Ideally, these estimates can be validated in the future using hand-coding of focal hypothesis tests and actual reproducibility projects of articles published in 2021.

Validation of Z-Curve Predictions with the Reproducibility Project Results

The Reproducibility Project invited replications of studies published in three journals: Psychological Science, Journal of Experimental Psychology: Learning, Memory, and Cognition, and Journal of Personality and Social Psychology. Only articles published in 2008 were eligible.

To predict the 37% success rate (strictly speaking a postdiction, because the data preceded the criterion), I downloaded all articles from these three journals and searched them for reported chi2, t-test, F-test, and z-test results. Only results reported in the text were included. The extraction method found 10,951 statistical results. Test results were converted into absolute z-scores. The histogram of z-scores is shown in Figure 1.
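The conversion from a p-value to an absolute z-score uses the standard normal quantile function; a minimal sketch with Python's standard library (my own illustration, not the actual extraction pipeline):

```python
from statistics import NormalDist

def p_to_abs_z(p: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

# A chi2, t-test, or F-test result is first converted into its two-sided
# p-value; the p-value is then mapped onto the standard normal scale.
print(round(p_to_abs_z(0.05), 2))   # 1.96 -- the significance criterion
print(round(p_to_abs_z(0.005), 2))  # 2.81
```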

Visual inspection of the z-curve shows that the peak of the distribution is right at the criterion for statistical significance (z = 1.96, p = .05, two-sided). The figure also shows clear evidence of selection for significance, which is a well-known practice in psychology journals (Sterling, 1959; Sterling et al., 1995). The key result is the expected replication rate of 64%. This estimate is much higher than the actual replication rate of 37%. There are several explanations for this discrepancy. One explanation is, of course, that the estimate is based on a different population of test statistics. However, even predictions based on hand-coded test results overestimate replication outcomes (Schimmack, 2020). Another reason is that z-curve predictions rest on the idealistic assumption that replication studies reproduce the original study exactly. In reality, replication studies deviate at least in minor details from the original studies. When selection for significance is present, this variation across studies implies regression to the mean and a lower replication rate. Thus, z-curve predictions are expected to overestimate actual success rates when selection for significance is present, which it clearly is.
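The regression-to-the-mean argument can be illustrated with a small Monte Carlo sketch (my own illustration, not part of the z-curve method; all parameter values are assumptions): each study's true effect has a stable component plus an occasion-specific component, selection operates on the original result, an "exact" replication would repeat the occasion-specific component, and a real replication draws it afresh.

```python
import random

random.seed(1)
CRIT = 1.96  # two-sided .05 criterion on the z-scale
TAU = 0.8    # assumed occasion-to-occasion variability of the true effect

err_style, actual = [], []
for _ in range(100_000):
    base = random.uniform(0.0, 3.0)        # stable component (power varies)
    u1 = random.gauss(0.0, TAU)            # occasion-specific component
    z1 = random.gauss(base + u1, 1.0)      # original result with sampling error
    if abs(z1) > CRIT:                     # selection for significance
        # ERR implicitly assumes the replication repeats base + u1 exactly:
        err_style.append(abs(random.gauss(base + u1, 1.0)) > CRIT)
        # In reality the occasion-specific component is drawn afresh:
        u2 = random.gauss(0.0, TAU)
        actual.append(abs(random.gauss(base + u2, 1.0)) > CRIT)

print(f"exact-replication rate:  {sum(err_style) / len(err_style):.2f}")
print(f"actual replication rate: {sum(actual) / len(actual):.2f}")
```

With these assumed values the exact-replication rate comes out noticeably higher than the realistic one, mirroring the gap between ERR estimates and actual replication outcomes.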

Bartos and Schimmack (2021) suggested that the expected discovery rate (EDR) could be a better estimate of actual replication studies. The difference between EDR and ERR is again a difference between populations. The EDR is based on the population of all studies that were conducted. The ERR is based on the population of studies that were conducted and produced a significant result. The significance filter would select studies with higher power. However, if effect sizes vary across replications of the same study, this selection is imperfect, and a study with less power could be selected because it had an unusually large effect size that cannot be replicated. If the selection process is totally unable to distinguish between studies with higher or lower power, the success rate would match the EDR estimate. The EDR estimate of 26% is lower than the actual success rate of 37%, but not by much.

Based on these results, we see that z-curve correctly predicts that the actual success rate is higher than the EDR and lower than the ERR estimate. Using the average of the two estimates as the best prediction, we get a prediction of 45%, compared to the actual outcome of 37%. The remaining discrepancy could be partially due to the difference in populations of test statistics. Moreover, the 37% estimate is based on a small sample of studies and the difference may just be a chance finding.

In conclusion, the present results suggest that the average of the ERR and EDR estimates of z-curve models based on automatically extracted test statistics can predict actual replication outcomes. Of course, this conclusion is based on a single observation (N = 1), but the problem is that there are no other actual replication outcomes that could be used to cross-validate this result.

Replication in 2010

Research practices did not change notably between 2008 and 2010. Furthermore, 2008 was arbitrarily selected by the Open Science Collaboration to estimate the reproducibility of psychological science. The following results therefore replicate the z-curve prediction with a new dataset and examine the generalizability of the reproducibility project results across time.

The results are indeed very similar, and the predicted success rate for actual replication studies is (62 + 22)/2 = 42%.

Expanding the Sample of Journals

The aim of the OSC project was to estimate the reproducibility of psychological science. A major limitation of the study was the focus on three journals that publish mostly experimental studies in cognitive and social psychology. However, psychology is a much more diverse science that studies development, personality, mental health, and the application of psychology in many areas. It is currently unclear whether replication rates in these other areas of psychology differ from those in experimental cognitive and social psychology. To examine this question, I downloaded all articles from 121 psychology journals published in 2010 (a list of the journals can be found here). Automatic extraction of test statistics from these journals produced 109,117 z-scores. Figure 3 shows the z-curve plot.

The observed discovery rate (i.e., the percentage of statistically significant results) is identical for the 3 OSC journals and the 120 journals, 72%. The expected discovery rate is slightly higher for the 120 journals, 28% vs. 22%. The expected replication rate is also slightly higher for the broader set of journals, 67% vs. 62%. This implies a slightly higher estimate of the success rate if a random sample of studies from all areas of psychology were replicated, 48% vs. 42%. However, the difference of 6 percentage points is small and not substantively meaningful. Based on these results, it is reasonable to generalize the results of the OSC project to psychology in general. This average estimate may hide differences between disciplines. For example, the OSC project found that cognitive studies were more replicable than social studies, 50% vs. 25%. Results for individual journals also suggest differences between other disciplines.

Replicability in 2021

Undergraduate students at the University of Toronto Mississauga (Alisar Abdelrahman, Azeb Aboye, Whitney Buluma, Bill Chen, Sahrang Dorazahi, Samer El-Galmady, Surbhi Gupta, Ismail Kanca, Mustafa Khekani, Safana Nasir, Amber Saib, Mohammad Shahan, Swikriti Singh, Stuti Patel, Chuanyu Wu) downloaded all articles published in the same 121 journals in 2021. Automatic extraction of test statistics produced 161,361 z-scores. The increase is due to the publication of more articles in some of the journals.

The expected discovery rate increased from 28% in 2010 to 43% in 2021. The expected replication rate showed a smaller increase from 67% to 72%. Based on these results, the z-curve model predicts that replications of a representative sample of focal hypothesis tests from 2021 would produce a success rate of (43 + 72)/2 = 58%.
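The prediction is simply the midpoint of the two z-curve estimates. A minimal sketch of this arithmetic (the function name is mine, not part of the z-curve package):

```python
def predicted_success_rate(edr, err):
    """Predicted replication success rate: the average of the expected
    discovery rate (EDR) and the expected replication rate (ERR)."""
    return (edr + err) / 2

# 2010 estimates for the broad set of journals: EDR = 28%, ERR = 67%
print(predicted_success_rate(28, 67))  # 47.5, i.e., ~48%

# 2021 estimates: EDR = 43%, ERR = 72%
print(predicted_success_rate(43, 72))  # 57.5, i.e., ~58%
```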


The key finding is that replicability in psychological science has increased in response to evidence that the replication rate in psychology is too low. However, replicability has increased only slightly and even a replication rate of 60% suggests that many published studies are underpowered. Moreover, results in 2021 still show evidence of selection for significance; the ODR is 68%, but the EDR is only 48%. It is well known that underpowered studies that are selected for significance produce inflated effect size estimates. Thus, readers of published articles need to be careful when they interpret effect size estimates. Moreover, results of single studies are insufficient to draw strong conclusions. Replication studies are needed to provide conclusive evidence of robust effects.

In sum, these results suggest that psychological science is improving. Whether the amount of improvement over the span of a decade is sufficient is open to subjective interpretation. At least there is some improvement. This is noteworthy because many previous meta-psychological studies found no signs of improvement in response to concerns about the replicability of published results (Sedlmeier & Gigerenzer, 1989). The present results show that meta-psychology can provide empirical information about psychological science nearly in real time. However, statistical predictions need to be complemented by actual replication studies, and meta-psychologists should conduct a new reproducibility project with a broader range of journals and articles published in the past years. This blog post pre-registers the prediction that the success rate will be higher than the 37% rate in the original reproducibility project, with a success rate between 50% and 60%.

Science is self-correcting: JPSP-PPID is not

With over 7,000 citations at the end of 2021, Ryff and Keyes's (1995) article is one of the most highly cited articles in the Journal of Personality and Social Psychology. A trend analysis shows that citations are still increasing, with over 800 citations in the past two years.

Most of these citations are references to Ryff's measure of psychological well-being that uncritically accept Ryff's assertion that her PWB measure is a valid measure of psychological well-being. The abstract implies that the authors provided empirical support for Ryff's theory of psychological well-being.

Contemporary psychologists contrast Ryff’s psychological well-being (PWB) with Diener’s (1984) subjective well-being (SWB). In an article with over 1,000 citations, Ryff and Keyes (2002) tried to examine how PWB and SWB are empirically related. This attempt resulted in a two-factor model that postulates that SWB and PWB are related, but distinct forms of well-being.

The general acceptance of this model shows that most psychologists lack proper training in the interpretation of structural equation models (Borsboom, 2006), although graphic representations of these models make SEM accessible to readers who are not familiar with matrix algebra. To interpret an SEM model, it is only necessary to know that boxes represent measured variables, ovals represent unmeasured constructs, directed straight arrows represent an assumption that one construct has a causal influence on another construct, and curved bidirectional arrows imply an unmeasured common cause.

Starting from the top, we see that the model implies that an unmeasured common cause produces a strong correlation between two unmeasured variables that are labelled Psychological Well-Being and Subjective Well-Being. These labels imply that the constructs PWB and SWB are represented by unmeasured variables. The direct causal arrows from these unmeasured variables to the measured variables imply that PWB and SWB can be measured because the measured variables reflect the unmeasured variables to some extent. This is called a reflective measurement model (Borsboom et al., 2003). For example, autonomy is a measure of PWB because .38^2 = 14% of the variance in autonomy scores reflect PWB. Of course, this makes autonomy a poor indicator of PWB because the remaining 86% of the variance do not reflect the influence of PWB. This variance in autonomy is caused by other unmeasured influences and is called unique variance, residual variance, or disturbance. It is often omitted from SEM figures because it is assumed that this variance is simply irrelevant measurement error. I added it here because Ryff and users of her measure clearly do not think that 86% of the variance in the autonomy scale is just measurement error. In fact, the scale scores of autonomy are often used as if they are a 100% valid measure of autonomy. The proper interpretation of the model is therefore that autonomy is measured with high validity, but that variation in autonomy is only a poor indicator of psychological well-being.

Examination of the factor loadings (i.e., the numbers next to the arrows from PWB to the six indicators) shows that personal relationships has the highest validity as a measure of PWB, but even for personal relationships, the amount of PWB variance is only .66^2 = 44%.
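These percentages follow from squaring the standardized factor loadings. A small sketch of the arithmetic (assuming a standardized solution; the function name is hypothetical):

```python
def shared_variance(loading):
    """Proportion of indicator variance explained by the latent factor,
    given a standardized factor loading."""
    return loading ** 2

print(round(shared_variance(0.38) * 100))  # 14 (% of autonomy variance reflecting PWB)
print(round(shared_variance(0.66) * 100))  # 44 (% for personal relationships)
```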

In a manuscript (doc) that was desk-rejected by JPSP, we challenged this widely accepted model of PWB. We argued that the reflective model does not fit Ryff’s own theory of PWB. In a nutshell, Ryff’s theory of PWB is one of many list-theories of well-being (Sumner, 1996). The theory lists a number of attributes that are assumed to be necessary and sufficient for high well-being.

This theory of well-being implies a different measurement model in which arrows point from the measured variables to the construct of PWB. In psychometrics, these models are called formative measurement models. There is nothing unobserved about formative constructs. They are merely a combination of the measured variables. The simplest way to integrate information about the components of PWB is to average them. If assumptions about importance are added, the construct could be a weighted average. This model is shown in Figure 2.

The key problem for this model is that it makes no predictions about the pattern of correlations among the measured variables. For example, Ryff’s theory does not postulate whether an increase in autonomy produces an increase in personal growth or a decrease in personal relations. At best, the distinction between PWB and SWB might imply that changes in PWB components are independent of changes in SWB components, but this assumption is highly questionable. For example, some studies suggest that positive relationships improve subjective well-being (Schimmack & Lucas, 2010).

To conclude, JPSP has published two highly cited articles that fitted a reflective measurement model to PWB indicators. In the desk-rejected manuscript, Jason Payne and I presented a new model that is grounded in theories of well-being and that treats PWB dimensions like autonomy and positive relations as possible components of a good life. Our model also clarified the confusion about Diener’s (1984) model of subjective well-being.

Ryff et al.’s (2002) two-factor model of well-being was influenced by Ryan and Deci’s (2001) distinction between two broad traditions in well-being research: “one dealing with happiness (hedonic well-being), and one dealing with human potential (eudaimonic well-being; Ryan & Deci, 2001; see also Waterman, 1993)” (Ryff et al., 2002, p. 1007). We argued that this dichotomy overlooks another important distinction between well-being theories, namely the distinction between subjective and objective theories of well-being (Sumner, 1996). The key difference between objective and subjective theories of well-being is that objective theories aim to specify universal aspects of a good life that are based on philosophical analyses of the good life. In contrast, subjective theories reject the notion that universal criteria of a good life exist and leave it to individuals to create their own evaluation standards of a good life (Cantril, 1965). Unfortunately, Diener’s tripartite model of SWB is difficult to classify because it combines objective and subjective indicators. Whereas life-evaluations like life-satisfaction judgments are clearly subjective indicators, the amount of positive affect and negative affect implies a hedonistic conception of well-being. Diener never resolved this contradiction (Busseri & Sadava, 2011), but his writing made it clear that he stressed subjectivity as an essential component of well-being.

It is therefore incorrect to characterize Diener’s concept of SWB as a hedonic or hedonistic conception of well-being. The key contribution of Diener was to introduce psychologists to subjective conceptions of well-being and to publish the most widely used subjective measure of well-being, namely the Satisfaction with Life Scale. In my opinion, the inclusion of PA and NA in the tripartite model was a mistake because it does not allow individuals to choose what they want to do with their lives. Even Diener himself published articles suggesting that positive affect and negative affect are not essential for all people (Suh, Diener, Oishi, & Triandis, 1998). At the very least, it remains an empirical question how important positive affect and negative affect are for subjective life evaluations and whether other aspects of a good life are even more important. This question can be tested by examining how much eudaimonic and hedonic measures of well-being contribute to variation in subjective measures of well-being. It leads to a model in which life-satisfaction judgments are a criterion variable and the other variables are predictor variables.

The most surprising finding was that environmental mastery was a strong unique predictor and a much stronger predictor than positive affect or negative affect (direct effect, b = .66).

In our model, we also allowed for the possibility that PWB attributes influence subjective well-being by increasing positive affect or decreasing negative affect. The total effect is a very strong relationship, b = .78, with more than 50% of the variance in life-satisfaction being explained by a single PWB dimension, namely environmental mastery.

Other noteworthy findings were that none of the other PWB attributes made a positive (direct or indirect) contribution to life-satisfaction judgments. Autonomy was even a negative predictor. The effects of positive affect and negative affect were statistically significant, but small. This suggests that PA and NA are meaningful indicators of subjective well-being because they reflect a good life, but it provides no evidence for hedonic theories of well-being that suggest positive affect increases well-being no matter how it is elicited.

These results are dramatically different from the published model in JPSP. In that model an unmeasured construct, SWB, causes variation in Environmental Mastery. In our model, environmental mastery is a strong cause of the only subjective indicator of well-being, namely life-satisfaction judgments. Whereas the published model implies that feeling good makes people have environmental mastery, our model suggests that having control over one’s life increases well-being. Call us crazy, but we think the latter model makes more sense.

So, why was our manuscript desk-rejected without peer review from experts in well-being research? I post the full decision letter below, but I want to highlight the only comment about our actual work.

A related concern has to do with a noticeable gap between your research question, theoretical framework, and research design. The introduction paints your question in broad strokes only, but my understanding is that you attempt to refine our understanding of the structure of well-being, which could be an important contribution to the literature. However, the introduction does not provide a clear rationale for the alternative model presented. Perhaps even more important, the cross-sectional correlational study of one U.S. sample is not suited to provide strong conclusions about the structure of well-being. At the very least, I would have expected to see model comparison tests to compare the fit of the presented model with those of alternative models. In addition, I would have liked to see a replication in an independent sample as well as more critical tests of the discriminant validity and links between these factors, perhaps in longitudinal data, through the prediction of critical outcomes, or by using behavioral genetic data to establish the genetic and environmental architecture of these factors. Put another way, independent of the validity of the Ryff / Keyes model, the presented theory and data did not convince me that your model is a better presentation of the structure of well-being.

Bleidorn’s comments show that even prominent personality researchers lack basic understanding of psychometrics and construct validation. For example, it is not clear how longitudinal data can provide answers to questions about construct validity. Examining change is of course useful, but without a valid measure of a construct it is not clear what change in scale scores means. Construct validation precedes studies of stability and change. Similarly, it is only relevant to examine nature and nurture questions with a clear phenotype. Bleidorn completely ignores our distinction between hedonic and subjective well-being and the fact that we are the first to examine the relationship between PWB attributes and life-satisfaction.

As psychometricians have pointed out, personality psychologists often ignore measurement questions and are often content with averaged self-report ratings as operationalized constructs that do not require further validation. We think that this blind empiricism is preventing personality psychology from making real progress. It is depressing to see that even the new generation of personality psychologists shows no interest in improving the construct validity of foundational constructs. Fortunately, JPSP-PPID publishes only about 50 articles a year, and there are other outlets to publish our work. Unfortunately, JPSP has a reputation for publishing only the best work, but this prestige is not warranted by the actual quality of published articles. For example, the obsession with longitudinal data is not warranted given evidence that about 80% of the variance in personality measures is stable trait variance that does not change. Repeatedly measuring this trait variance does not add to our understanding of stable traits.


To conclude, JPSP has published two cross-sectional articles on the structure of well-being that continue to be highly cited. We find major problems with the models in these articles, but JPSP is not interested in publishing a criticism of them. To reiterate, the main problem is that Diener’s SWB model is treated as if it is an objective hedonic theory of well-being, when the core aspect of the model is that well-being is subjective, not objective. We thought at least the main editor, Rich Lucas, a former Diener student, would understand this point, but expectations are the mother of disappointment. Of course, we could be wrong about some minor or major issues, but the lack of interest in these foundational questions shows just how far psychology is from being a real science. A real science develops valid measures before it examines real questions. Psychologists invent measures and study their measures without evidence that their measures reflect important constructs like well-being. Not surprisingly, psychology has produced no consensual theory of well-being that could help people live better lives. This does not stop psychologists from making proclamations about ways to lead a happy or good life. The problem is that these recommendations are all contingent on researchers’ preferred definition of well-being and the measures associated with that tradition/camp/belief system. In this way, psychology is more like (other) religions and less like a science.

Decision Letter

I am writing about your manuscript “Two Concepts of Wellbeing: The Relation Between Psychological and Subjective Wellbeing”, submitted for publication in the Journal of Personality and Social Psychology (JPSP). I have read the manuscript carefully myself, as has the lead Editor at JPSP, Rich Lucas. We read the manuscript independently and then consulted with each other about whether the manuscript meets the threshold for full review. Based on our joint consultation, I have made the decision to reject your paper without sending it for external review. The Editor and I shared a number of concerns about the manuscript that make it unlikely to be accepted for publication and that reduce its potential contribution to the literature. I will elaborate on these concerns below. Due to the high volume of submissions and limited pages available to JPSP, we must limit our acceptances to manuscripts for which there is a general consensus that the contribution is of an important and highly significant level. 

  1. Most importantly, papers that rely solely on cross-sectional designs and self-report questionnaire techniques are less and less likely to be accepted here as the number of submissions increases. In fact, such papers are almost always rejected without review at this journal. Although such studies provide an important first step in the understanding of a construct or phenomenon, they have some important limitations. Therefore, we have somewhat higher expectations regarding the size and the novelty of the contribution that such studies can make. To pass threshold at JPSP, I think you would need to expand this work in some way, either by using longitudinal data or by going further in your investigation of the processes underlying these associations. I want to be clear; I agree that studies like this have value (and I also conduct studies using these methods myself), it is just that many submissions now go beyond these approaches in some way, and because competition for space here is so high, those submissions are prioritized.
  2. A related concern has to do with a noticeable gap between your research question, theoretical framework, and research design. The introduction paints your question in broad strokes only, but my understanding is that you attempt to refine our understanding of the structure of well-being, which could be an important contribution to the literature. However, the introduction does not provide a clear rationale for the alternative model presented. Perhaps even more important, the cross-sectional correlational study of one U.S. sample is not suited to provide strong conclusions about the structure of well-being. At the very least, I would have expected to see model comparison tests to compare the fit of the presented model with those of alternative models. In addition, I would have liked to see a replication in an independent sample as well as more critical tests of the discriminant validity and links between these factors, perhaps in longitudinal data, through the prediction of critical outcomes, or by using behavioral genetic data to establish the genetic and environmental architecture of these factors. Put another way, independent of the validity of the Ryff / Keyes model, the presented theory and data did not convince me that your model is a better presentation of the structure of well-being.
  3. The use of a selected set of items rather than the full questionnaires raises concerns about over-fitting and complicate comparisons with other studies in this area. I recommend using complete questionnaires and – should you decide to collect more data – additional measures of well-being to capture the universe of well-being content as best as you can. 
  4. I noticed that you tend to use causal language in the description of correlations, e.g. between personality traits and well-being measures. As you certainly know, the data presented here do not permit conclusions about the temporal or causal influence of e.g., neuroticism on negative affect or vice versa and I recommend changing this language to better reflect the correlational nature of your data.     

In closing, I am sorry that I cannot be more positive about the current submission. I hope my comments prove helpful to you in your future research efforts. I wish you the very best of luck in your continuing scholarly endeavors and hope that you will continue to consider JPSP as an outlet for your work.

Wiebke Bleidorn, PhD
Associate Editor
Journal of Personality and Social Psychology: Personality Processes and Individual Differences

Estimating the False Positive Risk in Psychological Science

Abstract: At most one-quarter of published significant results in psychology journals are false positive results. This is surprising news after a decade of false positive paranoia. However, the low false positive rate is not a cause for celebration. It mainly reflects the low a priori probability that the nil-hypothesis is true (Cohen, 1994). To produce meaningful results, psychologists need to maintain low false positive risks when they test stronger hypotheses that specify a minimum effect size.


Like many other sciences, psychological science relies on null-hypothesis significance testing as the main statistical approach to draw inferences from data. This approach can be dated back to Fisher’s first manual instructing empirical researchers how to conduct statistical analyses. If the observed test statistic produces a p-value below .05, the null-hypothesis can be rejected in favor of the alternative hypothesis that the population effect size is not zero. Many criticisms of this statistical approach have failed to change research practices.

Cohen (1994) wrote a sarcastic article about NHST with the title “The Earth is round, p < .05.” In this article, Cohen made the bold claim “my work on power analysis has led me to realize that the nil-hypothesis is always false.” In other words, population effect sizes are unlikely to be exactly zero. Thus, rejecting the nil-hypothesis with a p-value below .05 only tells us something we already know. Moreover, when sample sizes are small, we often end up with p-values greater than .05 that do not allow us to reject a false null-hypothesis. I cite this article only to point out that in the 1990s, meta-psychologists were concerned with low statistical power because it produces many false negative results. In contrast, significant results were considered to be true positive findings. Although often meaningless (e.g., the amount of explained variance is greater than zero), they were not wrong.

Since then, psychology has encountered a radical shift in concerns about false positive results (i.e., significant p-values when the nil-hypothesis is true). I conducted an informal survey on social media. Only 23.7% of twitter respondents echoed Cohen’s view that false positive results are rare (less than 25%). The majority (52.6%) of respondents assumed that more than half of all published significant results are false positives.

The results were a bit different for the poll in the Psychological Methods Discussion Group on Facebook. Here the majority opted for 25 to 50 percent false positive results.

The shift from the 1990s to the 2020s can be explained by the replication crisis in social psychology that has attracted a lot of attention and has been generalized to all areas of psychology (Open Science Collaboration, 2015). Arguably, the most influential article that contributed to concerns about false positive results in psychology is Simmons, Nelson, and Simonsohn’s (2011) article titled “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” which has been cited 3,203 times. The key contribution of this article was to show that the questionable research practices that psychologists use to obtain p-values below .05 (e.g., using multiple dependent variables) can increase the risk of a false positive result from 5% to over 60%. Moreover, anonymous surveys suggested that researchers often engage in these practices (John et al., 2012). However, even massive use of QRPs will not produce a massive amount of false positive results if most null-hypotheses are false. In this case, QRPs will inflate the effect size estimates (that nobody pays attention to, anyways), but the rate of false positive results will remain low if most tested hypotheses are true.

Some scientists have argued that scientists are much more likely to make false assumptions (e.g., the Earth is flat) than Cohen envisioned. Ioannidis (2005) famously declared that most published research findings are false. He based this claim on hypothetical scenarios that produce more than 50% false positive results when 90% of studies test a true null-hypothesis. This assumption is a near complete reversal of Cohen’s assumption that we can nearly always assume that the effect size is not zero. The problem is that the actual rate of true and false hypotheses is unknown. Thus, estimates of false positive rates are essentially projective tests of gullibility and cynicism.

To provide psychologists with scientific information about the false positive risk in their science, we need a scientific method that can estimate the false discovery risk based on actual data rather than hypothetical scenarios. There have been several attempts to do so. So far, the most prominent study was Leek and Jager’s (2014) estimate of the false discovery rate in medicine. They obtained an estimate of 14%. Simulation studies showed some problems with their estimation model, but the superior z-curve method replicated the original result with a false discovery risk of 13%. This result is much more in line with Cohen’s view that most null-hypotheses are false (typically effect sizes are not zero) than with Ioannidis’s claim that the null-hypothesis is true in 90% of all significance tests.

In psychology, the focus has been on replication rates. The shocking finding was that only 25% of significant results in social psychology could be replicated in an honest and unbiased attempt to reproduce the original study (Open Science Collaboration, 2015). This low replication rate leaves ample room for false positive results, but it is unclear how many of the non-significant results were caused by a true null-hypothesis and how many were caused by low statistical power to detect an effect size greater than zero. Thus, this project provides no information about the false positive risk in psychological science.

Another noteworthy project used a representative sample of test results in social psychology journals (Motyl et al., 2017). This project produced over 1,000 p-values that were examined using a number of statistical tools available at that time. The key result was that there was clear evidence of publication bias. That is, focal hypothesis tests nearly always rejected the null-hypothesis, a finding that has been observed since the beginning of social psychology (Sterling, 1959). However, the actual power of studies to do so was much lower, a finding that is consistent with Cohen’s (1962) seminal analysis of power. However, the results provided no information about the false positive risk. Yet, this valuable dataset could be analyzed with statistical tools that estimate the false discovery risk (Schimmack, 2021). However, the number of significant p-values was too small to produce an informative estimate of the false discovery risk (k = 678; 95%CI = .09 to .82).


A decade after the “False Positive Psychology” article rocked psychological science, it remains unclear how much false positive results contribute to replication failures in psychology. To answer this question, we report the results of a z-curve analysis of 1,857 significant p-values that were obtained from hand-coding a representative sample of studies published between 2009 and 2014. The years 2013 and 2014 were included to incorporate Motyl et al.’s data. All other coding efforts focused on the years 2009 and 2010, before concerns about replication failures could have changed research practices. In marked contrast to previous initiatives, the aim was to cover all areas of psychology. To obtain a broad range of disciplines in psychology, a list of 120 journals was compiled (Schimmack, 2021). These journals are the top journals of their disciplines with high impact factors. Students had some freedom in picking journals of their choice. For each journal, articles were selected based on a fixed sampling scheme that codes articles 1, 3, 6, and 10 of every set of 10 articles (1, 3, 6, 10, 11, 13, …). The project is ongoing, and the results reported below should be considered preliminary. Yet, they do present the first estimate of the false discovery risk in psychological science.
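The fixed sampling scheme can be sketched as follows (this function is my illustration of the scheme, not the actual coding script used in the project):

```python
def sampled_articles(n_articles):
    """Fixed sampling scheme: code articles 1, 3, 6, and 10
    of every consecutive block of 10 articles in a journal."""
    offsets = (1, 3, 6, 10)
    return [10 * block + off
            for block in range((n_articles + 9) // 10)
            for off in offsets
            if 10 * block + off <= n_articles]

print(sampled_articles(20))  # [1, 3, 6, 10, 11, 13, 16, 20]
```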

The results replicate many other findings that focal statistical tests are selected because they reject the null-hypothesis. Eighty-one percent of all tests had a p-value below .05. When marginally significant results are included as well, the observed discovery rate increases to 90%. However, the statistical power of studies does not warrant such high success rates. The z-curve estimate of mean power before selection for significance is only 31%; 95%CI = 19% to 37%. This statistic is called the expected discovery rate (EDR) because mean power is equivalent to the long-run percentage of significant results. Based on an insight by Soric (1989), we can use the EDR to quantify the maximum percentage of results that can be false positives, using the formula: FDR = (1/EDR – 1)*(alpha/(1-alpha)). The point estimate of the EDR of 31% corresponds to a point estimate of the False Discovery Risk of 12%. The 95%CI ranges from 8% to 28%. It is important to distinguish between the risk and rate of false positives. Soric’s method assumes that true hypotheses are tested with 100% power. This is an unrealistic assumption. When power is lower the false positive rate will be lower than the false positive risk. Thus, we can conclude from these results that it is unlikely that more than 25% of published significant results in psychology journals are false positive results.
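Soric's bound is easy to compute directly from the EDR estimate. A minimal sketch using the estimates reported here (the function name is mine):

```python
def soric_fdr(edr, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate:
    FDR = (1/EDR - 1) * (alpha / (1 - alpha))."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

# Point estimate: EDR = 31% implies a maximum FDR of about 12% at alpha = .05
print(round(soric_fdr(0.31) * 100))  # 12
```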

One concern about these results is that the number of test statistics differed across journals and that Motyl et al.’s large set of results from social psychology could have biased the results. We therefore also analyzed the data by journal and then computed the mean FDR and its 95%CI. This approach produced an even lower FDR estimate of 11%, 95%CI = 9% to 15%.

While a FDR of less than 25% may seem like good news in a field that is suffering from false positive paranoia, it is still unacceptably high to ensure that published results can be trusted. Fortunately, there is a simple solution to this problem because Soric’s formula shows that the false discovery risk depends on alpha. Lowering alpha to .01 is sufficient to produce a false discovery risk below 5%. Although this seems like a small adjustment, it results in the loss of 37% of significant results, namely those with p-values between .05 and .01. This recommendation is consistent with two papers that have argued against the blind use of Fisher’s alpha level of .05 (Benjamin et al., 2017; Lakens et al., 2018). The cost of lowering alpha further to .005 would be the loss of another 10% of significant findings (ODR = 47%).
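Because Soric's formula is monotone in alpha, the effect of lowering alpha can be tabulated directly; a minimal sketch, assuming the EDR point estimate of 31% reported above (the helper name is mine):

```python
def soric_fdr(edr, alpha):
    # Soric's (1989) maximum false discovery risk
    return (1 / edr - 1) * (alpha / (1 - alpha))

edr = 0.31
for alpha in (0.05, 0.01, 0.005):
    # .05 -> ~12%, .01 -> ~2%, .005 -> ~1%
    print(alpha, round(soric_fdr(edr, alpha), 3))
```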

Limitations and Future Directions

No study is perfect. As many women know, the first time is rarely the best time (Higgins et al., 2010). Similarly, this study has some limitations that need to be addressed in future studies.

The main limitation of this study is that the coded statistical tests may not be representative of psychological science. However, the random sampling from journals and the selection of a broad range of journals suggests that sampling bias has a relatively small effect on the results. A more serious problem is that there is likely to be heterogeneity across disciplines or even journals within disciplines. Larger samples are needed to test those moderator effects.

Another problem is that z-curve estimates of the EDR and FDR make assumptions about the selection process that may differ from the actual selection process. The best way to address this problem is to promote open science practices that reduce the selective publishing of statistically significant results.

Eventually, it will be necessary to conduct empirical tests with a representative sample of results published in psychology, akin to the reproducibility project (Open Science Collaboration, 2015). As a first step, studies can be replicated with the original sample sizes. Results that are successfully replicated do not require further investigation. Replication failures need to be followed up with studies that can provide evidence for the null-hypothesis using equivalence testing with a minimum effect size that would be relevant (Lakens, Scheel, & Isager, 2018). This is the only way to estimate the false positive risk by means of replication studies.

Implications: What Would Cohen Say

The finding that most published results are not false may sound like good news for psychology. However, Cohen would point out that a low rate of false positive results merely reflects the fact that the nil-hypothesis is rarely true. If some hypotheses were true and others were false, NHST (without QRPs) could be used to distinguish between them. However, if most effect sizes are greater than zero, not much is learned from statistical significance. The problem is not p-values or dichotomous thinking. The problem is that nobody tests the risky hypothesis that an effect size is of a minimum size and decides in favor of the null-hypothesis when the data show that the population effect size is not exactly zero, but practically meaningless (e.g., experimental ego-depletion effects are less than 1/10th of a standard deviation). Even specifying H0 as r < .05 or d < .01 would lower discovery rates and increase the false discovery risk, while increasing the value of a statistically significant result.

Cohen’s clear distinction between the null-hypothesis and the nil-hypothesis made it clear that nil-hypothesis testing is a ritual with little scientific value, while null-hypothesis testing is needed to advance psychological science. The past decade has been a distraction by suggesting that nil-hypothesis testing is meaningful, but only if open science practices are used to prevent false positive results. However, open science practices do not change the fundamental problem of nil-hypothesis testing that Cohen and others identified more than two decades ago. It is often said that science is self-correcting, but psychologists have not corrected the way they formulate their hypotheses. If psychology wants to be a science, psychologists need to specify hypotheses that are worthy of empirical falsification. I am getting too old and cynical (much like my hero Cohen in the 1990s) to believe in change in my lifetime, but I can write this message in a bottle and hope that one day a new generation may find it and do something with it.

Open Science: Inside Peer-Review at PSPB

We submitted a ms. that showed problems with the validity of the race IAT as a measure of African Americans’ unconscious attitudes to PSPB (Schimmack & Howard, 2020). After waiting patiently for three months, we received the following decision letter from the acting editor Dr. Corinne Moss-Racusin at Personality and Social Psychology Bulletin. She assures us that she independently read our manuscript carefully – twice; once before and once after reading the reviews. This is admirable. Yet it is surprising that her independent reading of our manuscript places her in strong agreement with the reviewers. Somebody with less research experience might feel devastated by the independent evaluation by three experts that our work is “of low quality.” Fortunately, it is possible to evaluate the contribution of our manuscript from another independent perspective, namely the strength of the science.

The key claim of our ms. is simple. John Jost, Brian Nosek, and Mahzarin Banaji wrote a highly cited article that contained the claim that a large percentage of members of disadvantaged groups have an implicit preference for the out-group. As recently as 2019, Jost repeated this claim and used the term self-hatred to refer to implicit preferences for the in-group (Jost, 2019).

We expressed our doubt about this claim when the disadvantaged group is African Americans. Our main concern was that any claims about African Americans’ implicit preferences require a valid measure of African Americans’ preferences. The claim that a large number of African Americans have an implicit preference for the White outgroup rests entirely on results obtained with the Implicit Association Test (Jost, Nosek, & Banaji, 2004). However, since the 2004 publication, the validity of the race IAT as a measure of implicit preferences has been questioned in numerous publications, including my recent demonstration that implicit and explicit measures of prejudice lack discriminant validity (Schimmack, 2021). Even the authors of the IAT no longer support the claim that the race IAT is a measure of some implicit, hidden attitudes (Greenwald & Banaji, 2017). Aside from revisiting Jost et al.’s (2004) findings in light of doubts about the race IAT, we also conducted the first attempt at validating the race IAT for Black participants. Apparently, reading the article twice did not help the acting editor of PSPB to notice this new empirical contribution, even though it is highlighted in Figure 2. The key finding here is that we were able to identify an in-group preference factor because several explicit and implicit measures showed convergent validity (factor ig). For example, the evaluative priming task showed some validity with a factor loading of .42 in the Black sample. However, the race IAT failed to show any relationship with the in-group factor (p > .05). It was also unrelated to the out-group factor. Thus, the race IAT lacks convergent validity as a measure of in-group and out-group preferences among African Americans in this sample. Neither the two reviewers nor the acting editor challenged this finding. They do not even comment on it. Instead, they proclaim that this research is of low quality. I beg to differ.
Based on any sensible understanding of the scientific method, it is unscientific to make claims about African Americans’ preferences based on a measure that has not been validated. It is even more unscientific to double down on a false claim when evidence is presented that the measure lacks validity.

Of course, one can question whether PSPB should publish this finding. After all, PSPB prides itself on being the flagship journal of the Society for Personality and Social Psychology (Robinson et al., 2021). Maybe valid measurement of African Americans’ attitudes is not relevant enough to meet the high standards of a 20% acceptance rate. However, Robinson et al. (2021) launched a diversity initiative in response to awareness that psychology has a diversity problem.

Maybe it will take some time before PSPB can find some associate editors to handle manuscripts that address diversity issues and are concerned with the well-being of African Americans. Meanwhile, we are going to find another outlet to publish our critique of Jost and colleagues’ unscientific claim that many African Americans hold negative views of their in-group that they are not aware of and that can only be revealed by their scores on the race IAT.

Editorial Decision Letter from PSPB

Re: “The race Implicit Association Test is Biased: Most African Americans Have Positive Attitudes towards their In-Group” (MS # PSPB-21-365)

Dear Dr. Schimmack:

Thank you for submitting your manuscript for consideration to Personality and Social Psychology Bulletin. I would like to apologize for the slight delay in getting this decision out to you. Both of my very young children have been home with me for the past month, due to Covid exposures at their schools. As their primary caregiver, this has created considerable difficulties. I appreciate your understanding as we all work to navigate these difficult and unprecedented times.

I have now obtained evaluations of the paper from two experts who are well-qualified to review work in this area.  Furthermore, I read your paper carefully and independently, both before and after looking at the reviews.

I found the topic of your work to be important and timely—indeed, I read the current paper with great interest. Disentangling in-group and out-group racial biases, among both White and Black participants (within the broader context of exploring System Justification Theory) is a compelling goal. Further, I strongly agree with you that exploring whether Black participants’ in-group attitudes have been systematically misrepresented by the (majority White) scientific community is of critical importance.

Unfortunately, as you will see, both reviewers have significant, well-articulated concerns that prevent them from supporting publication of the manuscript. For example, reviewer 1 stated that “Overall, I found this article to be of low quality. It argues against an argument that researchers haven’t made and landed on conclusions that their data doesn’t support.” Further, reviewer 2 (whose review is appropriately signed) wrote clearly that, “The purpose of this submission, it seems to me, is not to illuminate anything, really, and indeed very little, if anything, is illuminated. The purpose of the paper, it seems, is to create the appearance of something scandalous and awful and perhaps even racist in the research literature when, in fact, the substantive results obtained here are very similar to what has been found before. And if the authors really want to declare that the race-based IAT is a completely useless measure, they have a lot more work to do than re-analyzing previously published data from one relatively small study.”

See Reviewer 2’s comments and my response here

My own reading of your paper places me in strong agreement with the reviewer’s evaluations. I am sorry to report that I will not be able to accept your paper for publication in PSPB.

The reviewers’ comments are, in my several years of experience as an editor, unusually thorough and detailed. Thus, I will not reiterate them here.  Nevertheless, issues of primary concern involved both conceptual and empirical aspects of the manuscript. Although some of these issues might be addressed, to some degree, with some considerable re-thinking and re-writing, many cannot be addressed without more data and theoretical overhaul.

I was struck by the degree to which claims appear to stray quite far from both the published literature and the data at hand. As just one example, the section on “African American’s Resilience in a Culture of Oppression” (pp. 5-6) cites no published work whatsoever. Rather, you note that your skepticism regarding key components of SJT is based on “the lived experience of the second author,” which you then summarize. While individual case studies such as this can certainly be compelling, there are clear questions pertaining to generalizability and scientific merit, and the inability to independently validate or confirm this anecdotal evidence. While you do briefly acknowledge this, you proceed to make broad claims—such as “No one in her family or among her Black friends showed signs that they preferred to be White or like White people more than Black people. In small towns, the lives of Black and White people are more similar than in big cities. Therefore, the White out-group was not all that different from the Black in-group,” again without citing any evidence. I found it problematic to ground these bold claims and critiques largely in anecdote. Further, this raises serious concerns—as reviewer 2 articulates in some detail—that the current work may distort the current state of the science by exaggerating or mischaracterizing the nature of existing claims.

Let me say this clearly: I am strongly in favor of work that attempts to refine existing theoretical perspectives, and/or critique established methods, measures, and paradigms. I am not an IAT “purist” by any stretch, nor has my own recent work consistently included implicit measures. Indeed, as noted above, I read the current work with great interest and openness. Unfortunately, like both reviewers, I cannot support its publication in the current form.

I would sincerely encourage you to consider whether the future of this line of work could involve 1. Additional experiments, 2. Larger and more diverse samples, 3. True and transparent collaboration (whether “adversarial” or not) with colleagues from different ideological/empirical perspectives, and 4. Ensuring that claims align much more closely to what is narrowly warranted by the data at hand. Unfortunately, as it stands, the potential contributions of this work appear to be far overshadowed by its problematic elements.

I understand that you will likely be disappointed by my decision, but I urge you to pay careful attention to the reviewers’ constructive comments, as they may help you revise this manuscript or design further research.  Please understand that my decision was rendered with the recognition that the page limitations of the journal dictate that only a small percentage of submitted manuscripts can be accepted.  PSPB receives more than 700 submissions per year, but only publishes approximately 125 papers each year.  Papers without major flaws are often not accepted by PSPB because the magnitude of the contribution is not sufficient to warrant publication.  With careful revision, I think this paper might be appropriate for a more specialized journal, and I wish you success in finding an appropriate outlet for your work.

I am sorry that I cannot provide a more favorable response to your submission.  However, I do hope that you will again consider PSPB as your research progresses.

P.S. I asked the acting editor to clarify her comments and her views about the validity of the race IAT as a measure of African Americans’ unconscious preferences. They declined to comment further.

Inside Anonymous Peer Review

After a desk-rejection from JPSP, my co-author and I submitted our ms. to PSPB (see the previous post). After several months, we received the expected rejection. But it was not all in vain. We received a detailed review that shows how little social psychologists really care about African Americans, even when they claim to study racism and discrimination.

As peer-reviews are considered copyrighted material belonging to the reviewer, I cannot share the review in full. Rather I will highlight important sections that show how little authors with the authority of an expert reviewer pay attention to inconvenient scientific criticism of their work.

Here is the key issue. Our paper provides new evidence that the race IAT is an invalid measure of African Americans’ attitudes towards their own group and the White out-group. This new evidence is based on a reanalysis of the data that were used by Bar-Anan and Nosek (2014) to claim that the race IAT is the most valid measure to study African Americans’ implicit attitudes. Here is what the reviewer had to say about this.

(6) It has been a while since I read the Bar-Anan and Nosek (2014) article, but my memory for it is incompatible with the claim that those authors were foolish enough to simply assume that the most valid implicit measures was the one that produced the biggest difference between Whites and Blacks in terms of in-group bias, as the present authors claim (pp. 7-8).


So, the reviewer relies on his foggy memory to question our claim instead of retrieving a pdf file and checking for himself. New York University should be proud of this display of scholarship. I hope Jost made sure to get his Publons credit. Here is the relevant section from Bar-Anan and Nosek (2014, p. 675):

A lazy recollection is used to dismiss the results of a new statistical analysis. This is how closed, confidential, back-room peer-review works, which means it does not work. It does not serve the purpose of presenting all scientific arguments in the open and letting data decide between opposing views. Pre-publication peer-review is not a reliable and credible mechanism to advance science. For this reason, I will publish as much as possible in open-peer-review journals (e.g., Meta-Psychology). Open science without an open exchange of ideas and conflicts is not open, trustworthy, or credible.

Psychology Intelligence Agency

I always wanted to be James Bond, but being 55 now it is clear that I will never get a license to kill or work for a government intelligence agency. However, the world has changed and there are other ways to spy on dirty secrets of evil villains.

I have started to focus on the world of psychological science, which I know fairly well because I was a psychological scientist for many years. During my time as a psychologist, I learned about many of the dirty tricks that psychologists use to publish articles to further their careers without advancing understanding of human behavior, thoughts, and feelings.

However, so far the general public, government agencies, or government funding agencies that hand out taxpayers’ money to psychological scientists have not bothered to monitor the practices of psychological scientists. They still believe that psychological scientists can control themselves (e.g., peer review). As a result, bad practices persist because the incentives favor behaviors that lead to publication of many articles even if these articles make no real contribution to science. I therefore decided to create my own Psychological Intelligence Agency (PIA). Of course, I cannot give myself a license to kill, and I have no legal authority to enforce laws that do not exist. However, I can gather intelligence (information) and share this information with the general public. This is less James Bond and more CIA that also shares some of its intelligence with the public (CIA factbook), or the website Retraction Watch that keeps track of article retractions.

Some of the projects that I have started are:

Replicability Rankings of Psychology Journals
Keeping track of the power (expected discovery rate, expected replication rate) and the false discovery risk of test results published in over 100 psychology journals from 2010 to 2020.

Personalized Criteria of Statistical Significance
It is problematic to use the standard criterion of significance (alpha = .05) when this criterion leads to few discoveries because researchers test many false hypotheses or test true hypotheses with low power. When discovery rates are low, alpha should be set to a lower value (e.g., .01, .005, .001). Here, I use estimates of authors’ discovery rates to recommend an appropriate alpha level for interpreting their results.

Quantitative Book Reviews
Popular psychology books written by psychological scientists (e.g., Nobel Laureate Daniel Kahneman) reach a wide audience and are assumed to be based on solid scientific evidence. Using statistical examinations of the sources cited in these books, I provide information about the robustness of the scientific evidence to the general public. (see also “Before you know it“)

Citation Watch
Science is supposed to be self-correcting. However, psychological scientists often cite outdated references that fit their theory without citing newer evidence that their claims may be false (a practice known as cherry-picking citations). Citation Watch reveals these bad practices by linking articles with misleading citations to articles that question the claims supported by cherry-picked citations.

Whether all of this intelligence gathering will have a positive effect depends on how many people actually care about the scientific integrity of psychological science and the credibility of empirical claims. Fortunately, some psychologists are willing to learn from past mistakes and are improving their research practices (Bill von Hippel).


What would Cohen say to 184 Significance Tests in 1 Article

I was fortunate enough to read Jacob Cohen’s articles early on in my career to avoid many of the issues that plague psychological science. One of his important lessons was that it is better to test a few (or better one) hypothesis in one large sample (Cohen, 1990) than to conduct many tests in small samples.

The reason is simple. Even if a theory makes a correct prediction, sampling error may produce a non-significant result, especially in small samples where sampling error is large. This type of error is known as a type-II error, beta, or a false negative. The probability of obtaining the desired and correct outcome of a significant result when a hypothesis is true is called power. The problem with testing multiple hypotheses is that the cumulative or total power of finding evidence for all correct hypotheses decreases with the number of tests. Even if a single test has 80% power (i.e., the probability of a significant result for a correct hypothesis is 80 percent), the probability of providing evidence for all 10 correct hypotheses is only .8^10 = 11%. The expected value is that 2 of the 10 tests produce a type-II error (Schimmack, 2012).
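The arithmetic can be checked in a few lines (a sketch; the function names are mine):

```python
def total_power(power, k):
    # Probability that ALL k independent tests of true hypotheses are significant
    return power ** k

def expected_type2_errors(power, k):
    # Expected number of non-significant results among k true hypotheses
    return k * (1 - power)

print(round(total_power(0.8, 10), 3))          # 0.107 -> only ~11%
print(round(expected_type2_errors(0.8, 10)))   # 2
```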

Cohen (1962) also noted that the average power of statistical tests is well below 80%. For a medium/average effect size, power was around 50%. Now imagine that a researcher tests 10 true hypotheses with 50% power. The expected value is that 5 tests produce a significant result (p < .05) and 5 tests produce a type-II error (p > .05). The interpretation of the article will focus on the significant results, but they were selected basically by a coin flip. The next study will produce a different set of 5 significant results.

To avoid type-II errors, researchers could conduct a priori power analyses to ensure that they have enough power. However, this is rarely done, with the explanation that a priori power analysis requires knowledge about the population effect size, which is unknown. However, it is possible to estimate the typical power of studies by keeping track of the percentage of significant results. Because power determines the rate of significant results, the rate of significant results is an estimate of average power. The main problem with this simple method of estimating power is that researchers often do not report all of their results. Especially before the replication crisis became apparent, psychologists tended to publish only significant results. As a result, it is largely unknown how much power actual studies in psychology have and whether power has increased since Cohen (1962) estimated it to be around 50%.

Here I illustrate a simple way to estimate actual power of studies with a recent multi-study article that reported a total of 184 significance tests (more were reported in a supplement, but were not coded)! Evidently, Cohen’s important insights remain neglected, especially in journals that pride themselves on rigorous examination of hypotheses (Kardas, Kumar, & Epley, 2021).

Figure 2 shows the first rows of the coding spreadsheet (Spreadsheet).

Each row shows one specific statistical test. The column “HO rejected” reflects how the authors interpreted a result. Broadly, this decision is based on the p < .05 rule, but sometimes authors are willing to treat values just above .05 as sufficient evidence, which is often called marginal significance. The column p < .05 strictly follows the p < .05 rule. The averages in the top row show that there are 77% significant results using the authors’ rules and 71% using the strict p < .05 rule. This shows that 6% of the p-values were interpreted as marginally significant.

All test values or point estimates with confidence intervals are converted into exact two-sided p-values. The two-sided p-values are then converted into z-scores using the inverse normal formula; z = -qnorm(p/2). Observed power is then estimated for the standard criterion of significance, alpha = .05, which corresponds to a z-score of 1.96. The formula for observed power is pnorm(z, 1.96). The top row shows that mean observed power is 69%. This is close to the 71% with the strict p < .05 rule, but a bit lower than the 77% when marginally significant results are included. This simple comparison shows that marginally significant results inflate the percentage of significant results.
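The same conversions can be reproduced outside of R; a minimal stdlib-only Python sketch (the normal quantile is computed by bisection for simplicity, and the function names mirror their R counterparts):

```python
from math import erf, sqrt

def pnorm(x):
    # Standard normal CDF (equivalent to R's pnorm)
    return 0.5 * (1 + erf(x / sqrt(2)))

def qnorm(p):
    # Standard normal quantile via bisection (equivalent to R's qnorm)
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if pnorm(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def observed_power(p_two_sided, alpha=0.05):
    z = qnorm(1 - p_two_sided / 2)   # R: z <- -qnorm(p/2)
    z_crit = qnorm(1 - alpha / 2)    # 1.96 for alpha = .05
    return pnorm(z - z_crit)         # R: pnorm(z, 1.96)

# A result at exactly p = .05 has 50% observed power
print(round(observed_power(0.05), 2))  # 0.5
```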

The inflation column keeps track of the consistency between the outcome of a significance test and the power estimate. When power is practically 1, a significant result is expected and inflation is zero. However, when power is only 60%, there is a 40% chance of a type-II error and authors were lucky if they got a significant result. This can happen in a single test, but not in the long run. Average inflation is a measure of how lucky authors were if they got more significant results than the power of their studies allows. Using the authors’ 77% success rate and the estimated power of 69%, we get an inflation of 8%. This is a small bias, and we already saw that the interpretation of marginal results accounts for most of it.

The last column is called the Replication Index (R-Index). It simply subtracts the inflation from the observed power estimate. The reason is that observed power is an inflated estimate of power when there are too many significant results. The R-Index is called an index because the formula is just an approximate correction for selection for significance. Later I show the results with a better method. However, the Index can clearly distinguish between junk science (R-Index below 50) and credible evidence. Based on the present results, the R-Index of 62 shows that the article reported some credible findings. Moreover, the R-Index now underestimates power because the rate of p-values below .05 is consistent with observed power. The inflation is just due to the interpretation of marginal results as significant. In short, the main conclusion from this simple analysis of test statistics in a single article is that the authors conducted studies with an average power of about 70%. This is expected to produce type-II errors, sometimes with p-values close to .05 and sometimes with p-values well above .1. This could mean that nearly a quarter of the published results are type-II errors.
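The R-Index computation itself is just two subtractions; a sketch with the values reported above (the exact published value depends on unrounded inputs):

```python
def r_index(success_rate, mean_observed_power):
    # Inflation: excess of reported successes over what power warrants
    inflation = success_rate - mean_observed_power
    # R-Index: observed power penalized by the inflation
    return mean_observed_power - inflation

# 77% success rate (authors' rule) and 69% mean observed power
print(round(r_index(0.77, 0.69), 2))  # 0.61 (the article reports 62 from unrounded values)
```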

But What About Type-I Errors?

Cohen was concerned about the problem that many underpowered studies fail to reject true hypotheses. However, the replication crisis shifted the focus from false negative results to false-positive results. An influential article by Simmons et al. (2011) suggested that many if not most published results might be false positive results. The authors also developed a statistical tool that examines whether a set of significant results is entirely based on false positive results called p-curve. The next figure shows the output of the p-curve app for the 130 significant results (only significant results are considered because p-values greater than .05 cannot be false positives).

The graph shows that there are a lot more p-values below .01 (78%) than p-values between .04 and .05 (2%). This distribution of p-values is inconsistent with the hypothesis that all significant results are false positives. In addition, the program estimates that the average power of the 130 studies with significant results is 99%! Because false positive results have only 5% power (alpha), an estimate of 99% implies that there can be virtually no false positives. It is noteworthy that the p-curve analysis did not spot the inflation of significant results by the interpretation of marginally significant results, because these results are omitted from the p-curve analysis. It is rather unlikely that the average power of studies is 99%. In fact, simulation studies have shown that the power estimates of p-curve are often inflated when studies are heterogeneous (Brunner, 2018; Brunner & Schimmack, 2020). The p-curve authors are aware of this bug, but have done nothing to fix it (Datacolada, 2018).

A better statistical method to analyze p-values is z-curve, which relies on the z-scores that were obtained from the p-values in the spreadsheet. However, the z-curve package for R can also read p-values. The next Figure shows a histogram of all 184 (significant and non-significant) values up to a value of 6. Values over 6 are not shown and are all treated as studies with perfect power.

The expected discovery rate corresponds to the power estimate in p-curve. It is notably lower than 99% and the 95%CI excludes a value of 99%. This finding simply shows once again that p-curve estimates are inflated.

The observed discovery rate is simply the same percentage that was computed on the spreadsheet using a strict p < .05 rule. The expected discovery rate is an estimate of the average power for all studies, including non-significant results that is corrected for any potential inflation. It is 62%, which matches the R-Index in the spreadsheet.

The comparison of the observed discovery rate of 71% and the expected discovery rate of 62% suggests that there is some overreporting of significant results. However, the 95%CI around the EDR estimate ranges from 27% to 88%. Thus, sampling error alone may explain this discrepancy.

An EDR of 62% implies that only a small number of significant results can be false positives. The point estimate is just 2%, but the 95%CI allows for up to 14% false positives. Thus, the reported results are unlikely to be false positives, but effect sizes could be inflated because selection for significance with modest power inflates effect size estimates.

There is also notable evidence of heterogeneity. The distribution of z-scores is much flatter than a standard normal distribution that is expected if all studies had the same power. This means that some results might be more credible than others. Therefore I conducted some moderator analyses.

One key hypothesis in the article was that shallow and deep conversations differ in important ways. Several studies tested this by comparing shallow and deep conversations. Fifty-four analyses included a contrast between shallow and deep conversations as a main effect or in an interaction. The expected replication rate is unchanged. The expected discovery rate is a bit higher, but surprisingly, the observed discovery rate is lower. Visual inspection of the z-curve plot shows an unusually high number of marginally significant results. This is further evidence to distrust marginally significant results. However, overall these results suggest that shallow and deep conversations differ.

Several analyses tested mediation, which can require large samples to have adequate power. Not surprisingly, the 39 mediation tests have only a replication rate of 53%. There is also some suggestion of bias, with an observed discovery rate of 51% and an expected discovery rate of only 25%, but the 95%CI around the point estimate is wide and includes 51%. The low expected discovery rate implies that the false discovery risk is 16%, which is unacceptably high.

One solution to the high false discovery risk is to lower the criterion for significance. The next conventional level is alpha = .01. The next figure shows the results for this criterion (the red solid line has moved to z = 2.58).

Now the observed discovery rate is in line with the expected discovery rate (28% vs. 27%) and the false discovery risk has been lowered to 3%. However, the expected replication rate (for alpha = .01) is only 36%. Thus, follow-up studies need to increase sample sizes to replicate these mediation effects.
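The 3% figure follows directly from Soric's formula applied with the EDR of 27% and alpha = .01; a quick check (the helper name `soric_fdr` is mine):

```python
def soric_fdr(edr, alpha):
    # Soric's (1989) maximum false discovery risk
    return (1 / edr - 1) * (alpha / (1 - alpha))

# Mediation tests re-analyzed with alpha = .01 (EDR = 27%)
print(round(soric_fdr(0.27, 0.01), 3))  # 0.027, i.e., roughly 3%
```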


A post-hoc power-analysis of this recent article shows that psychologists still have not learned Cohen’s lesson that he shared in 1990 (more than 30 years ago). Conducting many significance tests with modest statistical power produces a confusing pattern of significant and non-significant results that is strongly influenced by sampling error. Rather than reporting results of individual studies, the authors should have reported meta-analytic results for tests of the same hypothesis. However, to end on a positive note, the studies are not p-hacked and the risk of false positives is low. Thus, the results provide some credible findings that can be used to conduct confirmatory tests of the hypothesis that deeper conversations are more awkward, but also more rewarding. I hope these analyses show that a deep dive into the statistical results reported in an article can also be rewarding.