Dr. Ulrich Schimmack Blogs about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY

In empirical studies with sampling error, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication of the original study with the same sample size and significance criterion (Schimmack, 2017).
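
To make this definition concrete: the probability of a successful exact replication is simply the statistical power of the replication study. The following minimal sketch (my illustration, not the z-curve method discussed below) assumes that the true effect size, expressed in standard-error units (mu), is known:

```python
# Minimal illustration of the definition above (not the z-curve method itself):
# if a study's true effect corresponds to a noncentrality of mu standard errors,
# the probability that an exact replication with the same sample size again
# yields p < .05 (two-sided) is the replication study's statistical power.
from scipy.stats import norm

def replication_probability(mu, alpha=0.05):
    """Power of an exact replication for a two-sided z-test with true noncentrality mu."""
    z_crit = norm.isf(alpha / 2)                      # 1.96 for alpha = .05
    return norm.sf(z_crit - mu) + norm.cdf(-z_crit - mu)

# A study powered at 80% (mu = 2.80) has an ~80% chance of replicating;
# a just-significant result from a true effect of mu = 1.96 replicates only ~50% of the time.
print(round(replication_probability(2.80), 2))  # ~0.80
print(round(replication_probability(1.96), 2))  # ~0.50
```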

See Reference List at the end for peer-reviewed publications.

Mission Statement

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.

To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools, such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0: Brunner & Schimmack, 2020; Version 2.0: Bartos & Schimmack, 2021).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science. 

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” rest on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects by John Bargh also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I also started providing information about the replicability of individual researchers and guidelines for how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure what they are supposed to measure. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in my story how I ended up becoming a meta-critic of psychological science, you can read it here (my journey). 

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1-22
https://doi.org/10.15626/MP.2018.874

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566
http://dx.doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. 
https://doi.org/10.1037/cap0000246

Replicability Rankings of Psychological Science 2021

A major replication project of 100 studies published in 2008 showed that only 37% of published significant results could be replicated (i.e., the replication study also produced a significant result). This finding has raised concerns about the replicability of psychological science (OSC, 2015).

Conducting actual replication studies to examine the credibility of published results is effortful and costly. Therefore, my colleagues and I developed a cheaper and faster method to estimate the replicability of published results on the basis of published test statistics (Schimmack, 2022). The method produces two estimates of replicability that represent the best possible and the worst possible scenario. The expected replication rate (ERR) assumes that it is possible to replicate studies exactly. When this is not the case, selection for significance and regression to the mean will lead to lower success rates in actual replication studies. The expected discovery rate (EDR) is an estimate of the percentage of significant results among all of the statistical tests that researchers conduct in their laboratories, including those that are never published. If selection for significance is ineffective in reducing the risk of false positive results or in selecting studies with more power, replication studies are expected to be no more successful than this discovery rate (Brunner & Schimmack, 2020). In the absence of any further information, I am using the average of the EDR and ERR as the best prediction of the outcome of actual replication studies. I call this index the Actual Replicability Prediction (ARP). Whereas previous rankings relied exclusively on the ERR, the 2021 rankings start using the ARP to rank the replicability of journals.
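
For readers who want to check the arithmetic, the ARP is simply the midpoint of the two z-curve estimates; the top-ranked journal in the rankings below serves as an example:

```python
# The ARP is the midpoint of the worst-case (EDR) and best-case (ERR)
# replicability estimates produced by a z-curve analysis.
def actual_replicability_prediction(edr, err):
    return (edr + err) / 2

# Example from the 2021 rankings below: Journal of Individual Differences
# with EDR = 96 and ERR = 97 yields ARP = 96.5, which is rounded to 97.
print(actual_replicability_prediction(96, 97))  # 96.5
```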

Figure 1 shows the average ARP, ERR, and EDR for a broad range of psychology journals. Given the large number of test statistics in each year (k > 100,000), the estimates are very precise. The dashed lines show the 95%CI around the linear trend line. The results show that the ARP has increased from around 50% to slightly under 60%. This finding shows that results published in psychological journals have become a bit more replicable, although this prediction needs to be verified with actual replication studies.

However, the increase is not uniform across journals. Whereas some journals in social psychology showed big increases, other journals showed no change. The big increases in social psychology are partly due to the very low replication rates in this field before 2015 (OSC, 2015). For readers of journals, changes over time are less important than the current replicability estimates. Table 1 shows the rankings of journals. Predicted replication rates range from an astonishing 97% for the Journal of Individual Differences to a disappointing 37% for the Annals of Behavioral Medicine. Of course, results for 2021 are influenced by sampling error. More detailed information about previous years and trends can be found by clicking on the journal name.

For now, you can compare these results to previous results using prior rankings from 2020, 2019, or 2018 (these posts only report the ERR).

Rank | Journal | ARP 2021 | EDR 2021 | ERR 2021
1 | Journal of Individual Differences | 97 | 96 | 97
2 | Journal of Occupational Health Psychology | 91 | 91 | 91
3 | JPSP-Personality Processes and Individual Differences | 81 | 76 | 86
4 | Archives of Sexual Behavior | 78 | 74 | 81
5 | JPSP-Interpersonal Relationships and Group Processes | 78 | 75 | 82
6 | Sex Roles | 76 | 73 | 79
7 | Aggressive Behaviours | 74 | 66 | 82
8 | Evolutionary Psychology | 74 | 70 | 78
9 | Social Psychology | 74 | 69 | 79
10 | Journal of Memory and Language | 73 | 67 | 79
11 | Perception | 73 | 69 | 77
12 | Self and Identity | 73 | 69 | 77
13 | Journal of Organizational Psychology | 71 | 56 | 86
14 | Attention, Perception and Psychophysics | 70 | 62 | 77
15 | European Journal of Developmental Psychology | 70 | 67 | 74
16 | Journal of Child Psychology and Psychiatry and Allied Disciplines | 70 | 58 | 82
17 | Law and Human Behavior | 70 | 62 | 78
18 | Psychology of Religion and Spirituality | 70 | 67 | 73
19 | Cognition | 69 | 60 | 78
20 | Judgment and Decision Making | 69 | 56 | 83
21 | Political Psychology | 69 | 58 | 80
22 | Journal of Family Psychology | 68 | 63 | 72
23 | Journal of Abnormal Psychology | 66 | 58 | 75
24 | J. of Exp. Psychology – Learning, Memory & Cognition | 66 | 57 | 74
25 | J. of Exp. Psychology – Human Perception and Performance | 65 | 53 | 77
26 | Journal of Comparative Psychology | 64 | 44 | 85
27 | Behavior Therapy | 63 | 50 | 77
28 | Journal of Research on Adolescence | 63 | 56 | 69
29 | Quarterly Journal of Experimental Psychology | 63 | 51 | 76
30 | British Journal of Developmental Psychology | 62 | 48 | 76
31 | Personality and Individual Differences | 62 | 46 | 78
32 | Psychology and Aging | 62 | 49 | 75
33 | Psychonomic Bulletin and Review | 62 | 49 | 75
34 | Psychological Science | 62 | 44 | 79
35 | Acta Psychologica | 61 | 46 | 76
36 | Cognition and Emotion | 61 | 44 | 78
37 | Journal of Sex Research | 61 | 38 | 84
38 | Child Development | 60 | 46 | 75
39 | Cognitive Development | 60 | 48 | 71
40 | European Journal of Social Psychology | 60 | 45 | 76
41 | Evolution & Human Behavior | 59 | 36 | 81
42 | Journal of Experimental Psychology – Applied | 59 | 44 | 73
43 | Journal of Experimental Child Psychology | 59 | 46 | 71
44 | JPSP-Attitudes & Social Cognition | 59 | 43 | 75
45 | Memory and Cognition | 59 | 38 | 80
46 | Cognitive Therapy and Research | 58 | 47 | 69
47 | Journal of Experimental Psychology – General | 58 | 40 | 76
48 | Journal of Experimental Social Psychology | 58 | 39 | 78
49 | Journal of Health Psychology | 58 | 42 | 74
50 | Personality and Social Psychology Bulletin | 58 | 38 | 78
51 | Social Development | 58 | 46 | 70
52 | Journal of Nonverbal Behavior | 57 | 36 | 78
53 | Motivation and Emotion | 57 | 44 | 69
54 | Psychoneuroendocrinology | 57 | 41 | 73
55 | Social Psychological and Personality Science | 57 | 36 | 78
56 | Cognitive Behaviour Therapy | 56 | 40 | 72
57 | Developmental Psychology | 56 | 43 | 69
58 | Frontiers in Psychology | 56 | 42 | 70
59 | Consciousness and Cognition | 55 | 37 | 73
60 | Journal of Applied Psychology | 55 | 39 | 70
61 | Journal of Behavioral Decision Making | 55 | 30 | 79
62 | Journal of Cross-Cultural Psychology | 55 | 39 | 70
63 | Journal of Cognition and Development | 55 | 36 | 73
64 | Psychophysiology | 55 | 40 | 70
65 | Addictive Behaviors | 54 | 33 | 76
66 | Asian Journal of Social Psychology | 54 | 36 | 72
67 | International Journal of Psychophysiology | 54 | 40 | 68
68 | British Journal of Psychology | 53 | 38 | 68
69 | Emotion | 53 | 34 | 72
70 | Frontiers in Behavioral Neuroscience | 53 | 43 | 63
71 | Journal of Affective Disorders | 53 | 31 | 75
72 | Journal of Child and Family Studies | 53 | 39 | 67
73 | Journal of Cognitive Psychology | 53 | 31 | 75
74 | Journal of Research in Personality | 53 | 31 | 74
75 | Memory | 53 | 33 | 73
76 | Psychology of Men and Masculinity | 53 | 34 | 72
77 | Psychology of Music | 53 | 41 | 66
78 | Appetite | 52 | 32 | 71
79 | British Journal of Social Psychology | 52 | 36 | 67
80 | Journal of Applied Social Psychology | 52 | 39 | 66
81 | Journal of Business and Psychology | 52 | 32 | 71
82 | Organizational Behavior and Human Decision Processes | 52 | 25 | 80
83 | Psychology and Marketing | 52 | 36 | 67
84 | Psychological Medicine | 52 | 42 | 63
85 | Animal Behavior | 51 | 28 | 75
86 | Behavioural Brain Research | 51 | 30 | 73
87 | Canadian Journal of Experimental Psychology | 51 | 25 | 77
88 | Journal of Personality | 51 | 23 | 80
89 | Journal of Religion and Health | 51 | 34 | 67
90 | Psychopharmacology | 51 | 34 | 68
91 | Group Processes & Intergroup Relations | 50 | 27 | 72
92 | Journal of Social and Personal Relationships | 50 | 27 | 73
93 | Biological Psychology | 49 | 32 | 67
94 | Depression & Anxiety | 49 | 33 | 65
95 | Experimental Psychology | 49 | 26 | 72
96 | Journal of Consumer Research | 49 | 37 | 61
97 | Journal of Educational Psychology | 49 | 23 | 76
98 | Journal of Youth and Adolescence | 49 | 27 | 72
99 | Behaviour Research and Therapy | 48 | 31 | 65
100 | Infancy | 48 | 24 | 72
101 | Journal of Consumer Behaviour | 48 | 22 | 74
102 | Developmental Psychobiology | 47 | 34 | 59
103 | Frontiers in Human Neuroscience | 47 | 30 | 64
104 | Journal of Consulting and Clinical Psychology | 47 | 33 | 61
105 | Hormones and Behavior | 46 | 28 | 63
106 | Journal of Anxiety Disorders | 46 | 20 | 73
107 | Journal of Positive Psychology | 46 | 24 | 67
108 | Cognitive Psychology | 45 | 22 | 67
109 | European Journal of Personality | 45 | 32 | 59
110 | Psychology Crime and Law | 45 | 19 | 71
111 | Developmental Science | 44 | 25 | 63
112 | Journal of Consumer Psychology | 44 | 18 | 70
113 | Journal of Social Psychology | 43 | 17 | 70
114 | Behavioral Neuroscience | 42 | 16 | 67
115 | Journal of Happiness Studies | 42 | 16 | 67
116 | Journal of Occupational and Organizational Psychology | 42 | 23 | 61
117 | Personal Relationships | 42 | 17 | 68
118 | Journal of Vocational Behavior | 41 | 17 | 65
119 | Health Psychology | 39 | 19 | 59
120 | Journal of Counseling Psychology | 38 | 16 | 61
121 | Annals of Behavioral Medicine | 37 | 19 | 54

If you liked this post, you might also be interested in “Estimating the False Discovery Risk in Psychological Science.”

More Madness than Genius: Meta-Traits of Personality

Digman (1997) published an article that aimed to explain the correlations among self-rating scales of the Big Five personality traits in terms of two orthogonal higher-order factors. One factor linked Extraversion and Openness; the other linked Emotional Stability (the opposite of Neuroticism), Agreeableness, and Conscientiousness.

This model has had relatively little influence on personality psychology, except for work by Colin DeYoung. The first article on the higher-order factors was published when he was a graduate student with his supervisor Jordan B. Peterson (DeYoung, Peterson, & Higgins, 2002).

In this article, the authors relabeled Digman’s factors as Stability (Emotional Stability, Agreeableness, & Conscientiousness) and Plasticity (Extraversion & Openness). They suggested that Stability is related to serotonin and Plasticity is related to dopamine.

“We present a biologically predicated model of these two personality factors, relating them to serotonergic and dopaminergic function, and we label them Stability (Emotional Stability, Agreeableness, and Conscientiousness) and Plasticity (Extraversion and Openness)” (p. 533).

The article, however, does not test relationships between biological markers of these neurotransmitter systems and variation in personality. In this regard, the article merely introduces a hypothesis, but does not provide empirical support for or against it. The only empirical evidence in support of the hypothesis would be that Big Five factors are actually related to each other in the way Digman proposed. Evidence to the contrary would falsify a biological model that predicts these relationships.

The main empirical prediction of the model is that Stability and Plasticity predict variation in self-ratings of conformity.

“Based on this model, we hypothesize that Stability will positively predict conformity (as indicated by socially desirable responding) and that Plasticity will negatively predict conformity” (p. 533).

The authors claim to have found support for this prediction.

“A structural equation model indicates that conformity is indeed positively related to Stability (university sample: b = 0.98; community sample: b = 0.69; P < 0.01 for both) and negatively related to Plasticity (university sample: b = -0.48, P < 0.07; community sample: b = -0.42, P < 0.05).”

Readers familiar with structural equation modeling may be surprised by the strong relationship between Stability and Conformity, especially in the student sample. A standardized parameter of .98 implies that these constructs are nearly perfectly correlated. Relationships of this magnitude are usually not a cause for celebration. They imply either a lack of discriminant validity (i.e., two measures actually measure the same construct) or model misspecification.

To understand what is going on in this study, it is helpful to inspect the actual pattern in the data. Fortunately, it was a common practice in personality psychology to share this information in the form of the raw correlation matrices even before open science became the norm in other fields of psychology. We can therefore inspect the published correlation matrix.

First, the two conformity measures (1. Impression Management, 2. Lie Scale) show a moderate correlation, r = .53, indicating that they measure a common construct.

Second, both conformity measures show sizeable correlations with the Stability traits: Emotional Stability/Neuroticism, r1 = -.37, .36, r2 = .24, -.31; Agreeableness, r1 = .33, .42, r2 = .36, .31; and Conscientiousness, r1 = .33, .38, r2 = .33, .39. In contrast, the conformity measures are unrelated to the Plasticity traits: Extraversion/Surgency, r1 = -.05, -.05, r2 = .03, .04, and Openness/Intellect, r1 = .01, -.10, r2 = .04, -.13. The latter finding raises concerns about the negative relationship between the Plasticity factor and the Conformity factor in DeYoung and Peterson’s model and, by extension, about the theory that predicted this negative relationship.

Third, we can examine the correlations among the Big Five measures. According to Digman’s model, Stability and Plasticity are expected to be independent. Accordingly, correlations between traits that belong to different meta-traits (e.g., Extraversion & Agreeableness or Emotional Stability & Openness) should be close to zero. Inspection of Table 1 shows that this is not the case. For example, TDA Surgency correlates r = .23 with TDA Agreeableness, r = .19 with TDA Conscientiousness, r = .16 with NEO Conscientiousness, and r = -.39 with NEO Neuroticism. These correlations need to be modeled to obtain a good-fitting model.

Fourth, we can examine whether the pattern of correlations confirms the key prediction of Digman’s model, namely that Stability traits should be more strongly correlated with each other than with Plasticity traits and vice versa. The comparison of these correlations follows Campbell and Fiske’s (1959) approach to examining convergent and discriminant validity. It is easy to see that the pattern of correlations does not fully support the predicted structure. For example, the Plasticity correlations of TDA Surgency with TDA Intellect, r = .21, and NEO Openness, r = .23, are weaker than its correlations with TDA Emotional Stability, r = .27, and NEO Neuroticism, r = -.39. Results like these raise concerns that the published model misrepresents the actual pattern in the data.

The published model is shown in Figure 2. As noted before, the high relationship between the Stability factor and the Conformity factor is a concern. A similar concern arises from the high loading of Extraversion on the Plasticity factor, b = .95. Accordingly, Plasticity is nearly identical with Extraversion.

It is well known that even well-fitting models do not prove that the proposed model generated the observed pattern of correlations. It is good practice to compare preferred models to plausible alternative models. Model comparison can be used to weed out bad models, but the winner may still not be the right model. That is, we can falsify false models, but we cannot verify the right model.

I first fitted a measurement model to the correlations among the Big Five indicators in Table 1. It is noteworthy that the authors were unable to fit a model to the data in Table 1.

“While it would have been an attractive possibility to use the two measures of each Big Five trait for Sample 1 in order to create a hierarchical factor model, with latent variables for Stability and Plasticity derived from latent variables for each of the Big Five, the many intercorrelations among the 10 Big Five scales rendered such a model impractical” (p. 542).

Their justification makes no sense to anybody who is familiar with structural equation modeling, and there are published models that use 2, 3, or 4 indicators per factor to create a measurement model of the Big Five (Anusic et al., 2009). To achieve satisfactory fit, it is necessary to allow for some secondary loadings and correlated residuals. These parameters reflect the fact that Big Five scales are impure indicators of the Big Five factors that are contaminated with specific item content. Purists may object to this exploratory approach, but they would then have to abandon the modeling exercise because a simple-structure model does not have satisfactory fit. Thus, the only way to proceed and to test the model is to modify it until it has adequate fit and to conduct further tests with better data in the future.

Modification of the measurement model was terminated when no major modification indices were present, chi2 < 10. Final model fit was acceptable, CFI = .989, RMSEA = .055.

All primary loadings were high, b > .7. All secondary loadings were below .3. Notable correlated residuals were present for TDA Conscientiousness (con) and TDA Agreeableness (agr), and for NEO Conscientiousness (neoc) and NEO Neuroticism (neon). Neuroticism was reverse scored so that higher scores reflect Emotional Stability.

The Big Five factors show generally positive correlations with each other, which is a typical finding. There is some evidence for convergent and discriminant validity of the meta-traits. The highest correlations are for Agreeableness and Emotional Stability, r = .406 (Stability), Conscientiousness and Emotional Stability, r = .379 (Stability), Openness and Extraversion, r = .351 (Plasticity), and Agreeableness and Conscientiousness, r = .323 (Stability).

However, a model that tried to explain the Big Five correlations with two independent meta-traits reduced model fit, CFI = .972, RMSEA = .074. As can be seen in Figure 1, DeYoung and Peterson solved this problem by letting the Stability and Plasticity factors correlate without providing a theoretical explanation for this correlation. Adding this correlation to the model improved model fit.

It is now possible to add conformity to the model to reproduce the published results. Model fit remained acceptable, but the standardized effect of Stability on Conformity exceeded 1, b = 1.30. This problem could be solved by relaxing the equality constraint for the loadings of Extraversion and Openness on Plasticity, which was needed in the model without a criterion. However, even this model had the problem that the residual variance of conformity was negative. The reason is that the model is misspecified.

The key problem with this model is the ad-hoc, atheoretical correlation between the two higher-order factors. With the benefit of hindsight, we know from multi-trait multi-method studies that correlations among all Big Five traits are an artifact of response styles (Biesanz & West, 2004). One of these studies was even published by DeYoung (2006), so there should be no disagreement with him. Anusic et al. (2009) showed that we do not need multi-method data to control for these rating biases. Instead, a method factor can be added to the model. I have improved on Anusic et al.’s approach and started to model this method factor as a factor that has a direct influence on the indicators. As a result, the Big Five factors are independent of method variance. In this model, Stability and Plasticity are independent, if they are identified.
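
To make the structure of this model concrete, the sketch below shows how such a method-factor model could be specified in lavaan-style syntax (here via the Python package semopy). This is only an illustration of the model structure described above, not the code used for the reported analyses; all variable names are hypothetical placeholders for the TDA and NEO scale scores and the two conformity measures.

```python
# Illustrative specification (not the analysis code used here) of a model in which
# an evaluative method factor loads directly on all Big Five indicators, so that
# the meta-traits Stability and Plasticity can be estimated free of rating biases.
# Variable names are hypothetical placeholders; "data" would hold participant-level scores.
from semopy import Model  # semopy accepts lavaan-style model descriptions

MODEL_DESC = """
# Big Five factors, each measured by a TDA and a NEO scale
neuroticism =~ tda_emotional_stability + neo_neuroticism
agreeableness =~ tda_agreeableness + neo_agreeableness
conscientiousness =~ tda_conscientiousness + neo_conscientiousness
extraversion =~ tda_surgency + neo_extraversion
openness =~ tda_intellect + neo_openness

# Method (halo / response-style) factor with direct loadings on all indicators
method =~ tda_emotional_stability + neo_neuroticism + tda_agreeableness + neo_agreeableness + tda_conscientiousness + neo_conscientiousness + tda_surgency + neo_extraversion + tda_intellect + neo_openness

# Higher-order meta-traits
stability =~ neuroticism + agreeableness + conscientiousness
plasticity =~ extraversion + openness

# Conformity factor with two indicators, regressed on the meta-traits
conformity =~ impression_management + lie_scale
conformity ~ stability + plasticity
"""
# In the model described above, the method factor would additionally be constrained
# to be uncorrelated with the trait factors (orthogonality constraints on covariances).

# model = Model(MODEL_DESC)
# model.fit(data)          # data: pandas DataFrame with the indicator variables
# print(model.inspect())   # parameter estimates
```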

Figure 2 shows the results.

In this model, Plasticity was no longer a significant predictor of conformity, b = -.064, but the small sample size does not provide precise effect size estimates, 95%CI = -.389 to .261. The standardized coefficient for Stability remained greater than 1, b = 1.124, but the 95%CI included 1, 95%CI = .905 to 1.343.

This pointed to another crucial problem with DeYoung and Peterson’s model. Their model assumes that the unique variances of Neuroticism, Agreeableness, and Conscientiousness are unrelated to conformity. This assumption might be false. An alternative model would still assume that Stability is related to Conformity, but that this relationship is indirect; that is, it is mediated by the Big Five factors. This model fitted the data slightly better, but fit alone cannot distinguish between these two models, CFI = .975, RMSEA = .059.

More importantly, in this model the residual variance of conformity was positive, suggesting that conformity is not fully explained by the Big Five factors. About one-quarter of the variance in conformity was unexplained, uniqueness = 28%. The total indirect effect of Stability on Conformity was b = .61, implying that .61^2 = 27% of the variance in Conformity was explained by Stability. This implies that the remaining (1 - .28) - .27 = 45% of the explained variance in Conformity is explained by unique variances of the Big Five Stability factors (Neuroticism, Agreeableness, & Conscientiousness).

The new analyses of the results suggest that the published model is misleading in several ways.

1. Plasticity is not a negative predictor of Conformity

2. Stability explains 30% of the variance in Conformity, not 100%.

3. The correlations of Agreeableness, Conscientiousness, and Neuroticism with Conformity are not spurious (i.e., Stability is a third variable). Instead, Agreeableness, Conscientiousness, and Neuroticism mediate the relationship of Stability with Conformity.

4. The published model overestimates the amount of shared variance among the Big Five factors because it does not control for response biases and makes false assumptions about causality.

Does it matter?

The discussion section of the article used the model to make wide-reaching claims about personality, drawing explicitly on the finding that plasticity is a negative predictor of conformity.

As shown here, these conclusions are based on a false model. At best, we can conclude from this article that (a) the meta-traits were still identified even after response styles were controlled and (b) conformity measures appear to be related to Stability factors and not Plasticity factors. However, since the publication of this article, better studies with multi-method data have examined how Big Five factors are correlated (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). These studies show mixed results and are still limited by the use of scale scores as indicators of the Big Five factors. Thus, it remains unclear whether meta-traits really exist and how much variance in the Big Five traits they explain.

The existence of meta-traits is also not very important for studies that try to predict criterion variables like conformity from the Big Five. There is no theoretical justification to assume that the unique variance components of the Big Five are unrelated to the criterion. As a result, the Big Five can be used as predictors and any effect of the meta-traits would show up as an indirect effect that is mediated by the Big Five.

This model was also used by another Jordan Peterson student in a study that predicted environmental concerns from the Big Five (Hirsh, 2010).

The most notable finding is that Neuroticism, Agreeableness, and Conscientiousness are all positive predictors of environmental concerns. This is a problem for a model that assumes Stability is a positive predictor, because Neuroticism loads negatively on Stability, and a negative loading combined with a positive effect of Stability implies a negative correlation between Neuroticism and environmental concerns. Once more, we see that it is unreasonable to assume that the unique variances of the Big Five are unrelated to criterion variables. Criterion variables therefore cannot be used to validate the meta-traits. What would be needed are causal factors that produce the shared variance among Big Five traits. However, it has been difficult to find specific causes of personality variation. Thus, the only evidence for these factors is limited to patterns of correlations among Big Five measures. Even if these correlations are real, they do not imply that the unique variances in the Big Five are irrelevant. Thus, from a practical point of view, it is irrelevant whether the Big Five are modeled as correlated factors or with meta-traits that explain these correlations in terms of hypothetical common causes.

Estimating The Reproducibility of Psychological Science in 2021

Psychology is the science of human affect, behavior, and cognition. Since the beginning of psychological science, psychologists have debated fundamental questions about their science. Recently, these self-reflections of a science on itself have been called meta-science. Many meta-psychological discussions are theoretical. However, some meta-psychological articles rely on empirical data. For example, Cohen’s (1962) seminal investigation of statistical power in social and clinical psychology provided an empirical estimate of the statistical power to detect small, moderate, or large effect sizes.

Another empirical meta-psychological contribution was the reproducibility project by the Open Science Collaboration (2015). The project reported the outcome of 100 replication attempts of studies that were published in 2008 in three journals. The key finding was that out of 97 original studies that reported a statistically significant result, only 36 replication studies reproduced a statistically significant result; a success rate of 37%.

The low success rate has been widely cited as evidence that psychological science has a replication crisis. It also justifies concerns about the credibility of other significant results that may have an equally low probability of a successful replication.

Optimists point out that some psychology journals have implemented reforms that may have raised the replicability of published findings, such as requesting a priori sample size justifications based on power analysis and pre-registration of data analysis plans. It is also possible that psychologists voluntarily changed their research practices after they became aware of the low replicability of their findings. Finally, some psychologists may have lowered the significance criterion to reduce the risk of false positive results that do not replicate. Yet, as of today, no empirical evidence exists that these reforms have made a notable, practically significant contribution to the replication rate in psychological science.

In this blog post, I provide a new empirical estimate of the replicability of psychological science that relies on published statistical results. It is possible to predict the outcome of replication studies based on published statistical results because both the results of the original study and the outcome of the replication study are a function of statistical power and sampling error (Brunner & Schimmack, 2020). Studies with higher statistical power are more likely to produce smaller p-values and are more likely to replicate. As a result, smaller p-values imply higher replicability. The statistical challenge is only to find a model that can use published p-values to make predictions about the success rate of replication studies. Brunner and Schimmack (2020) developed and validated z-curve as a method that can estimate the expected replication rate (ERR) based on the distribution of significant p-values after converting them into z-scores. Although the method performs well in simulation studies, it is also necessary to validate the method against the outcome of actual replication studies. The results from the OSC reproducibility project provide an opportunity to do so.

For any empirical study, it is necessary to clearly define the populations and to ensure adequate sampling from these populations. In the present context, the populations consist of the quantitative results (e.g., F-tests, t-tests, or other statistical tests) that were used to test hypotheses. Most articles report several statistical tests for each sample. These statistical tests differ in importance. The reproducibility project focused on one statistical result to evaluate whether a replication study was successful. This result was typically chosen because it was deemed the most important or at least one of the most important results. These results are often called focal or critical tests. Ideally, statistical prediction models that aim to predict the replicability of focal tests would rely on coding of focal hypothesis tests. The main problem with this approach is that coding of focal hypothesis tests requires trained coders and is time consuming.

An alternative approach uses automatic extraction of test statistics from published articles. The advantage of this approach is that it is quick and produces large representative samples of the results published in psychology journals. The key disadvantage is that this approach samples from the population of all test statistics that are detected by the extraction method, and this population is different from the population of focal hypothesis tests. Therefore, the predictions of the statistical model can be biased to the extent that the two populations have different replication rates. This does not mean that these estimates are useless. Rather, a comparison with actual replication rates can be used to correct for this bias and make more accurate predictions about the replication rate of focal hypothesis tests. Ideally, these estimates can be validated in the future using hand-coding of focal hypothesis tests and actual reproducibility projects of articles published in 2021.

Validation of Z-Curve Predictions with the Reproducibility Project Results

The Reproducibility Project invited replications of studies published in three journals: Psychological Science, Journal of Experimental Psychology: Learning, Memory, and Cognition, and Journal of Personality and Social Psychology. Only articles published in 2008 were eligible.

To predict the 37% success rate (the predictor data precede the criterion), I downloaded all articles from these three journals and searched them for reported chi2, t-test, F-test, and z-test results. Only results reported in the text were included. The extraction method found 10,951 statistical results. Test results were converted into absolute z-scores. The histogram of z-scores is shown in Figure 1.
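
The core of such a pipeline is simple: regular expressions pull the reported test statistics out of the article text, each statistic is converted into a p-value, and the p-value is converted into an absolute z-score. The following is a simplified sketch of this logic (my illustration, not the actual extraction code), covering only t-tests and F-tests:

```python
# Simplified sketch of automatic test-statistic extraction and conversion to
# absolute z-scores (illustration only; the actual pipeline also handles chi2 and z-tests).
import re
from scipy import stats

text = "The effect was significant, t(48) = 2.53, p = .015, and F(1, 96) = 8.21, p < .01."

def to_z(p_two_sided):
    """Convert a two-sided p-value into an absolute z-score."""
    return stats.norm.isf(p_two_sided / 2)

z_scores = []
# t-tests: t(df) = value
for df, t in re.findall(r"t\((\d+)\)\s*=\s*(-?\d+\.?\d*)", text):
    p = 2 * stats.t.sf(abs(float(t)), int(df))
    z_scores.append(to_z(p))
# F-tests: F(df1, df2) = value; the F-test p-value is converted the same way
for df1, df2, f in re.findall(r"F\((\d+),\s*(\d+)\)\s*=\s*(\d+\.?\d*)", text):
    p = stats.f.sf(float(f), int(df1), int(df2))
    z_scores.append(to_z(p))

print([round(z, 2) for z in z_scores])  # -> approximately [2.44, 2.8]
```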

Visual inspection of the z-curve shows that the peak of the distribution is right at the criterion for statistical significance (z = 1.96, p = .05 two-sided). The figure also shows clear evidence of selection for significance which is a well-known practice in psychology journals (Sterling, 1959; Sterling et al., 1995). The key result is the expected replication rate of 64%. This result is much higher than the actual replication rate of 37%. There are several explanations for this discrepancy. One explanation is of course that the estimate is based on a different population of test statistics. However, even predictions of hand-coded test results overestimate replication outcomes (Schimmack, 2020). Another reason is that z-curve predictions are based on the idealistic assumption that replication studies reproduce the original study exactly. However, it is likely that replication studies deviate at least in some minor details from the original studies. When selection for significance is present, this variation across studies implies regression to the mean and a lower replication rate. Thus, z-curve predictions are expected to overestimate actual success rates when selection for significance is present, which it clearly is.

Bartos and Schimmack (2021) suggested that the expected discovery rate (EDR) could be a better estimate of the success rate of actual replication studies. The difference between the EDR and the ERR is again a difference between populations. The EDR is based on the population of all studies that were conducted. The ERR is based on the population of studies that were conducted and produced a significant result. The significance filter should select studies with higher power. However, if effect sizes vary across replications of the same study, this selection is imperfect, and a study with less power could be selected because it produced an unusually large effect size that cannot be replicated. If the selection process is totally unable to distinguish between studies with higher or lower power, the success rate of replication studies would match the EDR estimate. The EDR estimate of 26% is lower than the actual success rate of 37%, but not by much.

Based on these results, we see that z-curve correctly predicts that the actual success rate is higher than the EDR and lower than the ERR estimate. Using the average of the two estimates as the best prediction, we get a prediction of 45%, compared to the actual outcome of 37%. The remaining discrepancy could be partially due to the difference in populations of test statistics. Moreover, the 37% estimate is based on a small sample of studies and the difference may just be a chance finding.

In conclusion, the present results suggest that the average of the ERR and EDR estimates of z-curve models based on automatically extracted test statistics can predict actual replication outcomes. Of course, this conclusion is based on a single observation (N = 1), but the problem is that there are no other actual replication outcomes that could be used to cross-validate this result.

Replication in 2010

Research practices did not change notably between 2008 and 2010. Furthermore, 2008 was arbitrarily selected by the Open Science Collaboration to estimate the reproducibility of psychological science. The following results therefore replicate the z-curve prediction with a new dataset and examine the generalizability of the reproducibility project results across time.

The results are very similar, indeed, and the predicted replication rate for actual replication studies is (62 + 22)/2 = 42%.

Expanding the Sample of Journals

The aim of the OSC project was to estimate the reproducibility of psychological science. A major limitation of the study was the focus on three journals that publish mostly experimental studies in cognitive and social psychology. However, psychology is a much more diverse science that studies development, personality, mental health, and the application of psychology in many areas. It is currently unclear whether replication rates in these other areas of psychology differ from those in experimental cognitive and social psychology. To examine this question, I downloaded all articles from 121 psychology journals published in 2010 (a list of the journals can be found here). Automatic extraction of test statistics from these journals produced 109,117 z-scores. Figure 3 shows the z-curve plot.

The observed discovery rate (i.e., the percentage of statistically significant results) is identical for the 3 OSC journals and the broader set of 120 journals, 72%. The expected discovery rate is slightly higher for the 120 journals, 28% vs. 22%. The expected replication rate is also slightly higher for the broader set of journals, 67% vs. 62%. This implies a slightly higher predicted success rate if a random sample of studies from all areas of psychology were replicated, 48% vs. 42%. However, the difference of 6 percentage points is small and not substantively meaningful. Based on these results, it is reasonable to generalize the results of the OSC project to psychology in general. This average estimate may hide differences between disciplines. For example, the OSC project found that cognitive studies were more replicable than social studies, 50% vs. 25%. Results for individual journals also suggest differences between other disciplines.

Replicability in 2021

Undergraduate students at the University of Toronto Mississauga (Alisar Abdelrahman, Azeb Aboye, Whitney Buluma, Bill Chen, Sahrang Dorazahi, Samer El-Galmady, Surbhi Gupta, Ismail Kanca, Mustafa Khekani, Safana Nasir, Amber Saib, Mohammad Shahan, Swikriti Singh, Stuti Patel, Chuanyu Wu) downloaded all articles published in the same 121 journals in 2021. Automatic extraction of test statistics produced 161,361 z-scores. The increase is due to the publication of more articles in some of the journals.

The expected discovery rate increased from 28% in 2010 to 43% in 2021. The expected replication rate showed a smaller increase from 67% to 72%. Based on these results, the z-curve model predicts that replications of a representative sample of focal hypothesis tests from 2021 would produce a success rate of (43 + 72)/2 = 58%.

Interpretation

The key finding is that replicability in psychological science has increased in response to evidence that the replication rate in psychology is too low. However, replicability has increased only slightly and even a replication rate of 60% suggests that many published studies are underpowered. Moreover, results in 2021 still show evidence of selection for significance; the ODR is 68%, but the EDR is only 48%. It is well known that underpowered studies that are selected for significance produce inflated effect size estimates. Thus, readers of published articles need to be careful when they interpret effect size estimates. Moreover, results of single studies are insufficient to draw strong conclusions. Replication studies are needed to provide conclusive evidence of robust effects.

In sum, these results suggest that psychological science is improving. Whether the amount of improvement over the span of a decade is sufficient is open to subjective interpretation. At least there is some improvement. This is noteworthy because many previous meta-psychological studies found no signs of improvement in response to concerns about the replicability of published results (Sedlmeier & Gigerenzer, 1989). The present results show that meta-psychology can provide empirical information about psychological science nearly in real time. However, statistical predictions need to be complemented by actual replication studies, and meta-psychologists should conduct a new reproducibility project with a broader range of journals and articles published in recent years. This blog post pre-registers the prediction that the success rate will be higher than the 37% rate in the original reproducibility project and will fall between 50% and 60%.

Estimating the False Discovery Risk in Psychological Science

Abstract

Since 2011, the credibility of psychological science has been in doubt. A major concern is that questionable research practices could have produced many false positive results, and it has been suggested that most published results are false. Here we present an empirical estimate of the false discovery risk using a z-curve analysis of randomly selected p-values from a broad range of journals that span most disciplines of psychology. The results suggest that no more than a quarter of published results could be false positives. We also show that the false positive risk can be reduced to less than 5% by using alpha = .01 as the criterion for statistical significance. This remedy can restore confidence in the direction of published effects. However, published effect sizes cannot be trusted because the z-curve analysis shows clear evidence of selection for significance, which inflates effect size estimates.

Introduction

Several events in the early 2010s led to a credibility crisis in psychology. Because journals selectively publish only statistically significant results, statistical significance loses its, well, significance. Every published focal hypothesis test will be statistically significant, and it is unclear which of these results are true positives and which are false positives.

A key article that contributed to the credibility crisis was Simmons, Nelson, and Simonsohn’s (2011) article “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.”

The title made a bold statement that it is easy to obtain statistically significant results even when the null-hypothesis is true. This led to concerns that many, if not most, published results are indeed false positive results. Many meta-psychological articles quoted Simmons et al.’s (2011) article to suggest that there is a high risk or even a high rate of false positive results in the psychological literature, including my own 2012 article.

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

The Appendix lists citations from influential meta-psychological articles that imply a high false positive risk in the psychological literature. Only one article suggested that fears about high false positive rates may be unwarranted (Stroebe & Strack, 2014). In contrast, other articles have suggested that false positive rates might be as high as 50% or more (Szucs & Ioannidis, 2017).

There have been two noteworthy attempts at estimating the false discovery rate in psychology. Szucs and Ioannidis (2017) automatically extracted p-values from five psychology journals and estimated the average power of the extracted t-tests. They then used this power estimate in combination with the assumption that psychologists discover one true, non-zero effect for every 13 true null-hypotheses to suggest that the false discovery rate in psychology exceeds 50%. The problem with this estimate is that it relies on the questionable assumption that psychologists test only a very small percentage of true hypotheses.

The other article tried to estimate the false positive rate based on 70 of the 100 studies that were replicated in the Open Science Collaboration project (Open Science Collaboration, 2015). The statistical model estimated that psychologists test 93 true null-hypotheses for every 7 true effects (true positives), and that true effects are tested with 75% power (Johnson et al., 2017). This yields a false positive rate of about 50%. The main problem with this study is the reliance on a small, unrepresentative sample of studies that focused heavily on experimental social psychology, a field that triggered concerns about the credibility of psychology in general (Schimmack, 2020). Another problem is that point estimates based on a small sample are unreliable.

To provide new and better information about the false positive risk in psychology, we conducted a new investigation that addresses three limitations of the previous studies. First, we used hand-coding of focal hypothesis tests rather than automatic extraction of all test statistics. Second, we sampled from a broad range of journals that cover all areas of psychology rather than focusing narrowly on experimental psychology. Third, we used a validated method to estimate the false discovery risk based on an estimate of the expected discovery rate (Bartos & Schimmack, 2021). In short, the maximum false discovery risk decreases as a monotonic function of the discovery rate (i.e., the percentage of p-values below .05) (Soric, 1989).
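
Soric’s bound ties the maximum false discovery risk directly to the discovery rate. A minimal sketch of this relationship (using alpha = .05, as in the analyses reported below):

```python
# Soric's (1989) upper bound on the false discovery risk, given a discovery
# rate (the proportion of all conducted tests that are significant) and alpha.
def soric_fdr(discovery_rate, alpha=0.05):
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# The lower the discovery rate, the higher the maximum false discovery risk:
print(round(soric_fdr(0.05), 2))  # 1.0  -> an EDR of 5% allows up to 100% false positives
print(round(soric_fdr(0.20), 2))  # 0.21 -> an EDR of 20% caps the risk at about 21%
```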

Z-curve relies on the observation that false positives and true positives produce different distributions of p-values. To fit a model to the distribution of significant p-values, z-curve transforms p-values into absolute z-scores. We illustrate z-curve with two simulation studies. The first simulation is based on Simmons et al.’s (2011) scenario in which the combination of four questionable research practices inflates the false positive risk from 5% to 60%. In our simulation, we assumed an equal number of true null-hypotheses (effect size d = 0) and true hypotheses with small to moderate effect sizes (d = .2 to .5). The use of questionable research practices also increases the chances of getting a significant result for true hypotheses. In our simulation, the probability of obtaining significance with a true H0 was 58%, whereas the probability of obtaining significance with a true H1 was 93%. Given the 1:1 ratio of H0 and H1 that were tested, this yields a false discovery rate of 39%.
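
The false discovery rate follows directly from these simulation parameters, as the short calculation below illustrates (rounding explains the small difference from the 39% reported above):

```python
# False discovery rate implied by the simulated scenario: equal numbers of true
# null hypotheses and true effects, with p-hacking inflating the significance
# rate to 58% under H0 and 93% under H1.
share_h0, share_h1 = 0.5, 0.5
sig_given_h0, sig_given_h1 = 0.58, 0.93

false_discoveries = share_h0 * sig_given_h0
all_discoveries = false_discoveries + share_h1 * sig_given_h1
print(round(false_discoveries / all_discoveries, 2))  # 0.38, close to the 39% reported above
```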

Figure 1 shows that questionable research practices produce a steeply declining z-curve. Based on this shape, z-curve estimates a discovery rate of 5%, with a 95%CI ranging from 5% to 10%. This translates into estimates of the false discovery risk of 100% with a 95%CI ranging from 46% to 100% (Soric, 1989). The reason why z-curve provides a conservative estimate of the false discovery risk is that p-hacking changes the shape of the distribution in a way that produces even more z-values just above 1.96 than mere selection for significance would produce. In other words, p-hacking destroys evidential value when true hypotheses are being tested. It is not necessary to simulate scenarios in which even more true null-hypotheses are being tested because this would make the z-curve even steeper. Thus, Figure 1 provides a prediction for our z-curve analyses based on actual data, if psychologists heavily rely on Simmons et al.’s recipe to produce significant results.

Figure 2 is based on a simulation of Johnson et al.’s (2013) scenario with a 9% discovery rate (9 true hypotheses for every 100 hypothesis tests), a false discovery rate of 50%, and power to detect true effects of 75%. Johnson et al. did not assume or model p-hacking.

The z-curve for this scenario also shows a steep decline that can be attributed to the high percentage of false positive results. However, there is also a notable tail with z-values greater than 3 that reflects the influence of true hypotheses tested with adequate power. In this scenario, the expected discovery rate is higher, with a 95%CI ranging from 7% to 20%. This translates into a 95%CI for the false discovery risk ranging from 21% to 71% (Soric, 1989). This interval contains the true value of 50%, although the point estimate, 34%, underestimates the true value. Thus, we recommend using the upper limit of the 95%CI as an estimate of the maximum false discovery rate that is consistent with the data.

We now turn to real data. Figure 3 shows a z-curve analysis of Kühberger, Fritz, and Scherndl’s (2014) data. The authors conducted an audit of psychological research by randomly sampling 1,000 English-language articles published in the year 2007 that were listed in PsycINFO. This audit produced 344 significant p-values that could be subjected to a z-curve analysis. The results differ notably from the previous results. The expected discovery rate is higher and implies a much smaller false discovery risk of only 9%. However, due to the small set of studies, the confidence interval is wide and allows for nearly 50% false positive results.

To produce a larger set of test statistics, my students and I have hand-coded over 1,000 randomly selected articles from a broad range of journals (Schimmack, 2021). These data were combined with Motyl et al.’s (2017) coding of social psychology journals. The time period spans the years 2008 to 2014, with a focus on the years 2009 and 2010. This dataset produced 1,715 significant p-values. The estimated false discovery risk is similar to the estimate for Kühberger et al.’s (2014) studies. Although the point estimate of the false discovery risk is a bit higher, 12%, the upper bound of the 95%CI is lower because the confidence interval is tighter.

Given the similarity of the results, we combined the two datasets to obtain an even more precise estimate of the false discovery risk based on 2,059 significant p-values. However, the upper limit of the 95%CI decreased only slightly from 30% to 26%.

The most important conclusion from these findings is that concerns about the prevalence of false positive results in psychology journals have been based on exaggerated assumptions. The present results suggest that at most a quarter of published results are false positives and that actual z-curves look very different from those implied by the influential simulation studies of Simmons et al. (2011). Our empirical results show no evidence that massive p-hacking is a common practice.

However, a false positive rate of 25% is still unacceptably high. Fortunately, there is an easy solution to this problem because the false discovery risk depends on the significance threshold. Based on their pessimistic estimates, Johnson et al. (2015) suggested lowering alpha to .005 or even .001. However, these stringent criteria would render most published results statistically non-significant. We suggest lowering alpha to .01. Figure 6 shows the rationale for this recommendation by fitting z-curve with alpha = .01 (i.e., the red vertical line that represents the significance criterion is moved from 1.96 to 2.58).
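
The new criterion value follows directly from the alpha level; for a two-sided test, the critical z-value is the quantile of the standard normal distribution that leaves alpha/2 in the upper tail:

```python
# The z-value that corresponds to a two-sided alpha level: 1.96 for .05, 2.58 for .01.
from scipy.stats import norm

for alpha in (0.05, 0.01):
    print(alpha, round(norm.isf(alpha / 2), 2))
# 0.05 1.96
# 0.01 2.58
```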

Lowering alpha to .01 reduces the percentage of significant results from 83% (not counting marginally significant results, p < .10) to 53%. Thus, the expected discovery rate decreases, but the more stringent criterion for significance lowers the false discovery risk to 4%, and even the upper limit of the 95%CI is just 4%.

It is likely that discovery rates vary across journals and disciplines (Schimmack, 2021). In the future, it may be possible to make more specific recommendations for different disciplines or journals based on their discovery rates. Journals that publish riskier hypothesis tests or studies with modest power would need a more stringent significance criterion to maintain an acceptable false discovery risk.

An alpha level of .01 is also recommended by Simmons et al.’s (2011) simulation studies of p-hacking. Massive p-hacking that inflates the false positive risk from 5% to 61% with alpha = .05 produces only 22% false positives with alpha = .01. With milder forms of p-hacking, the probability of obtaining a p-value below .01 when the null-hypothesis is true is only 8%. Ideally, open science practices like pre-registration will curb the use of questionable practices in the future. Increasing sample sizes will also help to lower the false positive risk. A z-curve analysis of new studies can be used to estimate the current false discovery risk and may suggest that even the traditional alpha level of .05 is able to maintain a false discovery risk below 5%.

While the present results may be considered good news relative to the scenario that most published results cannot be trusted, they do not change the fact that some areas of psychology have a replication crisis (Open Science Collaboration, 2015). The z-curve results show clear evidence of selection for significance, which leads to inflated effect size estimates. Studies suggest that effect sizes are often inflated by more than 100% (Open Science Collaboration, 2015). Thus, published effect size estimates cannot be trusted even if p-values below .01 show the correct sign of an effect. The present results also imply that effect size meta-analyses that did not correct for publication bias produce inflated effect size estimates. For these reasons, many meta-analyses have to be reexamined with statistical tools that correct for publication bias.

Appendix

“Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals” (Button et al., 2013; 3,316 citations).

“In a theoretical analysis, Ioannidis estimated that publishing and analytic practices make it likely that more than half of research results are false and therefore irreproducible” (Open Science Collaboration, 2015, aac4716-1)

“There is increasing concern that most current published research findings are false. (Ioannidis, 2005, abstract)” (Cumming, 2014, p. 7, 1,633 citations).

“In a recent article, Simmons, Nelson, and Simonsohn (2011) showed how, due to the misuse of statistical tools, significant results could easily turn out to be false positives (i.e., effects considered significant whereas the null hypothesis is actually true).” (Leys et al., 2013, p. 765, 1,406 citations)

“During data analysis it can be difficult for researchers to recognize P-hacking or data dredging because confirmation and hindsight biases can encourage the acceptance of outcomes that fit expectations or desires as appropriate, and the rejection of outcomes that do not as the result of suboptimal designs or analyses. Hypotheses may emerge that fit the data and are then reported without indication or recognition of their post hoc origin. This, unfortunately, is not scientific discovery, but self-deception. Uncontrolled, it can dramatically increase the false discovery rate” (Munafò et al., 2017, p. 2, 1,010 citations)

“Just how dramatic these effects can be was demonstrated by Simmons, Nelson, and Simonsohn (2011) in a series of experiments and simulations that showed how greatly QRPs increase the likelihood of finding support for a false hypothesis.” (John et al., 2012, p. 524, 877 citations)

“Simonsohn’s simulations have shown that changes in a few data-analysis decisions can increase the false-positive rate in a single study to 60%” (Nuzzo, 2014, 799 citations).

“the publication of an important article in Psychological Science showing how easily researchers can, in the absence of any real effects, nonetheless obtain statistically significant differences through various questionable research practices (QRPs) such as exploring multiple dependent variables or covariates and only reporting these when they yield significant results (Simmons, Nelson, & Simonsohn, 2011)” (Pashler & Wagenmakers, 2012, p. 528, 736 citations)

“Even seemingly conservative levels of p-hacking make it easy for researchers to find statistically significant support for nonexistent effects. Indeed, p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables (Simmons et al., 2011).” (Simonsohn, Nelson, & Simmons, 2014, p. 534, 656 citations)

“Recent years have seen intense interest in the reproducibility of scientific results and the degree to which some problematic, but common, research practices may be responsible for high rates of false findings in the scientific literature, particularly within psychology but also more generally” (Poldrack et al., 2017, p. 115, 475 citations)

“especially in an environment in which multiple comparisons or researcher dfs (Simmons, Nelson, & Simonsohn, 2011) make it easy for researchers to find large and statistically significant effects that could arise from noise alone” (Gelman & Carlin, 2014)

“In an influential recent study, Simmons and colleagues demonstrated that even a moderate amount of flexibility in analysis choice—for example, selecting from among two DVs or optionally including covariates in a regression analysis—could easily produce false-positive rates in excess of 60%, a figure they convincingly argue is probably a conservative estimate (Simmons et al., 2011).” (Yarkoni & Westfall, 2017, p. 1103, 457 citations)

“In the face of human biases and the vested interest of the experimenter, such freedom of analysis provides access to a Pandora’s box of tricks that can be used to achieve any desired result (e.g., John et al., 2012; Simmons, Nelson, & Simonsohn, 2011)” (Wagenmakers et al., 2012, p. 633, 425 citations)

“Simmons et al. (2011) illustrated how easy it is to inflate Type I error rates when researchers employ hidden degrees of freedom in their analyses and design of studies (e.g., selecting the most desirable outcomes, letting the sample size depend on results of significance tests).” (Bakker et al., 2012, p. 545, 394 citations).

“Psychologists have recently become increasingly concerned about the likely overabundance of false positive results in the scientific literature. For example, Simmons, Nelson, and Simonsohn (2011) state that “In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not” (p. 1359)” (Maxwell, Lau, & Howard, 2015, p. 487)

“Moreover, the highest impact journals famously tend to favor highly surprising results; this makes it easy to see how the proportion of false positive findings could be even higher in such journals.” (Pashler & Harris, 2012, p. 532, 373 citations)

“There is increasing concern that many published results are false positives [1,2] (but see [3]).” (Head et al., 2015, p. 1, 356 citations)

“Quantifying p-hacking is important because publication of false positives hinders scientific progress” (Head et al., 2015, p. 2, 356 citations).

“To be sure, methodological discussions are important for any discipline, and both fraud and dubious research procedures are damaging to the image of any field and potentially undermine confidence in the validity of social psychological research findings. Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology.” (Stroebe & Strack, 2014, p. 60, 291 citations)

“Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature” (Szucs & Ioannidis, 2017, p. 1, 269 citations)

“Notably, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias” (Szucs & Ioannidis, 2017, p. 12, 269 citations)

“In all, the combination of low power, selective reporting, and other biases and errors that have been well documented suggest that high FRP can be expected in cognitive neuroscience and psychology. For example, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias.” (Szucs & Ioannidis, 2017, p. 15, 269 citations)

“Many prominent researchers believe that as much as half of the scientific literature—not only in medicine, but also in psychology and other fields—may be wrong [11,13–15]” (Smaldino & McElreath, 2016, p. 2, 251 citations).

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

“A more recent article compellingly demonstrated how flexibility in data collection, analysis, and reporting can dramatically increase false-positive rates (Simmons, Nelson, & Simonsohn, 2011).” (Dick et al., 2015, p. 43, 208 citations)

“In 2011, we wrote “False-Positive Psychology” (Simmons et al. 2011), an article reporting the surprisingly severe consequences of selectively reporting data and analyses, a practice that we later called p-hacking. In that article, we showed that conducting multiple analyses on the same data set and then reporting only the one(s) that obtained statistical significance (e.g., analyzing multiple measures but reporting only one) can dramatically increase the likelihood of publishing a false-positive finding. Independently and nearly simultaneously, John et al. (2012) documented that a large fraction of psychological researchers admitted engaging in precisely the forms of p-hacking that we had considered. Identifying these realities—that researchers engage in p-hacking and that p-hacking makes it trivially easy to accumulate significant evidence for a false hypothesis—opened psychologists’ eyes to the fact that many published findings, and even whole literatures, could be false positive.” (Nelson, Simmons, & Simonsohn, 2018, 204 citations).

“As Simmons et al. (2011) concluded—reflecting broadly on the state of the discipline—“it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis” (p. 1359)” (Earp & Trafimow, 2015, p. 4, 200 citations)

“The second, related set of events was the publication of articles by a series of authors (Ioannidis 2005, Kerr 1998, Simmons et al. 2011, Vul et al. 2009) criticizing questionable research practices (QRPs) that result in grossly inflated false positive error rates in the psychological literature” (Shrout & Rodgers, 2018, p. 489, 195 citations).

“Let us add a new dimension, which was brought up in a seminal publication of Simmons, Nelson & Simonsohn (2011). They stated that researchers actually have so much flexibility in deciding how to analyse their data that this flexibility allows them to coax statistically significant results from nearly any data set” (Forstmeier, Wagenmakers, & Parker, 2017, p. 1945, 173 citations)

“Publication bias (Ioannidis, 2005) and flexibility during data analyses (Simmons, Nelson, & Simonsohn, 2011) create a situation in which false positives are easy to publish, whereas contradictory null findings do not reach scientific journals (but see Nosek & Lakens, in press)” (Lakens & Evers, 2014, p. 278, 139 citations)

“Recent reports hold that allegedly common research practices allow psychologists to support just about any conclusion (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011).” (Koole & Lakens, 2012, p. 608, 139 citations)

“Researchers then may be tempted to write up and concoct papers around the significant results and send them to journals for publication. This outcome selection seems to be widespread practice in psychology [12], which implies a lot of false positive results in the literature and a massive overestimation of ES, especially in meta-analyses” (

“Researcher df, or researchers’ behavior directed at obtaining statistically significant results (Simonsohn, Nelson, & Simmons, 2013), which is also known as p-hacking or questionable research practices in the context of null hypothesis significance testing (e.g., O’Boyle, Banks, & Gonzalez-Mulé, 2014), results in a higher frequency of studies with false positives (Simmons et al., 2011) and inflates genuine effects (Bakker et al., 2012).” (van Assen, van Aert, & Wicherts, 2015, p. 294, 133 citations)

“The scientific community has witnessed growing concern about the high rate of false positives and unreliable results within the psychological literature, but the harmful impact of false negatives has been largely ignored” (Vadillo, Konstantinidis, & Shanks, 2016, p. 87, 131 citations)

“Much of the debate has concerned habits (such as “p-hacking” and the file-drawer effect) which can boost the prevalence of false positives in the published literature (Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014; Simmons, Nelson, & Simonsohn, 2011).” (Vadillo, Konstantinidis, & Shanks, 2016, p. 87, 131 citations)

“Simmons, Nelson, and Simonsohn (2011) showed that researchers without scruples can nearly always find a p < .05 in a data set if they set their minds to it.” (Crandall & Sherman, 2014, p. 96, 114 citations)

Science is self-correcting: JPSP-PPID is not

With over 7,000 citations at the end of 2021, the Ryff and Keyes (1995) article is one of the most highly cited articles in the Journal of Personality and Social Psychology. A trend analysis shows that citations are still increasing, with over 800 citations in the past two years.

Most of these citations refer to the use of Ryff’s measure of psychological well-being and uncritically accept Ryff’s assertion that her PWB measure is valid. The abstract implies that the authors provided empirical support for Ryff’s theory of psychological well-being.

Contemporary psychologists contrast Ryff’s psychological well-being (PWB) with Diener’s (1984) subjective well-being (SWB). In an article with over 1,000 citations, Ryff and Keyes (2002) examined how PWB and SWB are empirically related. The result was a two-factor model that postulates that SWB and PWB are related, but distinct, forms of well-being.

The general acceptance of this model shows that most psychologists lack proper training in the interpretation of structural equation models (Borsboom, 2006), although graphic representations of these models make SEM accessible to readers who are not familiar with matrix algebra. To interpret an SEM model, it is only necessary to know that boxes represent measured variables, ovals represent unmeasured constructs, directed straight arrows represent an assumption that one construct has a causal influence on another construct, and curved bidirectional arrows imply an unmeasured common cause.

Starting from the top, we see that the model implies that an unmeasured common cause produces a strong correlation between two unmeasured variables that are labelled Psychological Well-Being and Subjective Well-Being. These labels imply that the constructs PWB and SWB are represented by unmeasured variables. The direct causal arrows from these unmeasured variables to the measured variables imply that PWB and SWB can be measured because the measured variables reflect the unmeasured variables to some extent. This is called a reflective measurement model (Borsboom et al., 2003). For example, autonomy is a measure of PWB because .38^2 = 14% of the variance in autonomy scores reflects PWB. Of course, this makes autonomy a poor indicator of PWB because the remaining 86% of the variance does not reflect the influence of PWB. This variance in autonomy is caused by other unmeasured influences and is called unique variance, residual variance, or disturbance. It is often omitted from SEM figures because it is assumed that this variance is simply irrelevant measurement error. I added it here because Ryff and users of her measure clearly do not think that 86% of the variance in the autonomy scale is just measurement error. In fact, the scale scores of autonomy are often used as if they are a 100% valid measure of autonomy. The proper interpretation of the model is therefore that autonomy is measured with high validity, but that variation in autonomy is only a poor indicator of psychological well-being.

Examination of the factor loadings (i.e., the numbers next to the arrows from PWB to the six indicators) shows that personal relationships has the highest validity as a measure of PWB, but even for personal relationships, the amount of PWB variance is only .66^2 = 44%.
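
To make the arithmetic transparent, here is a minimal sketch (my own illustration, not code from the original articles) that converts the standardized loadings quoted above into percentages of factor variance and unique variance.

```python
# Standardized loadings reported in the text (reflective measurement model).
loadings = {"autonomy": 0.38, "personal_relationships": 0.66}

for indicator, loading in loadings.items():
    shared = loading ** 2   # variance in the indicator that reflects the latent factor
    unique = 1 - shared     # unique/residual variance (not necessarily measurement error)
    print(f"{indicator}: {shared:.0%} factor variance, {unique:.0%} unique variance")
```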

In a manuscript (doc) that was desk-rejected by JPSP, we challenged this widely accepted model of PWB. We argued that the reflective model does not fit Ryff’s own theory of PWB. In a nutshell, Ryff’s theory of PWB is one of many list-theories of well-being (Sumner, 1996). The theory lists a number of attributes that are assumed to be necessary and sufficient for high well-being.

This theory of well-being implies a different measurement model in which arrows point from the measured variables to the construct of PWB. In psychometrics, these models are called formative measurement models. There is nothing unobserved about formative constructs. They are merely a combination of the measured constructs. The simplest way to integrate information about the components of PWB is to average them. If assumptions about importance are added, the construct could be a weighted average. This model is shown in Figure 2.
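
A formative score of this kind is nothing more than a (possibly weighted) average of the component scales. The sketch below is purely illustrative; the scale values, the 1-6 rating metric, and the equal weights are assumptions, not data from the study.

```python
import numpy as np

# Hypothetical scale scores for the six PWB components (1-6 rating scale assumed).
components = {"autonomy": 4.2, "environmental_mastery": 3.8, "personal_growth": 4.5,
              "positive_relations": 4.0, "purpose_in_life": 3.9, "self_acceptance": 4.1}

# Equal weights; replace with importance weights to obtain a weighted average.
weights = np.ones(len(components))

pwb_score = np.average(list(components.values()), weights=weights)
print(f"Formative PWB score: {pwb_score:.2f}")
```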

The key problem for this model is that it makes no predictions about the pattern of correlations among the measured variables. For example, Ryff’s theory does not postulate whether an increase in autonomy produces an increase in personal growth or a decrease in personal relations. At best, the distinction between PWB and SWB might imply that changes in PWB components are independent of changes in SWB components, but this assumption is highly questionable. For example, some studies suggest that positive relationships improve subjective well-being (Schimmack & Lucas, 2010).

To conclude, JPSP has published two highly cited articles that fitted a reflective measurement model to PWB indicators. In the desk-rejected manuscript, Jason Payne and I presented a new model that is grounded in theories of well-being and that treats PWB dimensions like autonomy and positive relations as possible components of a good life. Our model also clarified the confusion about Diener’s (1984) model of subjective well-being.

Ryff et al.’s (2002) two-factor model of well-being was influenced by Ryan and Deci’s (2001) distinction between two broad traditions in well-being research: “one dealing with happiness (hedonic well-being), and one dealing with human potential (eudaimonic well-being; Ryan & Deci, 2001; see also Waterman, 1993)” (Ryff et al., 2002, p. 1007). We argued that this dichotomy overlooks another important distinction between well-being theories, namely the distinction between subjective and objective theories of well-being (Sumner, 1996). The key difference is that objective theories aim to specify universal aspects of a good life that are based on philosophical analyses of the good life. In contrast, subjective theories reject the notion that universal criteria of a good life exist and leave it to individuals to create their own evaluation standards of a good life (Cantril, 1965). Unfortunately, Diener’s tripartite model of SWB is difficult to classify because it combines objective and subjective indicators. Whereas life-evaluations like life-satisfaction judgments are clearly subjective indicators, the amount of positive affect and negative affect implies a hedonistic conception of well-being. Diener never resolved this contradiction (Busseri & Sadava, 2011), but his writing made it clear that he stressed subjectivity as an essential component of well-being.

It is therefore incorrect to characterize Diener’s concept of SWB as a hedonic or hedonistic conception of well-being. The key contribution of Diener was to introduce psychologists to subjective conceptions of well-being and to publish the most widely used subjective measure of well-being, namely the Satisfaction with Life Scale. In my opinion, the inclusion of PA and NA in the tripartite model was a mistake because it does not allow individuals to choose what they want to do with their lives. Even Diener himself published articles suggesting that positive affect and negative affect are not essential for all people (Suh, Diener, Oishi, & Triandis, 1998). At the very least, it remains an empirical question how important positive affect and negative affect are for subjective life evaluations and whether other aspects of a good life are even more important. This question can be empirically tested by examining how much eudaimonic and hedonic measures of well-being contribute to variation in subjective measures of well-being, which leads to a model in which life-satisfaction judgments are a criterion variable and the other variables are predictor variables.

The most surprising finding was that environmental mastery was a strong unique predictor and a much stronger predictor than positive affect or negative affect (direct effect, b = .66).

In our model, we also allowed for the possibility that PWB attributes influence subjective well-being by increasing positive affect or decreasing negative affect. The total effect is a very strong relationship, b = .78, with more than 50% of the variance in life-satisfaction being explained by a single PWB dimension, namely environmental mastery.
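
The decomposition implied by these numbers can be written out directly. The sketch below only restates the reported coefficients; treating the squared total effect as explained variance is an approximation that ignores other correlated predictors.

```python
# Coefficients reported in the text for environmental mastery -> life satisfaction.
direct_effect = 0.66   # direct path
total_effect = 0.78    # direct + indirect (via positive and negative affect)

indirect_effect = total_effect - direct_effect
variance_explained = total_effect ** 2   # approximate share of life-satisfaction variance

print(f"indirect effect via affect: {indirect_effect:.2f}")
print(f"variance explained: {variance_explained:.0%}")
```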

Other noteworthy findings were that none of the other PWB attributes made a positive (direct or indirect) contribution to life-satisfaction judgments. Autonomy was even a negative predictor. The effects of positive affect and negative affect were statistically significant, but small. This suggests that PA and NA are meaningful indicators of subjective well-being because they reflect a good life, but it provides no evidence for hedonic theories of well-being that suggest positive affect increases well-being no matter how it is elicited.

These results are dramatically different from the published model in JPSP. In that model an unmeasured construct, SWB, causes variation in Environmental Mastery. In our model, environmental mastery is a strong cause of the only subjective indicator of well-being, namely life-satisfaction judgments. Whereas the published model implies that feeling good makes people have environmental mastery, our model suggests that having control over one’s life increases well-being. Call us crazy, but we think the latter model makes more sense.

So, why was our ms. desk rejected without peer-review from experts in well-being research? I post the full decision letter below, but I want to highlight the only comment about our actual work.

A related concern has to do with a noticeable gap between your research question, theoretical framework, and research design. The introduction paints your question in broad strokes only, but my understanding is that you attempt to refine our understanding of the structure of well-being, which could be an important contribution to the literature. However, the introduction does not provide a clear rationale for the alternative model presented. Perhaps even more important, the cross-sectional correlational study of one U.S. sample is not suited to provide strong conclusions about the structure of well-being. At the very least, I would have expected to see model comparison tests to compare the fit of the presented model with those of alternative models. In addition, I would have liked to see a replication in an independent sample as well as more critical tests of the discriminant validity and links between these factors, perhaps in longitudinal data, through the prediction of critical outcomes, or by using behavioral genetic data to establish the genetic and environmental architecture of these factors. Put another way, independent of the validity of the Ryff / Keyes model, the presented theory and data did not convince me that your model is a better presentation of the structure of well-being.

Bleidorn’s comments show that even prominent personality researchers lack a basic understanding of psychometrics and construct validation. For example, it is not clear how longitudinal data can provide answers to questions about construct validity. Examining change is of course useful, but without a valid measure of a construct it is not clear what change in scale scores means. Construct validation precedes studies of stability and change. Similarly, it is only meaningful to examine nature and nurture questions with a clearly defined phenotype. Bleidorn also completely ignores our distinction between hedonic and subjective well-being and the fact that we are the first to examine the relationship between PWB attributes and life-satisfaction.

As psychometricians have pointed out, personality psychologists often ignore measurement questions and are content to treat averaged self-report ratings as operational definitions of constructs that do not require further validation. We think that this blind empiricism is preventing personality psychology from making real progress. It is depressing to see that even the new generation of personality psychologists shows no interest in improving the construct validity of foundational constructs. Fortunately, JPSP-PPID publishes only about 50 articles a year, and there are other outlets to publish our work. Unfortunately, JPSP has a reputation for publishing only the best work, but this prestige is not warranted by the actual quality of published articles. For example, the obsession with longitudinal data is not warranted given evidence that about 80% of the variance in personality measures is stable trait variance that does not change. Repeatedly measuring this trait variance does not add to our understanding of stable traits.

Conclusion

To conclude, JPSP has published two cross-sectional articles on the structure of well-being that continue to be highly cited. We find major problems with the models in these articles, but JPSP is not interested in publishing a criticism of them. To reiterate, the main problem is that Diener’s SWB model is treated as if it were an objective hedonic theory of well-being, when the core aspect of the model is that well-being is subjective and not objective. We thought at least the main editor Rich Lucas, a former Diener student, would understand this point, but expectations are the mother of disappointment. Of course, we could be wrong about some minor or major issues, but the lack of interest in these foundational questions shows just how far psychology is from being a real science. A real science develops valid measures before it examines substantive questions. Psychologists invent measures and study their measures without evidence that these measures reflect important constructs like well-being. Not surprisingly, psychology has produced no consensual theory of well-being that could help people live better lives. This does not stop psychologists from making proclamations about ways to lead a happy or good life. The problem is that these recommendations are all contingent on researchers’ preferred definition of well-being and the measures associated with that tradition/camp/belief system. In this way, psychology is more like (other) religions and less like a science.

Decision Letter

I am writing about your manuscript “Two Concepts of Wellbeing: The Relation Between Psychological and Subjective Wellbeing”, submitted for publication in the Journal of Personality and Social Psychology (JPSP). I have read the manuscript carefully myself, as has the lead Editor at JPSP, Rich Lucas. We read the manuscript independently and then consulted with each other about whether the manuscript meets the threshold for full review. Based on our joint consultation, I have made the decision to reject your paper without sending it for external review. The Editor and I shared a number of concerns about the manuscript that make it unlikely to be accepted for publication and that reduce its potential contribution to the literature. I will elaborate on these concerns below. Due to the high volume of submissions and limited pages available to JPSP, we must limit our acceptances to manuscripts for which there is a general consensus that the contribution is of an important and highly significant level. 
 

  1. Most importantly, papers that rely solely on cross-sectional designs and self-report questionnaire techniques are less and less likely to be accepted here as the number of submissions increases. In fact, such papers are almost always rejected without review at this journal. Although such studies provide an important first step in the understanding of a construct or phenomenon, they have some important limitations. Therefore, we have somewhat higher expectations regarding the size and the novelty of the contribution that such studies can make. To pass threshold at JPSP, I think you would need to expand this work in some way, either by using longitudinal data or or by going further in your investigation of the processes underlying these associations. I want to be clear; I agree that studies like this have value (and I also conduct studies using these methods myself), it is just that many submissions now go beyond these approaches in some way, and because competition for space here is so high, those submissions are prioritized.
  2. A related concern has to do with a noticeable gap between your research question, theoretical framework, and research design. The introduction paints your question in broad strokes only, but my understanding is that you attempt to refine our understanding of the structure of well-being, which could be an important contribution to the literature. However, the introduction does not provide a clear rationale for the alternative model presented. Perhaps even more important, the cross-sectional correlational study of one U.S. sample is not suited to provide strong conclusions about the structure of well-being. At the very least, I would have expected to see model comparison tests to compare the fit of the presented model with those of alternative models. In addition, I would have liked to see a replication in an independent sample as well as more critical tests of the discriminant validity and links between these factors, perhaps in longitudinal data, through the prediction of critical outcomes, or by using behavioral genetic data to establish the genetic and environmental architecture of these factors. Put another way, independent of the validity of the Ryff / Keyes model, the presented theory and data did not convince me that your model is a better presentation of the structure of well-being.
  3. The use of a selected set of items rather than the full questionnaires raises concerns about over-fitting and complicate comparisons with other studies in this area. I recommend using complete questionnaires and – should you decide to collect more data – additional measures of well-being to capture the universe of well-being content as best as you can. 
  4. I noticed that you tend to use causal language in the description of correlations, e.g. between personality traits and well-being measures. As you certainly know, the data presented here do not permit conclusions about the temporal or causal influence of e.g., neuroticism on negative affect or vice versa and I recommend changing this language to better reflect the correlational nature of your data.     

In closing, I am sorry that I cannot be more positive about the current submission. I hope my comments prove helpful to you in your future research efforts. I wish you the very best of luck in your continuing scholarly endeavors and hope that you will continue to consider JPSP as an outlet for your work.


Sincerely,
Wiebke Bleidorn, PhD
Associate Editor
Journal of Personality and Social Psychology: Personality Processes and Individual Differences

Estimating the False Positive Risk in Psychological Science

Abstract: At most one-quarter of published significant results in psychology journals are false positive results. This is surprising news after a decade of false positive paranoia. However, the low false positive rate is not a cause for celebration. It mainly reflects the low prior probability that the nil-hypothesis is true (Cohen, 1994). To produce meaningful results, psychologists need to maintain low false positive risks when they test stronger hypotheses that specify a minimum effect size.

Introduction

Like many other sciences, psychological science relies on null-hypothesis significance testing as the main statistical approach to draw inferences from data. This approach can be dated back to Fisher’s first manual for empirical researchers on how to conduct statistical analyses. If the observed test statistic produces a p-value below .05, the null-hypothesis can be rejected in favor of the alternative hypothesis that the population effect size is not zero. Many criticisms of this statistical approach have failed to change research practices.

Cohen (1994) wrote a sarcastic article about NHST with the title “The Earth Is Round (p < .05).” In this article, Cohen made the bold claim that “my work on power analysis has led me to realize that the nil-hypothesis is always false.” In other words, population effect sizes are unlikely to be exactly zero. Thus, rejecting the nil-hypothesis with a p-value below .05 only tells us something we already know. Moreover, when sample sizes are small, we often end up with p-values greater than .05 that do not allow us to reject a false null-hypothesis. I cite this article only to point out that in the 1990s, meta-psychologists were concerned with low statistical power because it produces many false negative results. In contrast, significant results were considered to be true positive findings. Although often meaningless (e.g., the amount of explained variance is greater than zero), they were not wrong.

Since then, psychology has encountered a radical shift in concerns about false positive results (i.e., significant p-values when the nil-hypothesis is true). I conducted an informal survey on social media. Only 23.7% of Twitter respondents echoed Cohen’s view that false positive results are rare (less than 25%). The majority (52.6%) of respondents assumed that more than half of all published significant results are false positives.

The results were a bit different for the poll in the Psychological Methods Discussion Group on Facebook. Here the majority opted for 25 to 50 percent false positive results.

The shift from the 1990s to the 2020s can be explained by the replication crisis in social psychology that has attracted a lot of attention and has been generalized to all areas of psychology (Open Science Collaboration, 2015). Arguably, the most influential article that contributed to concerns about false positive results in psychology is Simmons, Nelson, and Simonsohn’s (2011) article titled “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” which has been cited 3,203 times. The key contribution of this article was to show that the questionable research practices that psychologists use to obtain p-values below .05 (e.g., using multiple dependent variables) can increase the risk of a false positive result from 5% to over 60%. Moreover, anonymous surveys suggested that researchers often engage in these practices (John et al., 2012). However, even massive use of QRPs will not produce a massive amount of false positive results if most null-hypotheses are false. In this case, QRPs will inflate the effect size estimates (that nobody pays attention to, anyways), but the rate of false positive results will remain low because most tested hypotheses are true.
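
To see how even a single questionable practice inflates the false positive rate, here is a small simulation in the spirit of Simmons et al. (2011); it is my own sketch, not their code, and the specific settings (n = 20 per cell, r = .50 between the two dependent variables) are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sims, rho = 20, 10_000, 0.5
cov = [[1, rho], [rho, 1]]          # two correlated DVs, both with a true effect of zero
false_positives = 0

for _ in range(sims):
    group_a = rng.multivariate_normal([0, 0], cov, size=n)
    group_b = rng.multivariate_normal([0, 0], cov, size=n)
    p1 = stats.ttest_ind(group_a[:, 0], group_b[:, 0]).pvalue
    p2 = stats.ttest_ind(group_a[:, 1], group_b[:, 1]).pvalue
    if min(p1, p2) < 0.05:          # report whichever DV "worked"
        false_positives += 1

# About .09 rather than the nominal .05; stacking several QRPs pushes it much higher.
print(false_positives / sims)
```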

Some scientists have argued that researchers are much more likely to make false assumptions (e.g., the Earth is flat) than Cohen envisioned. Ioannidis (2005) famously declared that most published research findings are false. He based this claim on hypothetical scenarios that produce more than 50% false positive results when 90% of studies test a true null-hypothesis. This assumption is a near complete reversal of Cohen’s assumption that we can nearly always assume that the effect size is not zero. The problem is that the actual rate of true and false hypotheses is unknown. Thus, estimates of false positive rates are essentially projective tests of gullibility and cynicism.

To provide psychologists with scientific information about the false positive risk in their science, we need a scientific method that can estimate the false discovery risk based on actual data rather than hypothetical scenarios. There have been several attempts to do so. So far, the most prominent study was Leek and Jager’s (2014) estimate of the false discovery rate in medicine. They obtained an estimate of 14%. Simulation studies showed some problems with their estimation model, but the superior z-curve method replicated the original result with a false discovery risk of 13%. This result is much more in line with Cohen’s view that most null-hypotheses are false (typically effect sizes are not zero) than with Ioannidis’s claim that the null-hypothesis is true in 90% of all significance tests.

In psychology, the focus has been on replication rates. The shocking finding was that only 25% of significant results in social psychology could be replicated in an honest and unbiased attempt to reproduce the original study (Open Science Collaboration, 2015). This low replication rate leaves ample room for false positive results, but it is unclear how many of the non-significant results were caused by a true null-hypothesis and how many were caused by low statistical power to detect an effect size greater than zero. Thus, this project provides no information about the false positive risk in psychological science.

Another noteworthy project used a representative sample of test results in social psychology journals (Motyl et al., 2017). This project produced over 1,000 p-values that were examined using a number of statistical tools available at that time. The key result was that there was clear evidence of publication bias. That is, focal hypothesis tests nearly always rejected the null-hypothesis, a finding that has been observed since the beginning of social psychology (Sterling, 1959). However, the actual power of studies to do so was much lower, a finding that is consistent with Cohen’s (1962) seminal analysis of power. The results provided no information about the false positive risk, but this valuable dataset could be analyzed with statistical tools that estimate the false discovery risk (Schimmack, 2021). Unfortunately, the number of significant p-values was too small to produce an informative estimate of the false discovery risk (k = 678; 95%CI = .09 to .82).

Results

A decade after the “False Positive Psychology” article rocked psychological science, it remains unclear how much false positive results contribute to replication failures in psychology. To answer this question, we report the results of a z-curve analysis of 1,857 significant p-values that were obtained from hand-coding a representative sample of studies that were published between 2009 and 2014. The years 2013 and 2014 were included to incorporate Motyl et al.’s data. All other coding efforts focussed on the years 2009 and 2010, before concerns about replication failures could have changed research practices. In marked contrast to previous initiatives, the aim was to cover all areas of psychology. To obtain a broad range of disciplines in psychology, a list of 120 journals was compiled (Schimmack, 2021). These journals are the top journals of their disciplines with high impact factors. Students had some freedom in picking journals of their choice. For each journal, articles were selected based on a fixed sampling scheme to code articles 1, 3, 6, and 10 for every set of 10 articles (1,3,6,10,11,13…). The project is ongoing and the results reported below should be considered preliminary. Yet, they do present the first estimate of the false discovery risk in psychological science.

The results replicate many other findings that focal statistical tests are selected because they reject the null-hypothesis. Eighty-one percent of all tests had a p-value below .05. When marginally significant results are included as well, the observed discovery rate increases to 90%. However, the statistical power of studies does not warrant such high success rates. The z-curve estimate of mean power before selection for significance is only 31%; 95%CI = 19% to 37%. This statistic is called the expected discovery rate (EDR) because mean power is equivalent to the long-run percentage of significant results. Based on an insight by Soric (1989), we can use the EDR to quantify the maximum percentage of results that can be false positives, using the formula: FDR = (1/EDR – 1)*(alpha/(1-alpha)). The point estimate of the EDR of 31% corresponds to a point estimate of the False Discovery Risk of 12%. The 95%CI ranges from 8% to 28%. It is important to distinguish between the risk and rate of false positives. Soric’s method assumes that true hypotheses are tested with 100% power. This is an unrealistic assumption. When power is lower the false positive rate will be lower than the false positive risk. Thus, we can conclude from these results that it is unlikely that more than 25% of published significant results in psychology journals are false positive results.
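
For readers who want to verify the arithmetic, here is a minimal sketch of Soric's bound (my own code, not the z-curve package; the confidence interval reported above comes from the z-curve analysis itself rather than from this plug-in formula).

```python
def soric_fdr(edr, alpha=0.05):
    """Soric's (1989) maximum false discovery rate for a given expected discovery rate."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

# Point estimate reported in the text: EDR = 31% at alpha = .05.
print(f"{soric_fdr(0.31):.0%}")   # ~12%
```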

One concern about these results is that the number of test statistics differed across journals and that Motyl et al.’s large set of results from social psychology could have biased the results. We therefore also analyzed the data by journal and then computed the mean FDR and its 95%CI. This approach produced an even lower FDR estimate of 11%, 95%CI = 9% to 15%.

While an FDR of less than 25% may seem like good news in a field that is suffering from false positive paranoia, it is still unacceptably high to ensure that published results can be trusted. Fortunately, there is a simple solution to this problem because Soric’s formula shows that the false discovery risk depends on alpha. Lowering alpha to .01 is sufficient to produce a false discovery risk below 5%. Although this seems like a small adjustment, it results in the loss of the 37% of significant results with p-values between .05 and .01. This recommendation is consistent with two papers that have argued against the blind use of Fisher’s alpha level of .05 (Benjamin et al., 2017; Lakens et al., 2018). The cost of lowering alpha further to .005 would be the loss of another 10% of significant findings (ODR = 47%).
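
Holding the EDR estimate of 31% fixed, the same plug-in formula shows what stricter alpha levels buy; in a full z-curve analysis the EDR would be re-estimated at the new alpha, so these numbers are only illustrative.

```python
def soric_fdr(edr, alpha):
    return (1 / edr - 1) * (alpha / (1 - alpha))

for alpha in (0.05, 0.01, 0.005):
    print(f"alpha = {alpha}: false discovery risk <= {soric_fdr(0.31, alpha):.1%}")
```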

Limitations and Future Directions

No study is perfect. As many women know, the first time is rarely the best time (Higgins et al., 2010). Similarly, this study has some limitations that need to be addressed in future studies.

The main limitation of this study is that the coded statistical tests may not be representative of psychological science. However, the random sampling from journals and the selection of a broad range of journals suggests that sampling bias has a relatively small effect on the results. A more serious problem is that there is likely to be heterogeneity across disciplines or even journals within disciplines. Larger samples are needed to test those moderator effects.

Another problem is that z-curve estimates of the EDR and FDR make assumptions about the selection process that may differ from the actual selection process. The best way to address this problem is to promote open science practices that reduce the selective publishing of statistically significant results.

Eventually, it will be necessary to conduct empirical tests with a representative sample of results published in psychology, akin to the reproducibility project (Open Science Collaboration, 2015). As a first step, studies can be replicated with the original sample sizes. Results that are successfully replicated do not require further investigation. Replication failures need to be followed up with studies that can provide evidence for the null-hypothesis using equivalence testing with a minimum effect size that would be relevant (Lakens, Scheel, & Isager, 2018). This is the only way to estimate the false positive risk by means of replication studies.
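
As an illustration of the kind of equivalence test this would require, here is a minimal two-one-sided-tests (TOST) sketch. It is my own implementation under simplifying assumptions (equal variances, bounds expressed in pooled SD units), not code from Lakens, Scheel, and Isager (2018).

```python
import numpy as np
from scipy import stats

def tost_two_sample(x, y, d_min):
    """Two one-sided tests: reject the hypothesis that the true difference is larger
    than +/- d_min (in pooled SD units). Equivalence is claimed if the returned p < alpha."""
    nx, ny = len(x), len(y)
    s_pooled = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    se = s_pooled * np.sqrt(1 / nx + 1 / ny)
    diff = np.mean(x) - np.mean(y)
    bound = d_min * s_pooled            # equivalence bound on the raw scale
    df = nx + ny - 2
    p_lower = 1 - stats.t.cdf((diff + bound) / se, df)   # H0: diff <= -bound
    p_upper = stats.t.cdf((diff - bound) / se, df)       # H0: diff >= +bound
    return max(p_lower, p_upper)

rng = np.random.default_rng(1)
x, y = rng.normal(0, 1, 200), rng.normal(0, 1, 200)
print(tost_two_sample(x, y, d_min=0.3))
```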

Implications: What Would Cohen Say

The finding that most published results are not false may sound like good news for psychology. However, Cohen would point out that a low rate of false positive results merely reflects the fact that the nil-hypothesis is rarely true. If some hypotheses were true and others were false, NHST (without QRPs) could be used to distinguish between them. However, if most effect sizes are greater than zero, not much is learned from statistical significance. The problem is not p-values or dichotomous thinking. The problem is that nobody tests the risky hypothesis that an effect size has at least a minimum size, and nobody decides in favor of the null-hypothesis when the data show that the population effect size is not exactly zero but practically meaningless (e.g., experimental ego-depletion effects are less than 1/10th of a standard deviation). Even specifying H0 as r < .05 or d < .01 would lower the discovery rates and increase the false discovery risk, while increasing the value of a statistically significant result.

Cohen’s clear distinction between the null-hypothesis and the nil-hypothesis made it clear that nil-hypothesis testing is a ritual with little scientific value, while null-hypothesis testing is needed to advance psychological science. The past decade has been a distraction by suggesting that nil-hypothesis testing is meaningful, but only if open science practices are used to prevent false positive results. However, open science practices do not change the fundamental problem of nil-hypothesis testing that Cohen and others identified more than two decades ago. It is often said that science is self-correcting, but psychologists have not corrected the way they formulate their hypotheses. If psychology wants to be a science, psychologists need to specify hypotheses that are worthy of empirical falsification. I am getting too old and cynical (much like my hero Cohen in the 1990s) to believe in change in my lifetime, but I can write this message in a bottle and hope that one day a new generation may find it and do something with it.

Open Science: Inside Peer-Review at PSPB

We submitted a manuscript to PSPB showing problems with the validity of the race IAT as a measure of African Americans’ unconscious attitudes (Schimmack & Howard, 2020). After waiting patiently for three months, we received the following decision letter from the acting editor, Dr. Corinne Moss-Racusin, at Personality and Social Psychology Bulletin. She assures us that she independently read our manuscript carefully – twice; once before and once after reading the reviews. This is admirable. Yet it is surprising that her independent reading of our manuscript places her in strong agreement with the reviewers. Somebody with less research experience might feel devastated by the independent evaluation by three experts that our work is “of low quality.” Fortunately, it is possible to evaluate the contribution of our manuscript from another independent perspective, namely the strength of the science.

The key claim of our ms. is simple. John Jost, Brian Nosek, and Mahzarin Banaji wrote a highly cited article that contained the claim that a large percentage of members of disadvantaged groups have an implicit preference for the out-group. As recently as 2019, Jost repeated this claim and used the term self-hatred to refer to implicit preferences for the in-group (Jost, 2019).

We expressed our doubt about this claim when the disadvantaged group is African Americans. Our main concern was that any claims about African Americans’ implicit preferences require a valid measure of African Americans’ preferences. The claim that a large number of African Americans have an implicit preference for the White outgroup rests entirely on results obtained with the Implicit Association Test (Jost, Nosek, & Banaji, 2004). However, since the 2004 publication, the validity of the race IAT as a measure of implicit preferences has been questioned in numerous publications, including my recent demonstration that implicit and explicit measures of prejudice lack discriminant validity (Schimmack, 2021). Even the authors of the IAT no longer support the claim that the race IAT is a measure of some implicit, hidden attitudes (Greenwald & Banaji, 2017). Aside from revisiting Jost et al.’s (2004) findings in light of doubts about the race IAT, we also conducted the first attempt at validating the race IAT for Black participants. Apparently, reading the article twice did not help the action editor of PSPB to notice this new empirical contribution, even though it is highlighted in Figure 2. The key finding here is that we were able to identify an in-group preference factor because several explicit and implicit measures showed convergent validity (factor ig). For example, the evaluative priming task showed some validity with a factor loading of .42 in the Black sample. However, the race IAT failed to show any relationship with the in-group factor (p > .05). It was also unrelated to the out-group factor. Thus, the race IAT lacks convergent validity as a measure of in-group and out-group preferences among African Americans in this sample. Neither the two reviewers nor the acting editor challenged this finding. They do not even comment on it. Instead, they proclaim that this research is of low quality. I beg to differ. Based on any sensible understanding of the scientific method, it is unscientific to make claims about African Americans’ preferences based on a measure that has not been validated. It is even more unscientific to double down on a false claim when evidence is presented that the measure lacks validity.
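
The measurement logic behind this convergent-validity argument is easy to spell out: in a single-factor model with standardized variables, the model-implied correlation between two indicators is the product of their loadings. The sketch below uses the .42 loading reported above; the other loadings are hypothetical stand-ins for illustration only.

```python
# Standardized loadings on the Black in-group preference factor.
loadings = {
    "explicit_preference": 0.70,   # hypothetical value, for illustration
    "evaluative_priming": 0.42,    # loading reported in the text
    "race_iat": 0.05,              # stand-in for a loading that is not significantly > 0
}

for indicator in ("evaluative_priming", "race_iat"):
    implied_r = loadings["explicit_preference"] * loadings[indicator]
    print(f"implied r(explicit_preference, {indicator}) = {implied_r:.2f}")
```

A loading near zero therefore implies that the race IAT shares essentially no variance with the indicators that do converge, which is what a lack of convergent validity means here.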

Of course, one can question whether PSPB should publish this finding. After all, PSPB prides itself on being the flagship journal of the Society for Personality and Social Psychology (Robinson et al., 2021). Maybe valid measurement of African Americans’ attitudes is not relevant enough to meet the high standards of a 20% acceptance rate. However, Robinson et al. (2021) launched a diversity initiative in response to awareness that psychology has a diversity problem.

Maybe it will take some time before PSPB can find some associate editors to handle manuscripts that address diversity issues and are concerned with the well-being of African Americans. Meanwhile, we are going to find another outlet to publish our critique of Jost and colleagues’ unscientific claim that many African Americans hold negative views of their in-group that they are not aware of and that can only be revealed by their scores on the race IAT.

Editorial Decision Letter from PSPB

Re: “The race Implicit Association Test is Biased: Most African Americans Have Positive Attitudes towards their In-Group” (MS # PSPB-21-365)

Dear Dr. Schimmack:

Thank you for submitting your manuscript for consideration to Personality and Social Psychology Bulletin. I would like to apologize for the slight delay in getting this decision out to you. Both of my very young children have been home with me for the past month, due to Covid exposures at their schools. As their primary caregiver, this has created considerable difficulties. I appreciate your understanding as we all work to navigate these difficult and unprecedented times.

I have now obtained evaluations of the paper from two experts who are well-qualified to review work in this area.  Furthermore, I read your paper carefully and independently, both before and after looking at the reviews.

I found the topic of your work to be important and timely—indeed, I read the current paper with great interest. Disentangling in-group and out-group racial biases, among both White and Black participants (within the broader context of exploring System Justification Theory) is a compelling goal. Further, I strongly agree with you that exploring whether Black participants’ in-group attitudes have been systematically misrepresented by the (majority White) scientific community is of critical importance.

Unfortunately, as you will see, both reviewers have significant, well-articulated concerns that prevent them from supporting publication of the manuscript. For example, reviewer 1 stated that “Overall, I found this article to be of low quality. It argues against an argument that researchers haven’t made and landed on conclusions that their data doesn’t support.” Further, reviewer 2 (whose review is appropriately signed) wrote clearly that, “The purpose of this submission, it seems to me, is not to illuminate anything, really, and indeed very little, if anything, is illuminated. The purpose of the paper, it seems, is to create the appearance of something scandalous and awful and perhaps even racist in the research literature when, in fact, the substantive results obtained here are very similar to what has been found before. And if the authors really want to declare that the race-based IAT is a completely useless measure, they have a lot more work to do than re-analyzing previously published data from one relatively small study.”

See Reviewer 2’s comments and my response here

My own reading of your paper places me in strong agreement with the reviewer’s evaluations. I am sorry to report that I will not be able to accept your paper for publication in PSPB.

The reviewers’ comments are, in my several years of experience as an editor, unusually thorough and detailed. Thus, I will not reiterate them here.  Nevertheless, issues of primary concern involved both conceptual and empirical aspects of the manuscript. Although some of these issues might be addressed, to some degree, with some considerable re-thinking and re-writing, many cannot be addressed without more data and theoretical overhaul.

I was struck by the degree to which claims appear to stray quite far from both the published literature and the data at hand. As just one example, the section on “African American’s Resilience in a Culture of Oppression” (pp. 5-6) cites no published work whatsoever. Rather, you note that your skepticism regarding key components of SJT is based on “the lived experience of the second author,” which you then summarize. While individual case studies such as this can certainly be compelling, there are clear questions pertaining to generalizability and scientific merit, and the inability to independently validate or confirm this anecdotal evidence. While you do briefly acknowledge this, you proceed to make broad claims—such as “No one in her family or among her Black friends showed signs that they preferred to be White or like White people more than Black people. In small towns, the lives of Black and White people are more similar than in big cities. Therefore, the White out-group was not all that different from the Black in-group,” again without citing any evidence. I found it problematic to ground these bold claims and critiques largely in anecdote. Further, this raises serious concerns—as reviewer 2 articulates in some detail—that the current work may distort the current state of the science by exaggerating or mischaracterizing the nature of existing claims.

Let me say this clearly: I am strongly in favor of work that attempts to refine existing theoretical perspectives, and/or critique established methods, measures, and paradigms. I am not an IAT “purist” by any stretch, nor has my own recent work consistently included implicit measures. Indeed, as noted above, I read the current work with great interest and openness. Unfortunately, like both reviewers, I cannot support its publication in the current form.

I would sincerely encourage you to consider whether the future of this line of work could involve 1. Additional experiments, 2. Larger and more diverse samples, 3. True and transparent collaboration (whether “adversarial” or not) with colleagues from different ideological/empirical perspectives, and 4. Ensuring that claims align much more closely to what is narrowly warranted by the data at hand. Unfortunately, as it stands, the potential contributions of this work appear to be far overshadowed by its problematic elements.

I understand that you will likely be disappointed by my decision, but I urge you to pay careful attention to the reviewers’ constructive comments, as they may help you revise this manuscript or design further research.  Please understand that my decision was rendered with the recognition that the page limitations of the journal dictate that only a small percentage of submitted manuscripts can be accepted.  PSPB receives more than 700 submissions per year, but only publishes approximately 125 papers each year.  Papers without major flaws are often not accepted by PSPB because the magnitude of the contribution is not sufficient to warrant publication.  With careful revision, I think this paper might be appropriate for a more specialized journal, and I wish you success in finding an appropriate outlet for your work.

I am sorry that I cannot provide a more favorable response to your submission.  However, I do hope that you will again consider PSPB as your research progresses.

P.S. I asked the acting editor to clarify her comments and her views about the validity of the race IAT as a measure of African Americans’ unconscious preferences. They declined to comment further.

Inside Anonymous Peer Review

After a desk-rejection from JPSP, my co-author and I submitted our ms. to PSPB (see blog https://replicationindex.com/2021/07/28/the-race-implicit-association-test-is-biased/). After several months, we received the expected rejection. But it was not all in vain. We received a detailed review that shows how little social psychologists really care about African Americans even when they claim to study racism and discrimination.

As peer reviews are considered copyrighted material belonging to the reviewer, I cannot share the review in full. Rather, I will highlight important sections that show how little attention authors, writing with the authority of an expert reviewer, pay to inconvenient scientific criticism of their work.

Here is the key issue. Our paper provides new evidence that the race IAT is an invalid measure of African Americans’ attitudes towards their own group and the White out-group. This new evidence is based on a reanalysis of the data that were used by Bar-Anan and Nosek (2014) to claim that the race IAT is the most valid measure to study African Americans’ implicit attitudes. Here is what the reviewer had to say about this.

(6) It has been a while since I read the Bar-Anan and Nosek (2014) article, but my memory for it is incompatible with the claim that those authors were foolish enough to simply assume that the most valid implicit measures was the one that produced the biggest difference between Whites and Blacks in terms of in-group bias, as the present authors claim (pp. 7-8).

Would you kill Dumbledore if he asked you to?

So, the reviewer relies on his foggy memory to question our claim instead of retrieving a pdf file and checking for himself. New York University should be proud of this display of scholarship. I hope Jost made sure to get his Publons credit. Here is the relevant section from Bar-Anan and Nosek (2014, p. 675; https://link.springer.com/article/10.3758/s13428-013-0410-6).

A lazy recollection is used to dismiss the results of a new statistical analysis. This is how closed, confidential, back-room peer review works, which means it does not work. It does not serve the purpose of presenting all scientific arguments in the open and letting data decide between opposing views. Pre-publication peer review is not a reliable and credible mechanism to advance science. For this reason, I will publish as much as possible in open peer-review journals (e.g., Meta-Psychology). Open science without an open exchange of ideas and conflicts is not open, trustworthy, or credible.

Psychology Intelligence Agency

I always wanted to be James Bond, but being 55 now it is clear that I will never get a license to kill or work for a government intelligence agency. However, the world has changed and there are other ways to spy on dirty secrets of evil villains.

I have started to focus on the world of psychological science, which I know fairly well because I was a psychological scientist for many years. During my time as a psychologist, I learned about many of the dirty tricks that psychologists use to publish articles to further their careers without advancing understanding of human behavior, thoughts, and feelings.

However, so far the general public, government agencies, and the funding agencies that hand out taxpayers’ money to psychological scientists have not bothered to monitor the practices of these scientists. They still believe that psychological scientists can control themselves (e.g., peer review). As a result, bad practices persist because the incentives favor behaviors that lead to the publication of many articles even if these articles make no real contribution to science. I therefore decided to create my own Psychological Intelligence Agency (PIA). Of course, I cannot give myself a license to kill, and I have no legal authority to enforce laws that do not exist. However, I can gather intelligence (information) and share it with the general public. This is less James Bond and more like the CIA, which also shares some of its intelligence with the public (CIA World Factbook), or the website Retraction Watch, which keeps track of article retractions.

Some of the projects that I have started are:

Replicability Rankings of Psychology Journals
Keeping track of the power (expected discovery rate, expected replication rate) and the false discovery risk of test results published in over 100 psychology journals from 2010 to 2020.

Personalized Criteria of Statistical Significance
It is problematic to use the standard criterion of significance (alpha = .05) when this criterion leads to few discoveries because researchers test many false hypotheses or test true hypotheses with low power. When discovery rates are low, alpha should be set to a lower value (e.g., .01, .005, .001). Here I used estimates of authors’ discovery rate to recommend an appropriate alpha level to interpret their results.
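A minimal sketch of the underlying calculation (my own illustration, using Soric's bound rather than the full z-curve machinery): invert the false-discovery-risk formula to find the largest alpha that keeps the risk at or below 5% for a given discovery rate.

```python
def soric_fdr(edr, alpha):
    return (1 / edr - 1) * (alpha / (1 - alpha))

def max_alpha(edr, target_fdr=0.05):
    """Largest alpha for which Soric's false discovery risk stays at or below the target."""
    ratio = target_fdr / (1 / edr - 1)
    return ratio / (1 + ratio)

for edr in (0.60, 0.30, 0.10):
    a = max_alpha(edr)
    print(f"EDR = {edr:.0%}: alpha <= {a:.3f} (check: FDR = {soric_fdr(edr, a):.1%})")
```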

Quantitative Book Reviews
Popular psychology books written by psychological scientists (e.g., Nobel Laureate Daniel Kahneman) reach a wide audience and are assumed to be based on solid scientific evidence. Using statistical examinations of the sources cited in these books, I provide information about the robustness of the scientific evidence to the general public. (see also “Before you know it“)

Citation Watch
Science is supposed to be self-correcting. However, psychological scientists often cite outdated references that fit their theory without citing newer evidence that their claims may be false (a practice known as cherry picking citations). Citation Watch reveals these bad practices by linking articles with misleading citations to articles that question the claims supported by the cherry-picked citations.

Whether all of this intelligence gathering will have a positive effect depends on how many people actually care about the scientific integrity of psychological science and the credibility of empirical claims. Fortunately, some psychologists are willing to learn from past mistakes and are improving their research practices (Bill von Hippel).
