Dr. Ulrich Schimmack Blogs about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY

In empirical studies with sampling error, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication study that uses the same sample size and significance criterion (Schimmack, 2017).

See Reference List at the end for peer-reviewed publications.

Mission Statement

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.

To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0, Brunner & Schimmack, 2020; Version 2.0, Bartos & Schimmack, 2021).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science. 

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I have also started to provide information about the replicability of individual researchers, along with guidelines on how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in the story of how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1–22.
https://doi.org/10.15626/MP.2018.874

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.
http://dx.doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. 
https://doi.org/10.1037/cap0000246

Estimating the False Discovery Risk of Psychology Science

Abstract

Since 2011, the credibility of psychological science has been in doubt. A major concern is that questionable research practices could have produced many false positive results, and it has been suggested that most published results are false. Here we present an empirical estimate of the false discovery risk using a z-curve analysis of randomly selected p-values from a broad range of journals that span most disciplines in psychology. The results suggest that no more than a quarter of published results could be false positives. We also show that the false positive risk can be reduced to less than 5% by using alpha = .01 as the criterion for statistical significance. This remedy can restore confidence in the direction of published effects. However, published effect sizes cannot be trusted because the z-curve analysis shows clear evidence of selection for significance that inflates effect size estimates.

Introduction

Several events in the early 2010s led to a credibility crisis in psychology. As journals selectively publish only statistically significant results, statistical significance loses its, well, significance. Every published focal hypothesis will be statistically significant, and it is unclear which of these results are true positives and which are false positives.

A key article that contributed to the credibility crisis was Simmons, Nelson, and Simonsohn’s (2011) article “False Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.”

The title made the bold statement that it is easy to obtain statistically significant results even when the null-hypothesis is true. This led to concerns that many, if not most, published results are indeed false positive results. Many meta-psychological articles, including my own 2012 article, cited Simmons et al. (2011) to suggest that there is a high risk or even a high rate of false positive results in the psychological literature.

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

The Appendix lists citations from influential meta-psychological articles that imply a high false positive risk in the psychological literature. Only one article suggested that fears about high false positive rates may be unwarranted (Stroebe & Strack, 2014). In contrast, other articles have suggested that false positive rates might be as high as 50% or more (Szucs & Ioannidis, 2017).

There have been two noteworthy attempts at estimating the false discovery rate in psychology. Szucs and Ioannidis (2017) automatically extracted p-values from five psychology journals and estimated the average power of extracted t-tests. They then used this power estimate in combination with the assumption that psychologists discover one true, non-zero, effect for every 13 true null-hypotheses to suggest that the false discovery rate in psychology exceeds 50%. The problem with this estimate is that it relies on the questionable assumption that psychologists test a very small percentage of true hypotheses.

The other article tried to estimate the false positive rate based on 70 of the 100 studies that were replicated in the Open Science Collaboration project (Open Science Collaboration, 2015). The statistical model estimated that psychologists test 93 true null-hypotheses for every 7 true effects (true positives), and that true effects are tested with 75% power (Johnson et al., 2017). This yields a false positive rate of about 50%. The main problem with this study is the reliance on a small, unrepresentative sample of studies that focused heavily on experimental social psychology, a field that triggered concerns about the credibility of psychology in general (Schimmack, 2020). Another problem is that point estimates based on a small sample are unreliable.

To provide new and better information about the false positive risk in psychology, we conducted a new investigation that addresses three limitations of the previous studies. First, we used hand-coding of focal hypothesis tests, rather than automatic extraction of all test-statistics. Second, we sampled from a broad range of journals that cover all areas of psychology rather than focusing narrowly on experimental psychology. Third, we used a validated method to estimate the false discovery risk based on an estimate of the expected discovery rate (Bartos & Schimmack, 2021). In short, the false discovery risk decreases as a monotonic function of the number of discoveries (i.e., p-values below .05) (Soric, 1989).
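The link between the discovery rate and the maximum false discovery risk can be made concrete with a few lines of code. The sketch below uses the form in which Soric's (1989) bound is usually written, FDR_max = (1/DR - 1) * alpha / (1 - alpha); the function name and the clipping at 100% are choices made for this illustration and are not taken from the z-curve software.

```python
def soric_fdr_bound(discovery_rate: float, alpha: float = 0.05) -> float:
    """Maximum false discovery rate implied by a discovery rate (Soric, 1989).

    discovery_rate: proportion of all tests that are significant (0 < DR <= 1).
    alpha: significance criterion used to define a discovery.
    The result is clipped at 1.0 because a rate cannot exceed 100%
    (the clipping is an assumption of this sketch).
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    bound = (1 / discovery_rate - 1) * (alpha / (1 - alpha))
    return min(bound, 1.0)


# The bound falls monotonically as the discovery rate rises.
for dr in (0.05, 0.10, 0.20, 0.50):
    print(f"discovery rate {dr:.0%} -> maximum FDR {soric_fdr_bound(dr):.0%}")
```

With alpha = .05, a discovery rate of 5% implies that all discoveries could be false positives, whereas a discovery rate of 20% limits the false discovery risk to roughly one in five.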

Z-curve relies on the observation that false positives and true positives produce different distributions of p-values. To fit a model to distributions of significant p-values, z-curve transforms p-values into absolute z-scores. We illustrate z-curve with two simulation studies. The first simulation is based on Simmons et al.’s (2011) scenario in which the combination of four questionable research practices inflates the false positive risk from 5% to 60%. In our simulation, we assumed an equal number of true null-hypotheses (effect size d = 0) and true hypotheses with small to moderate effect sizes (d = .2 to .5). The use of questionable research practices also increases the chances of getting a significant result for true hypotheses. In our simulation, the probability of a significant result was 58% when H0 was true and 93% when H1 was true. Given the 1:1 ratio of H0 and H1 that were tested, this yields a false discovery rate of 39%.
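The arithmetic behind these false discovery rates is simple enough to check by hand. The following sketch is only an illustration; the function and argument names are made up for this example. It computes the share of significant results that come from true null-hypotheses, given the proportion of tested null-hypotheses and the probability of a significant result under H0 and under H1.

```python
def false_discovery_rate(prop_h0: float, sig_rate_h0: float, sig_rate_h1: float) -> float:
    """Share of significant results that stem from true null-hypotheses.

    prop_h0: proportion of tested hypotheses for which H0 is true.
    sig_rate_h0: probability of a significant result when H0 is true.
    sig_rate_h1: probability of a significant result when H1 is true.
    """
    false_positives = prop_h0 * sig_rate_h0
    true_positives = (1 - prop_h0) * sig_rate_h1
    return false_positives / (false_positives + true_positives)


# P-hacking simulation: 1:1 ratio of H0 and H1, rounded rates from the text.
print(round(false_discovery_rate(0.50, 0.58, 0.93), 3))

# Johnson et al. scenario: 93 true H0 per 7 true H1, alpha = .05, 75% power.
print(round(false_discovery_rate(0.93, 0.05, 0.75), 3))
```

With the rounded inputs, the first call gives about 38%, which is close to the 39% reported above; the small gap presumably reflects rounding of the simulated rates. The second call gives about 47%, in line with the "about 50%" figure for the Johnson et al. scenario.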

Figure 1 shows that questionable research practices produce a steeply declining z-curve. Based on this shape, z-curve estimates a discovery rate of 5%, with a 95%CI ranging from 5% to 10%. This translates into estimates of the false discovery risk of 100% with a 95%CI ranging from 46% to 100% (Soric, 1989). The reason why z-curve provides a conservative estimate of the false discovery risk is that p-hacking changes the shape of the distribution in a way that produces even more z-values just above 1.96 than mere selection for significance would produce. In other words, p-hacking destroys evidential value when true hypotheses are being tested. It is not necessary to simulate scenarios in which even more true null-hypotheses are being tested because this would make the z-curve even steeper. Thus, Figure 1 provides a prediction for our z-curve analyses based on actual data, if psychologists heavily rely on Simmons et al.’s recipe to produce significant results.

Figure 2 is based on a simulation of Johnson et al.’s (2017) scenario with a 9% discovery rate (9 true hypotheses for every 100 hypothesis tests), a false discovery rate of 50%, and power to detect true effects of 75%. Johnson et al. did not assume or model p-hacking.

The z-curve for this scenario also shows a steep decline that can be attributed to the high percentage of false positive results. However, there is also a notable tail with z-values greater than 3 that reflects the influence of true hypotheses with adequate power. In this scenario, the expected discovery rate is higher, with a 95%CI ranging from 7% to 20%. This translates into a 95%CI for the false discovery risk ranging from 21% to 71% (Soric, 1989). This interval contains the true value of 50%, although the point estimate, 34%, underestimates the true value. Thus, we recommend using the upper limit of the 95%CI as an estimate of the maximum false discovery rate that is consistent with the data.

We now turn to real data. Figure 3 shows a z-curve analysis of Kühberger, Fritz, and Scherndl’s (2014) data. The authors conducted an audit of psychological research by randomly sampling 1,000 English-language articles published in 2007 that were listed in PsycINFO. This audit produced 344 significant p-values that could be subjected to a z-curve analysis. The results differ notably from the simulation results. The expected discovery rate is higher and implies a much smaller false discovery risk of only 9%. However, due to the small set of studies, the confidence interval is wide and allows for nearly 50% false positive results.

To produce a larger set of test-statistics, my students and I have hand-coded over 1,000 randomly selected articles from a broad range of journals (Schimmack, 2021). These data were combined with Motyl et al.’s (2017) coding of social psychology journals. The time period spans the years 2008 to 2014, with a focus on the years 2009 and 2010. This dataset produced 1,715 significant p-values. The estimated false discovery risk is similar to the estimate for Kühberger et al.’s (2014) studies. Although the point estimate for the false discovery risk is a bit higher, 12%, the upper bound of the 95%CI is lower because the confidence interval is tighter.

Given the similarity of the results, we combined the two datasets to obtain an even more precise estimate of the false discovery risk based on 2,059 significant p-values. However, the upper limit of the 95%CI decreased only slightly from 30% to 26%.

The most important conclusion from these findings is that concerns about false positive results rest on exaggerated assumptions about their prevalence in psychology journals. The present results suggest that at most a quarter of published results are false positives and that actual z-curves are very different from those implied by the influential simulation studies of Simmons et al. (2011). Our empirical results show no evidence that massive p-hacking is a common practice.

However, a false positive rate of 25% is still unacceptably high. Fortunately, there is an easy solution to this problem because the false discovery rate depends on the significance threshold. Based on their pessimistic estimates, Johnson et al. (2017) suggested lowering alpha to .005 or even .001. However, these stringent criteria would render most published results statistically non-significant. We suggest lowering alpha to .01. Figure 6 shows the rationale for this recommendation by fitting z-curve with alpha = .01 (i.e., the red vertical line that represents the significance criterion is moved from 1.96 to 2.58).
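The shift of the significance line from 1.96 to 2.58 follows directly from the transformation of two-sided p-values into absolute z-scores. The sketch below uses only Python's standard library; the function name is made up for this illustration and is not part of the z-curve software.

```python
from statistics import NormalDist


def p_to_abs_z(p: float) -> float:
    """Convert a two-sided p-value into the corresponding absolute z-score."""
    if not 0 < p < 1:
        raise ValueError("p must be in (0, 1)")
    # A two-sided p-value leaves p/2 in each tail of the standard normal.
    return NormalDist().inv_cdf(1 - p / 2)


# Significance criteria discussed in the text and their z-score thresholds.
for alpha in (0.05, 0.01, 0.005, 0.001):
    print(f"alpha = {alpha}: |z| > {p_to_abs_z(alpha):.2f}")
```

The same function can be applied to any reported p-value to see where it would fall relative to the red line in a z-curve plot.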

Lowering alpha to .01 lowers the percentage of significant results from 83% (not counting marginally significant, p < .1, results) to 53%. Thus, the expected discovery rate decreases, but the more stringent criterion for significance lowers the false discovery risk to 4%, and even the upper limit of the 95%CI is just 4%.

It is likely that discovery rates vary across journals and disciplines (Schimmack, 2021). In the future, it may be possible to make more specific recommendations for different disciplines or journals based on their discovery rates. Journals that publish riskier hypothesis tests or studies with modest power would need a more stringent significance criterion to maintain an acceptable false discovery risk.

An alpha level of .01 is also supported by Simmons et al.’s (2011) simulation studies of p-hacking. Massive p-hacking that inflates the false positive risk from 5% to 61% produces only 22% false positives with alpha = .01. Milder forms of p-hacking that inflate the false positive risk produce only an 8% probability of obtaining a p-value below .01. Ideally, open science practices like pre-registration will curb the use of questionable practices in the future. Increasing sample sizes will also help to lower the false positive risk. A z-curve analysis of new studies can be used to estimate the current false discovery risk and may suggest that even the traditional alpha level of .05 is able to maintain a false discovery risk below 5%.

While the present results may be considered good news relative to the scenario that most published results cannot be trusted, the results do not change the fact that some areas of psychology have a replication crisis (Open Science Collaboration, 2015). The z-curve results show clear evidence of selection for significance, which leads to inflated effect size estimates. Studies suggest that effect sizes are often inflated by more than 100% (Open Science Collaboration, 2015). Thus, published effect size estimates cannot be trusted even if p-values below .01 show the correct sign of an effect. The present results also imply that effect size meta-analyses that did not correct for publication bias produce inflated effect size estimates. For these reasons, many meta-analyses have to be reexamined with statistical tools that correct for publication bias.

Appendix

“Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals” (Button et al., 2013; 3,316 citations).

“In a theoretical analysis, Ioannidis estimated that publishing and analytic practices make it likely that more than half of research results are false and therefore irreproducible” (Open Science Collaboration, 2015, aac4716-1)

“There is increasing concern that most current published research findings are false. (Ioannidis, 2005, abstract)” (Cumming, 2014, p. 7, 1,633 citations).

“In a recent article, Simmons, Nelson, and Simonsohn (2011) showed how, due to the misuse of statistical tools, significant results could easily turn out to be false positives (i.e., effects considered significant whereas the null hypothesis is actually true).” (Leys et al., 2013, p. 765, 1,406 citations)

“During data analysis it can be difficult for researchers to recognize P-hacking or data dredging because confirmation and hindsight biases can encourage the acceptance of outcomes that fit expectations or desires as appropriate, and the rejection of outcomes that do not as the result of suboptimal designs or analyses. Hypotheses may emerge that fit the data and are then reported without indication or recognition of their post hoc origin. This, unfortunately, is not scientific discovery, but self-deception. Uncontrolled, it can dramatically increase the false discovery rate” (Munafò et al., 2017, p. 2, 1,010 citations)

“Just how dramatic these effects can be was demonstrated by Simmons, Nelson, and Simonsohn (2011) in a series of experiments and simulations that showed how greatly QRPs increase the likelihood of finding support for a false hypothesis.” (John et al., 2012, p. 524, 877 citations).

“Simonsohn’s simulations have shown that changes in a few data-analysis decisions can increase the false-positive rate in a single study to 60%” (Nuzzo, 2014, 799 citations).

“the publication of an important article in Psychological Science showing how easily researchers can, in the absence of any real effects, nonetheless obtain statistically significant differences through various questionable research practices (QRPs) such as exploring multiple dependent variables or covariates and only reporting these when they yield significant results (Simmons, Nelson, & Simonsohn, 2011)” (Pashler & Wagenmakers, 2012, p. 528, 736 citations)

“Even seemingly conservative levels of p-hacking make it easy for researchers to find statistically significant support for nonexistent effects. Indeed, p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables (Simmons et al., 2011).” (Simonsohn, Nelson, & Simmons, 2014, p. 534, 656 citations)

“Recent years have seen intense interest in the reproducibility of scientific results and the degree to which some problematic, but common, research practices may be responsible for high rates of false findings in the scientific literature, particularly within psychology but also more generally” (Poldrack et al., 2017, p. 115, 475 citations)

“especially in an environment in which multiple comparisons or researcher dfs (Simmons, Nelson, & Simonsohn, 2011) make it easy for researchers to find large and statistically significant effects that could arise from noise alone” (Gelman & Carlin,

“In an influential recent study, Simmons and colleagues demonstrated that even a moderate amount of flexibility in analysis choice—for example, selecting from among two DVs or optionally including covariates in a regression analysis— could easily produce false-positive rates in excess of 60%, a figure they convincingly argue is probably a conservative estimate (Simmons et al., 2011).” (Yarkoni & Westfall, 2017, p. 1103, 457 citations)

“In the face of human biases and the vested interest of the experimenter, such freedom of analysis provides access to a Pandora’s box of tricks that can be used to achieve any desired result (e.g., John et al., 2012; Simmons, Nelson, & Simonsohn, 2011)” (Wagenmakers et al., 2012, p. 633, 425 citations)

“Simmons et al. (2011) illustrated how easy it is to inflate Type I error rates when researchers employ hidden degrees of freedom in their analyses and design of studies (e.g., selecting the most desirable outcomes, letting the sample size depend on results of significance tests).” (Bakker et al., 2012, p. 545, 394 citations).

“Psychologists have recently become increasingly concerned about the likely overabundance of false positive results in the scientific literature. For example, Simmons, Nelson, and Simonsohn (2011) state that “In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not” (p. 1359)” (Maxwell, Lau, & Howard, 2015, p. 487,

“Moreover, the highest impact journals famously tend to favor highly surprising results; this makes it easy to see how the proportion of false positive findings could be even higher in such journals.” (Pashler & Harris, 2012, p. 532, 373 citations)

“There is increasing concern that many published results are false positives [1,2] (but see [3]).” (Head et al., 2015, p. 1, 356 citations)

“Quantifying p-hacking is important because publication of false positives hinders scientific progress” (Head et al., 2015, p. 2, 356 citations).

“To be sure, methodological discussions are important for any discipline, and both fraud and dubious research procedures are damaging to the image of any field and potentially undermine confidence in the validity of social psychological research findings. Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology.” (Stroebe & Strack, 2014, p. 60, 291 citations)

“Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature” (Szucs & Ioannidis, 2017, p. 1, 269 citations)

“Notably, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias” (Szucs & Ioannidis, 2017, p. 12, 269 citations)

“In all, the combination of low power, selective reporting, and other biases and errors that have been well documented suggest that high FRP can be expected in cognitive neuroscience and psychology. For example, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias.” (Szucs & Ioannidis, 2017, p. 15, 269 citations)

“Many prominent researchers believe that as much as half of the scientific literature—not only in medicine, but also in psychology and other fields—may be wrong [11,13–15]” (Smaldino & McElreath, 2016, p. 2, 251 citations).

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

“A more recent article compellingly demonstrated how flexibility in data collection, analysis, and reporting can dramatically increase false-positive rates (Simmons, Nelson, & Simonsohn, 2011).” (Dick et al., 2015, p. 43, 208 citations)

“In 2011, we wrote “False-Positive Psychology” (Simmons et al. 2011), an article reporting the surprisingly severe consequences of selectively reporting data and analyses, a practice that we later called p-hacking. In that article, we showed that conducting multiple analyses on the same data set and then reporting only the one(s) that obtained statistical significance (e.g., analyzing multiple measures but reporting only one) can dramatically increase the likelihood of publishing a false-positive finding. Independently and nearly simultaneously, John et al. (2012) documented that a large fraction of psychological researchers admitted engaging in precisely the forms of p-hacking that we had considered. Identifying these realities—that researchers engage in p-hacking and that p-hacking makes it trivially easy to accumulate significant evidence for a false hypothesis—opened psychologists’ eyes to the fact that many published findings, and even whole literatures, could be false positive.” (Nelson, Simmons, & Simonsohn, 2018, 204 citations).

“As Simmons et al. (2011) concluded—reflecting broadly on the state of the discipline—“it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis” (p. 1359)” (Earp & Trafimow, 2015, p. 4, 200 citations)

“The second, related set of events was the publication of articles by a series of authors (Ioannidis 2005, Kerr 1998, Simmons et al. 2011, Vul et al. 2009) criticizing questionable research practices (QRPs) that result in grossly inflated false positive error rates in the psychological literature” (Shrout & Rodgers, 2018, p. 489, 195 citations).

“Let us add a new dimension, which was brought up in a seminal publication of Simmons, Nelson & Simonsohn (2011). They stated that researchers actually have so much flexibility in deciding how to analyse their data that this flexibility allows them to coax statistically significant results from nearly any data set” (Forstmeier, Wagenmakers, & Parker, 2017, p. 1945, 173 citations)

“Publication bias (Ioannidis, 2005) and flexibility during data analyses (Simmons, Nelson, & Simonsohn, 2011) create a situation in which false positives are easy to publish, whereas contradictory null findings do not reach scientific journals (but see Nosek & Lakens, in press)” (Lakens & Evers, 2014, p. 278, 139 citations)

“Recent reports hold that allegedly common research practices allow psychologists to support just about any conclusion (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011).” (Koole & Lakens, 2012, p. 608, 139 citations)

“Researchers then may be tempted to write up and concoct papers around the significant results and send them to journals for publication. This outcome selection seems to be widespread practice in psychology [12], which implies a lot of false positive results in the literature and a massive overestimation of ES, especially in meta-analyses” (

“Researcher df, or researchers’ behavior directed at obtaining statistically significant results (Simonsohn, Nelson, & Simmons, 2013), which is also known as p-hacking or questionable research practices in the context of null hypothesis significance testing (e.g., O’Boyle, Banks, & Gonzalez-Mulé, 2014), results in a higher frequency of studies with false positives (Simmons et al., 2011) and inflates genuine effects (Bakker et al., 2012).” (van Assen, van Aert, & Wicherts, p. 294, 133 citations)

“The scientific community has witnessed growing concern about the high rate of false positives and unreliable results within the psychological literature, but the harmful impact of false negatives has been largely ignored” (Vadillo, Konstantinidis, & Shanks, p. 87, 131 citations)

“Much of the debate has concerned habits (such as “p-hacking” and the file-drawer effect) which can boost the prevalence of false positives in the published literature (Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014; Simmons, Nelson, & Simonsohn, 2011).” (Vadillo, Konstantinidis, & Shanks, p. 87, 131 citations)

“Simmons, Nelson, and Simonsohn (2011) showed that researchers without scruples can nearly always find a p < .05 in a data set if they set their minds to it.” (Crandall & Sherman, 2014, p. 96, 114 citations)

Science is self-correcting: JPSP-PPID is not

With over 7,000 citations at the end of 2021, Ryff and Keyes’s (1995) article is one of the most highly cited articles in the Journal of Personality and Social Psychology. A trend analysis shows that citations are still increasing, with over 800 citations in the past two years.

Most of these citations are references to the use of Ryff’s measure of psychological well-being, and they uncritically accept Ryff’s assertion that her PWB measure is a valid measure of psychological well-being. The abstract implies that the authors provided empirical support for Ryff’s theory of psychological well-being.

Contemporary psychologists contrast Ryff’s psychological well-being (PWB) with Diener’s (1984) subjective well-being (SWB). In an article with over 1,000 citations, Ryff and Keyes (2002) tried to examine how PWB and SWB are empirically related. This attempt resulted in a two-factor model that postulates that SWB and PWB are related, but distinct forms of well-being.

The general acceptance of this model shows that most psychologists lack proper training in the interpretation of structural equation models (Borsboom, 2006), although graphic representations of these models make SEM accessible to readers who are not familiar with matrix algebra. To interpret an SEM model, it is only necessary to know that boxes represent measured variables, ovals represent unmeasured constructs, directed straight arrows represent an assumption that one construct has a causal influence on another construct, and curved bidirectional arrows imply an unmeasured common cause.

Starting from the top, we see that the model implies that an unmeasured common cause produces a strong correlation between two unmeasured variables that are labelled Psychological Well-Being and Subjective Well-Being. These labels imply that the constructs PWB and SWB are represented by unmeasured variables. The direct causal arrows from these unmeasured variables to the measured variables imply that PWB and SWB can be measured because the measured variables reflect the unmeasured variables to some extent. This is called a reflective measurement model (Borsboom et al., 2003). For example, autonomy is a measure of PWB because .38^2 = 14% of the variance in autonomy scores reflects PWB. Of course, this makes autonomy a poor indicator of PWB because the remaining 86% of the variance does not reflect the influence of PWB. This variance in autonomy is caused by other unmeasured influences and is called unique variance, residual variance, or disturbance. It is often omitted from SEM figures because it is assumed that this variance is simply irrelevant measurement error. I added it here because Ryff and users of her measure clearly do not think that 86% of the variance in the autonomy scale is just measurement error. In fact, the scale scores of autonomy are often used as if they are a 100% valid measure of autonomy. The proper interpretation of the model is therefore that autonomy is measured with high validity, but that variation in autonomy is only a poor indicator of psychological well-being.

Examination of the factor loadings (i.e., the numbers next to the arrows from PWB to the six indicators) shows that personal relationships has the highest validity as a measure of PWB, but even for personal relationships, the amount of PWB variance is only .66^2 = 44%.
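The variance percentages quoted for autonomy and personal relationships are just squared loadings. The short sketch below reproduces them, assuming that the loadings in the figure are standardized (which the .38^2 = 14% calculation implies); the function name is made up for this illustration.

```python
def variance_split(loading: float) -> tuple[float, float]:
    """Split an indicator's variance into factor variance and unique variance.

    Assumes a standardized loading, so the indicator's total variance is 1.
    """
    explained = loading ** 2
    return explained, 1 - explained


# Loadings on PWB taken from the text: autonomy = .38, personal relationships = .66.
for name, loading in (("autonomy", 0.38), ("personal relationships", 0.66)):
    explained, unique = variance_split(loading)
    print(f"{name}: {explained:.0%} PWB variance, {unique:.0%} unique variance")
```

Even the best indicator leaves more than half of its variance unexplained by the PWB factor.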

In a manuscript (doc) that was desk-rejected by JPSP, we challenged this widely accepted model of PWB. We argued that the reflective model does not fit Ryff’s own theory of PWB. In a nutshell, Ryff’s theory of PWB is one of many list-theories of well-being (Sumner, 1996). The theory lists a number of attributes that are assumed to be necessary and sufficient for high well-being.

This theory of well-being implies a different measurement model in which arrows point from the measured variables to the construct of PWB. In psychometrics, these models are called formative measurement models. There is nothing unobserved about formative constructs. They are merely a combination of the measured constructs. The simplest way to integrate information about the components of PWB is to average them. If assumptions about importance are added, the construct could be a weighted average. This model is shown in Figure 2.
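A formative construct of this kind can be computed directly from the component scores. The sketch below shows the unweighted and the weighted average; the component names follow Ryff's six dimensions, but the scores and the importance weights are hypothetical values chosen only for illustration.

```python
from typing import Mapping, Optional


def formative_score(components: Mapping[str, float],
                    weights: Optional[Mapping[str, float]] = None) -> float:
    """Combine component scores into a formative construct score.

    Without weights, every component counts equally (simple average).
    With weights, the result is a weighted average.
    """
    if weights is None:
        weights = {name: 1.0 for name in components}
    total_weight = sum(weights[name] for name in components)
    weighted_sum = sum(components[name] * weights[name] for name in components)
    return weighted_sum / total_weight


# Hypothetical scale scores (1-7) for one respondent.
scores = {
    "autonomy": 5.0,
    "environmental mastery": 4.0,
    "personal growth": 6.0,
    "positive relations": 3.0,
    "purpose in life": 5.0,
    "self-acceptance": 4.0,
}
print(formative_score(scores))  # unweighted average

# Hypothetical importance weights that give positive relations double weight.
importance = {name: 1.0 for name in scores}
importance["positive relations"] = 2.0
print(formative_score(scores, importance))
```

Nothing in this computation requires the components to be correlated with each other, which is the key difference from a reflective measurement model.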

The key problem for this model is that it makes no predictions about the pattern of correlations among the measured variables. For example, Ryff’s theory does not postulate whether an increase in autonomy produces an increase in personal growth or a decrease in personal relations. At best, the distinction between PWB and SWB might imply that changes in PWB components are independent of changes in SWB components, but this assumption is highly questionable. For example, some studies suggest that positive relationships improve subjective well-being (Schimmack & Lucas, 2010).

To conclude, JPSP has published two highly cited articles that fitted a reflective measurement model to PWB indicators. In the desk-rejected manuscript, Jason Payne and I presented a new model that is grounded in theories of well-being and that treats PWB dimensions like autonomy and positive relations as possible components of a good life. Our model also clarified the confusion about Diener’s (1984) model of subjective well-being.

Ryff et al.’s (2002) two-factor model of well-being was influenced by Ryan and Deci’s (2001) distinction between two broad traditions in well-being research: “one dealing with happiness (hedonic well-being), and one dealing with human potential (eudaimonic well-being; Ryan & Deci, 2001; see also Waterman, 1993)” (Ryff et al., 2002, p. 1007). We argued that this dichotomy overlooks another important distinction between well-being theories, namely the distinction between subjective and objective theories of well-being (Sumner, 1996). The key difference between objective and subjective theories of well-being is that objective theories aim to specify universal aspects of a good life that are based on philosophical analyses of the good life. In contrast, subjective theories reject the notion that universal criteria of a good life exist and leave it to individuals to create their own evaluation standards of a good life (Cantril, 1965). Unfortunately, Diener’s tripartite model of SWB is difficult to classify because it combines objective and subjective indicators. Whereas life-evaluations like life-satisfaction judgments are clearly subjective indicators, the amount of positive affect and negative affect implies a hedonistic conception of well-being. Diener never resolved this contradiction (Busseri & Sadava, 2011), but his writing made it clear that he stressed subjectivity as an essential component of well-being.

It is therefore incorrect to characterize Diener’s concept of SWB as a hedonic or hedonistic conception of well-being. The key contribution of Diener was to introduce psychologists to subjective conceptions of well-being and to publish the most widely used subjective measure of well-being, namely the Satisfaction with Life Scale. In my opinion, the inclusion of PA and NA in the tripartite model was a mistake because it does not allow individuals to choose what they want to do with their lives. Even Diener himself published articles that suggested positive affect and negative affect are not essential for all people (Suh, Diener, Oishi, & Triandis, 1998). At the very least, it remains an empirical question how important positive affect and negative affect are for subjective life evaluations and whether other aspects of a good life are even more important. At least, this question can be empirically tested by examining how much eudaimonic and hedonic measures of well-being contribute to variation in subjective measures of well-being. This question leads to a model in which life-satisfaction judgments are a criterion variable and the other variables are predictor variables.

The most surprising finding was that environmental mastery was a strong unique predictor and a much stronger predictor than positive affect or negative affect (direct effect, b = .66).

In our model, we also allowed for the possibility that PWB attributes influence subjective well-being by increasing positive affect or decreasing negative affect. The total effect is a very strong relationship, b = .78, with more than 50% of the variance in life-satisfaction being explained by a single PWB dimension, namely environmental mastery.
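The arithmetic behind these numbers can be sketched as follows, assuming that the reported coefficients are standardized and that the total effect is the sum of the direct and indirect paths (the variable names are mine):

```python
direct_effect = 0.66  # direct effect of environmental mastery (from the text)
total_effect = 0.78   # total effect, including paths via PA and NA (from the text)

# The indirect effect is what the paths through PA and NA add to the direct path.
indirect_effect = total_effect - direct_effect

# With standardized variables, the squared total effect is the share of
# variance in life-satisfaction attributable to environmental mastery.
explained = total_effect ** 2

print(f"indirect effect: {indirect_effect:.2f}")
print(f"variance explained: {explained:.0%}")
```

Under these assumptions the indirect contribution through affect is small relative to the direct effect.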

Other noteworthy findings were that none of the other PWB attributes made a positive (direct or indirect) contribution to life-satisfaction judgments. Autonomy even was a negative predictor. The effects of positive affect and negative affect were statistically significant, but small. This suggests that PA and NA are meaningful indicators of subjective well-being because they reflect a good life, but the results provide no evidence for hedonic theories of well-being that suggest positive affect increases well-being no matter how it is elicited.

These results are dramatically different from the published model in JPSP. In that model an unmeasured construct, SWB, causes variation in Environmental Mastery. In our model, environmental mastery is a strong cause of the only subjective indicator of well-being, namely life-satisfaction judgments. Whereas the published model implies that feeling good makes people have environmental mastery, our model suggests that having control over one’s life increases well-being. Call us crazy, but we think the latter model makes more sense.

So, why was our ms. desk rejected without peer-review from experts in well-being research? I post the full decision letter below, but I want to highlight the only comment about our actual work.

A related concern has to do with a noticeable gap between your research question, theoretical framework, and research design. The introduction paints your question in broad strokes only, but my understanding is that you attempt to refine our understanding of the structure of well-being, which could be an important contribution to the literature. However, the introduction does not provide a clear rationale for the alternative model presented. Perhaps even more important, the cross-sectional correlational study of one U.S. sample is not suited to provide strong conclusions about the structure of well-being. At the very least, I would have expected to see model comparison tests to compare the fit of the presented model with those of alternative models. In addition, I would have liked to see a replication in an independent sample as well as more critical tests of the discriminant validity and links between these factors, perhaps in longitudinal data, through the prediction of critical outcomes, or by using behavioral genetic data to establish the genetic and environmental architecture of these factors. Put another way, independent of the validity of the Ryff / Keyes model, the presented theory and data did not convince me that your model is a better presentation of the structure of well-being.

Bleidorn’s comments show that even prominent personality researchers lack basic understanding of psychometrics and construct validation. For example, it is not clear how longitudinal data can provide answers to questions about construct validity. Examining change is of course useful, but without a valid measure of a construct it is not clear what change in scale scores means. Construct validation precedes studies of stability and change. Similarly, it is only relevant to examine nature and nurture questions with a clear phenotype. Bleidorn completely ignores our distinction between hedonic and subjective well-being and the fact that we are the first to examine the relationship between PWB attributes and life-satisfaction.

As psychometricians have pointed out, personality psychologists often ignore measurement questions and are often content with averaged self-report ratings as operationalized constructs that do not require further validation. We think that this blind empiricism is preventing personality psychology from making real progress. It is depressing to see that even the new generation of personality psychologists shows no interest in improving construct validity of foundational constructs. Fortunately, JPSP-PPID publishes only about 50 articles a year and there are other outlets to publish our work. Unfortunately, JPSP has a reputation for publishing only the best work, but this prestige is not warranted by the actual quality of published articles. For example, the obsession with longitudinal data is not warranted given evidence that about 80% of the variance in personality measures is stable trait variance that does not change. Repeatedly measuring this trait variance does not add to our understanding of stable traits.

Conclusion

To conclude, JPSP has published two cross-sectional articles of the structure of well-being that continue to be highly cited. We find major problems with the models in these articles, but JPSP is not interested in publishing a criticism of these articles. To reiterate, the main problem is that Diener’s SWB model is treated as if it is an objective hedonic theory of well-being, when the core aspect of the model is that well-being is subjective and not objective. We thought at least the main editor Rich Lucas, a former Diener student, would understand this point, but expectations are the mother of disappointment. Of course, we could be wrong about some minor or major issues, but the lack of interest in these foundational questions shows just how far psychology is from being a real science. A real science develops valid measures before it examines real questions. Psychologists invent measures and study their measures without evidence that their measures reflect important constructs like well-being. Not surprisingly, psychology has produced no consensual theory of well-being that could help people live better lives. This does not stop psychologists from making proclamations about ways to lead a happy or good life. The problem is that these recommendations are all contingent on researchers’ preferred definition of well-being and the measures associated with that tradition/camp/belief system. In this way, psychology is more like (other) religions and less like a science.

Decision Letter

I am writing about your manuscript “Two Concepts of Wellbeing: The Relation Between Psychological and Subjective Wellbeing”, submitted for publication in the Journal of Personality and Social Psychology (JPSP). I have read the manuscript carefully myself, as has the lead Editor at JPSP, Rich Lucas. We read the manuscript independently and then consulted with each other about whether the manuscript meets the threshold for full review. Based on our joint consultation, I have made the decision to reject your paper without sending it for external review. The Editor and I shared a number of concerns about the manuscript that make it unlikely to be accepted for publication and that reduce its potential contribution to the literature. I will elaborate on these concerns below. Due to the high volume of submissions and limited pages available to JPSP, we must limit our acceptances to manuscripts for which there is a general consensus that the contribution is of an important and highly significant level. 
 

  1. Most importantly, papers that rely solely on cross-sectional designs and self-report questionnaire techniques are less and less likely to be accepted here as the number of submissions increases. In fact, such papers are almost always rejected without review at this journal. Although such studies provide an important first step in the understanding of a construct or phenomenon, they have some important limitations. Therefore, we have somewhat higher expectations regarding the size and the novelty of the contribution that such studies can make. To pass threshold at JPSP, I think you would need to expand this work in some way, either by using longitudinal data or or by going further in your investigation of the processes underlying these associations. I want to be clear; I agree that studies like this have value (and I also conduct studies using these methods myself), it is just that many submissions now go beyond these approaches in some way, and because competition for space here is so high, those submissions are prioritized.
  2. A related concern has to do with a noticeable gap between your research question, theoretical framework, and research design. The introduction paints your question in broad strokes only, but my understanding is that you attempt to refine our understanding of the structure of well-being, which could be an important contribution to the literature. However, the introduction does not provide a clear rationale for the alternative model presented. Perhaps even more important, the cross-sectional correlational study of one U.S. sample is not suited to provide strong conclusions about the structure of well-being. At the very least, I would have expected to see model comparison tests to compare the fit of the presented model with those of alternative models. In addition, I would have liked to see a replication in an independent sample as well as more critical tests of the discriminant validity and links between these factors, perhaps in longitudinal data, through the prediction of critical outcomes, or by using behavioral genetic data to establish the genetic and environmental architecture of these factors. Put another way, independent of the validity of the Ryff / Keyes model, the presented theory and data did not convince me that your model is a better presentation of the structure of well-being.
  3. The use of a selected set of items rather than the full questionnaires raises concerns about over-fitting and complicate comparisons with other studies in this area. I recommend using complete questionnaires and – should you decide to collect more data – additional measures of well-being to capture the universe of well-being content as best as you can. 
  4. I noticed that you tend to use causal language in the description of correlations, e.g. between personality traits and well-being measures. As you certainly know, the data presented here do not permit conclusions about the temporal or causal influence of e.g., neuroticism on negative affect or vice versa and I recommend changing this language to better reflect the correlational nature of your data.     

In closing, I am sorry that I cannot be more positive about the current submission. I hope my comments prove helpful to you in your future research efforts. I wish you the very best of luck in your continuing scholarly endeavors and hope that you will continue to consider JPSP as an outlet for your work.


Sincerely,
Wiebke Bleidorn, PhD
Associate Editor
Journal of Personality and Social Psychology: Personality Processes and Individual Differences

Estimating the False Positive Risk in Psychological Science

Abstract: At most one-quarter of published significant results in psychology journals are false positive results. This is surprising news after a decade of false positive paranoia. However, the low false positive rate is not a cause for celebration. It mainly reflects the low a priori probability that the nil-hypothesis is true (Cohen, 1994). To produce meaningful results, psychologists need to maintain low false positive risks when they test stronger hypotheses that specify a minimum effect size.

Introduction

Like many other sciences, psychological science relies on null-hypothesis significance testing as the main statistical approach to draw inferences from data. This approach can be dated back to Fisher’s first manual for empirical researchers on how to conduct statistical analyses. If the observed test-statistic produces a p-value below .05, the null-hypothesis can be rejected in favor of the alternative hypothesis that the population effect size is not zero. Many criticisms of this statistical approach have failed to change research practices.

Cohen (1994) wrote a sarcastic article about NHST with the title “The Earth is round, p < .05.” In this article, Cohen made the bold claim “my work on power analysis has led me to realize that the nil-hypothesis is always false.” In other words, population effect sizes are unlikely to be exactly zero. Thus, rejecting the nil-hypothesis with a p-value below .05 only tells us something we already know. Moreover, when sample sizes are small, we often end up with p-values greater than .05 that do not allow us to reject a false null-hypothesis. I cite this article only to point out that in the 1990s, meta-psychologists were concerned with low statistical power because it produces many false negative results. In contrast, significant results were considered to be true positive findings. Although often meaningless (e.g., the amount of explained variance is greater than zero), they were not wrong.

Since then, psychology has encountered a radical shift in concerns about false positive results (i.e., significant p-values when the nil-hypothesis is true). I conducted an informal survey on social media. Only 23.7% of Twitter respondents echoed Cohen’s view that false positive results are rare (less than 25%). The majority (52.6%) of respondents assumed that more than half of all published significant results are false positives.

The results were a bit different for the poll in the Psychological Methods Discussion Group on Facebook. Here the majority opted for 25 to 50 percent false positive results.

The shift from the 1990s to the 2020s can be explained by the replication crisis in social psychology that has attracted a lot of attention and has been generalized to all areas of psychology (Open Science Collaboration, 2015). Arguably, the most influential article that contributed to concerns about false positive results in psychology is Simmons, Nelson, and Simonsohn’s (2011) article titled “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant” that has been cited 3,203 times. The key contribution of this article was to show that the questionable research practices (QRPs) that psychologists use to obtain p-values below .05 (e.g., using multiple dependent variables) can increase the risk of a false positive result from 5% to over 60%. Moreover, anonymous surveys suggested that researchers often engage in these practices (John et al., 2012). However, even massive use of QRPs will not produce a massive amount of false positive results if most null-hypotheses are false. In this case, QRPs will inflate the effect size estimates (that nobody pays attention to, anyways), but the rate of false positive results will remain low because most tested hypotheses are true.

Some scientists have argued that scientists are much more likely to make false assumptions (e.g., the Earth is flat) than Cohen envisioned. Ioannidis (2005) famously declared that Most published results are false. He based this claim on hypothetical scenarios that produce more than 50% false positive results when 90% of studies test a true null-hypothesis. This assumption is a near complete reversal of Cohen’s assumption that we can nearly always assume that the effect size is not zero. The problem is that the actual rate of true and false hypotheses is unknown. Thus, estimates of false positive rates are essentially projective tests of gullibility and cynicism.
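To see how strongly the answer depends on the assumed proportion of true null-hypotheses, here is a small sketch of the standard false discovery rate calculation. The power values and the 10% scenario are hypothetical choices for illustration; they are not Ioannidis's exact scenarios, which also model bias:

```python
def false_discovery_rate(prop_true_null, alpha, power):
    """Expected share of significant results that are false positives,
    given the proportion of true null-hypotheses, alpha, and the power
    of tests of false null-hypotheses."""
    false_positives = prop_true_null * alpha
    true_positives = (1 - prop_true_null) * power
    return false_positives / (false_positives + true_positives)

# Pessimistic scenario: 90% of tested null-hypotheses are true.
for power in (0.35, 0.80):  # hypothetical power values
    fdr = false_discovery_rate(0.90, 0.05, power)
    print(f"90% true nulls, power {power:.2f}: FDR = {fdr:.0%}")

# Cohen-like scenario: only 10% of tested null-hypotheses are true.
print(f"10% true nulls, power 0.35: FDR = {false_discovery_rate(0.10, 0.05, 0.35):.0%}")
```

With identical alpha and power, the two assumptions about the base rate of true null-hypotheses lead to very different conclusions, which is why an estimate based on actual data is needed.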

To provide psychologists with scientific information about the false positive risk in their science, we need a scientific method that can estimate the false discovery risk based on actual data rather than hypothetical scenarios. There have been several attempts to do so. So far, the most prominent study was Jager and Leek’s (2014) estimate of the false discovery rate in medicine. They obtained an estimate of 14%. Simulation studies showed some problems with their estimation model, but the superior z-curve method replicated the original result with a false discovery risk of 13%. This result is much more in line with Cohen’s view that most null-hypotheses are false (typically effect sizes are not zero) than with Ioannidis’s claim that the null-hypothesis is true in 90% of all significance tests.

In psychology, the focus has been on replication rates. The shocking finding was that only 25% of significant results in social psychology could be replicated in an honest and unbiased attempt to reproduce the original study (Open Science Collaboration, 2015). This low replication rate leaves ample room for false positive results, but it is unclear how many of the non-significant results were caused by a true null-hypothesis and how many were caused by low statistical power to detect an effect size greater than zero. Thus, this project provides no information about the false positive risk in psychological science.

Another noteworthy project used a representative sample of test results in social psychology journals (Motyl et al., 2017). This project produced over 1,000 p-values that were examined using a number of statistical tools available at that time. The key result was that there was clear evidence of publication bias. That is, focal hypothesis tests nearly always rejected the null-hypothesis, a finding that has been observed since the beginning of social psychology (Sterling, 1959). However, the actual power of studies to do so was much lower, a finding that is consistent with Cohen’s (1962) seminal analysis of power. The results provided no information about the false positive risk. In principle, this valuable dataset could be analyzed with statistical tools that estimate the false discovery risk (Schimmack, 2021). However, the number of significant p-values was too small to produce an informative estimate of the false discovery risk (k = 678; 95%CI = .09 to .82).

Results

A decade after the “False Positive Psychology” article rocked psychological science, it remains unclear how much false positive results contribute to replication failures in psychology. To answer this question, we report the results of a z-curve analysis of 1,857 significant p-values that were obtained from hand-coding a representative sample of studies that were published between 2009 and 2014. The years 2013 and 2014 were included to incorporate Motyl et al.’s data. All other coding efforts focussed on the years 2009 and 2010, before concerns about replication failures could have changed research practices. In marked contrast to previous initiatives, the aim was to cover all areas of psychology. To obtain a broad range of disciplines in psychology, a list of 120 journals was compiled (Schimmack, 2021). These journals are the top journals of their disciplines with high impact factors. Students had some freedom in picking journals of their choice. For each journal, articles were selected based on a fixed sampling scheme to code articles 1, 3, 6, and 10 for every set of 10 articles (1,3,6,10,11,13…). The project is ongoing and the results reported below should be considered preliminary. Yet, they do present the first estimate of the false discovery risk in psychological science.
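The fixed sampling scheme can be written down as a short generator. This is only a sketch of the rule described above (positions 1, 3, 6, and 10 within every block of 10 articles); the function name and defaults are mine, not part of the coding protocol:

```python
from itertools import count, islice

def sampled_positions(offsets=(1, 3, 6, 10), block_size=10):
    """Yield the article positions to code: the given offsets
    within each consecutive block of `block_size` articles."""
    for block in count(0):
        for offset in offsets:
            yield block * block_size + offset

# First positions selected for coding in a journal.
print(list(islice(sampled_positions(), 8)))
```

Because the positions are fixed in advance, coders have no discretion over which articles within a journal are selected.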

The results replicate many other findings that focal statistical tests are selected because they reject the null-hypothesis. Eighty-one percent of all tests had a p-value below .05. When marginally significant results are included as well, the observed discovery rate increases to 90%. However, the statistical power of studies does not warrant such high success rates. The z-curve estimate of mean power before selection for significance is only 31%; 95%CI = 19% to 37%. This statistic is called the expected discovery rate (EDR) because mean power is equivalent to the long-run percentage of significant results. Based on an insight by Soric (1989), we can use the EDR to quantify the maximum percentage of results that can be false positives, using the formula: FDR = (1/EDR – 1)*(alpha/(1-alpha)). The point estimate of the EDR of 31% corresponds to a point estimate of the False Discovery Risk of 12%. The 95%CI ranges from 8% to 28%. It is important to distinguish between the risk and rate of false positives. Soric’s method assumes that true hypotheses are tested with 100% power. This is an unrealistic assumption. When power is lower the false positive rate will be lower than the false positive risk. Thus, we can conclude from these results that it is unlikely that more than 25% of published significant results in psychology journals are false positive results.
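Soric's formula is simple enough to check by hand. A minimal sketch that reproduces the point estimate from the EDR of 31% (the function name is illustrative; the confidence interval comes from the z-curve analysis and is not recomputed here):

```python
def soric_fdr(edr, alpha=0.05):
    """Maximum false discovery risk implied by the expected discovery
    rate (EDR), following Soric (1989):
    FDR = (1/EDR - 1) * (alpha / (1 - alpha))."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

# Point estimate of the EDR reported above.
print(f"False discovery risk: {soric_fdr(0.31):.0%}")
```

The formula gives an upper bound because it assumes that all true hypotheses are tested with 100% power.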

One concern about these results is that the number of test statistics differed across journals and that Motyl et al.’s large set of results from social psychology could have biased the results. We therefore also analyzed the data by journal and then computed the mean FDR and its 95%CI. This approach produced an even lower FDR estimate of 11%, 95%CI = 9% to 15%.

While an FDR of less than 25% may seem good news in a field that is suffering from false positive paranoia, it is still unacceptably high to ensure that published results can be trusted. Fortunately, there is a simple solution to this problem because Soric’s formula shows that the false discovery risk depends on alpha. Lowering alpha to .01 is sufficient to produce a false discovery risk below 5%. Although this seems like a small adjustment, it results in the loss of 37% of significant results with p-values between .05 and .01. This recommendation is consistent with two papers that have argued against the blind use of Fisher’s alpha level of .05 (Benjamin et al., 2017; Lakens et al., 2018). The cost of lowering alpha to .005 would be to lose another 10% of significant findings (ODR = 47%).
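One way to see why alpha matters so much is to invert Soric's formula and ask how high the expected discovery rate must be to keep the false discovery risk below a target. This inversion is my own illustration and is not part of the z-curve output; note that the EDR itself also changes when alpha changes, so the sketch only shows how the required threshold moves:

```python
def soric_fdr(edr, alpha):
    """Soric's (1989) maximum false discovery risk."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

def min_edr_for_fdr(target_fdr, alpha):
    """Smallest EDR that keeps Soric's FDR at or below target_fdr,
    obtained by solving the formula above for EDR."""
    return 1 / (1 + target_fdr * (1 - alpha) / alpha)

for alpha in (0.05, 0.01, 0.005):
    edr = min_edr_for_fdr(0.05, alpha)
    print(f"alpha = {alpha}: EDR must be at least {edr:.1%} for FDR <= 5%")
```

At alpha = .05, more than half of all tests would have to be significant to keep the risk below 5%, whereas at alpha = .01 a much lower discovery rate is enough.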

Limitations and Future Directions

No study is perfect. As many women know, the first time is rarely the best time (Higgins et al., 2010). Similarly, this study has some limitations that need to be addressed in future studies.

The main limitation of this study is that the coded statistical tests may not be representative of psychological science. However, the random sampling from journals and the selection of a broad range of journals suggests that sampling bias has a relatively small effect on the results. A more serious problem is that there is likely to be heterogeneity across disciplines or even journals within disciplines. Larger samples are needed to test those moderator effects.

Another problem is that z-curve estimates of the EDR and FDR make assumptions about the selection process that may differ from the actual selection process. The best way to address this problem is to promote open science practices that reduce the selective publishing of statistically significant results.

Eventually, it will be necessary to conduct empirical tests with a representative sample of results published in psychology akin to the reproducibility project (Open Science Collaboration, 2015). As a first step, studies can be replicated with the original sample sizes. Results that are successfully replicated do not require further investigation. Replication failures need to be followed up with studies that can provide evidence for the null-hypothesis using equivalence testing with a minimum effect size that would be relevant (Lakens, Scheel, and Isager, 2018). This is the only way to estimate the false positive risk by means of replication studies.
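The logic of equivalence testing can be sketched with two one-sided tests. The sketch below uses a simple normal approximation rather than the t-test procedures described by Lakens et al. (2018), and the effect size estimates, standard errors, and equivalence bound are hypothetical numbers chosen only for illustration:

```python
from statistics import NormalDist

def tost_p_value(estimate, std_error, bound):
    """Two one-sided tests (normal approximation).

    Tests H0: effect <= -bound and H0: effect >= +bound.
    Equivalence is claimed only if both are rejected, so the
    larger of the two p-values is returned."""
    z = NormalDist()
    p_lower = 1 - z.cdf((estimate + bound) / std_error)  # against effect <= -bound
    p_upper = z.cdf((estimate - bound) / std_error)      # against effect >= +bound
    return max(p_lower, p_upper)

# Hypothetical large replication: small estimate, small standard error.
print(tost_p_value(estimate=0.02, std_error=0.03, bound=0.10))

# Hypothetical small replication: same estimate, large standard error.
print(tost_p_value(estimate=0.02, std_error=0.10, bound=0.10))
```

Only the precise replication can reject effects as large as the bound; the imprecise one leaves the question open, which is why replication failures with the original sample sizes need follow-up studies.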

Implications: What Would Cohen Say

The finding that most published results are not false may sound like good news for psychology. However, Cohen would merely point out that a low rate of false positive results merely reflects the fact that the nil-hypothesis is rarely true. If some hypotheses were true and others were false, NHST (without QRPs) could be used to distinguish between them. However, if most effect sizes are greater than zero, not much is learned from statistical significance. The problem is not p-values or dichotomous thinking. The problem is that nobody is testing risky hypotheses that an effect size is of a minimum size, and deciding in favor of the null-hypothesis when the data show the population effect size is not exactly zero, but practically meaningless (e.g., experimental ego-depletion effects are less than 1/10th of a standard deviation). Even specifying H0 as r < .05 or d < .01 would lower the discovery rates and increase the false discovery risk, while increasing the value of statistical significance.

Cohen’s clear distinction between the null-hypothesis and the nil-hypothesis made it clear that nil-hypothesis testing is a ritual with little scientific value, while null-hypothesis testing is needed to advance psychological science. The past decade has been a distraction by suggesting that nil-hypothesis testing is meaningful, but only if open science practices are used to prevent false positive results. However, open science practices do not change the fundamental problem of nil-hypothesis testing that Cohen and others identified more than two decades ago. It is often said that science is self-correcting, but psychologists have not corrected the way they formulate their hypotheses. If psychology wants to be a science, it needs to specify hypotheses that are worthy of empirical falsification. I am getting too old and cynical (much like my hero Cohen in the 1990s) to believe in change in my life-time, but I can write this message in a bottle and hope one day a new generation may find it and do something with it.

Open Science: Inside Peer-Review at PSPB

We submitted a ms. that showed problems with the validity of the race IAT as a measure of African Americans’ unconscious attitudes to PSPB (Schimmack & Howard, 2020). After waiting patiently for three months, we received the following decision letter from the acting editor Dr. Corinne Moss-Racusin at Personality and Social Psychology Bulletin. She assures us that she independently read our manuscript carefully – twice; once before and once after reading the reviews. This is admirable. Yet it is surprising that her independent reading of our manuscript places her in strong agreement with the reviewers. Somebody with less research experience might feel devastated by the independent evaluation by three experts that our work is “of low quality.” Fortunately, it is possible to evaluate the contribution of our manuscript from another independent perspective, namely the strength of the science.

The key claim of our ms. is simple. John Jost, Brian Nosek, and Mahzarin Banaji wrote a highly cited article that contained the claim that a large percentage of members of disadvantaged groups have an implicit preference for the out-group. As recently as 2019, Jost repeated this claim and used the term self-hatred to refer to implicit preferences for the out-group (Jost, 2019).

We expressed our doubt about this claim when the disadvantaged group are African Americans. Our main concern was that any claims about African Americans’ implicit preferences require a valid measure of African Americans’ preferences. The claim that a large number of African Americans have an implicit preference for the White outgroup rests entirely on results obtained with the Implicit Association Test (Jost, Nosek, & Banaji, 2004). However, since the 2004 publication, the validity of the race IAT as a measure of implicit preferences has been questioned in numerous publications, including my recent demonstration that implicit and explicit measures of prejudice lack discriminant validity (Schimmack, 2021). Even the author of the IAT is no longer supporting the claim that the race IAT is a measure of some implicit, hidden attitudes (Greenwald & Banaji, 2017). Aside from revisiting Jost et al.’s (2004) findings in light of doubts about the race IAT, we also conducted the first attempt at validating the race IAT for Black participants. Apparently, reading the article twice did not help Corinne Moss-Racusin to notice this new empirical contribution, even though it is highlighted in Figure 2. The key finding here is that we were able to identify an in-group preference factor because several explicit and implicit measures showed convergent validity (ig). For example, the evaluative priming task showed some validity with a factor loading of .42 in the Black sample. However, the race IAT failed to show any relationship with the in-group factor (p > .05). It was also unrelated to the out-group factor. Thus, the race IAT lacks convergent validity as a measure of in-group and out-group preferences among African Americans in this sample. Neither the two reviewers nor Corinne Moss-Racusin challenges this finding. They do not even comment on it. Instead, they proclaim that this research is of low quality. I beg to differ.
Based on any sensible understanding of the scientific method, it is unscientific to make claims about African Americans’ preferences based on a measure that has not been validated. It is even more unscientific to double down on a false claim when evidence is presented that the measure lacks validity.

Of course, one can question whether PSPB should publish this finding. After all, PSPB prides itself on being the flagship journal of the Society for Personality and Social Psychology (Robinson et al., 2021). Maybe valid measurement of African Americans’ attitudes is not relevant enough to meet the high standards of a 20% acceptance rate. However, Robinson et al. (2021) launched a diversity initiative in response to awareness that psychology has a diversity problem.

Maybe it will take some time before PSPB can find some associate editors to handle manuscripts that address diversity issues and are concerned with the well-being of African Americans. Meanwhile, we are going to find another outlet to publish our critique of Jost and colleagues’ unscientific claim that many African Americans hold negative views of their in-group that they are not aware of and that can only be revealed by their scores on the race IAT.

Editorial Decision Letter from Corinne Moss-Racusin

Re: “The race Implicit Association Test is Biased: Most African Americans Have Positive Attitudes towards their In-Group” (MS # PSPB-21-365)

Dear Dr. Schimmack:

Thank you for submitting your manuscript for consideration to Personality and Social Psychology Bulletin. I would like to apologize for the slight delay in getting this decision out to you. Both of my very young children have been home with me for the past month, due to Covid exposures at their schools. As their primary caregiver, this has created considerable difficulties. I appreciate your understanding as we all work to navigate these difficult and unprecedented times.

I have now obtained evaluations of the paper from two experts who are well-qualified to review work in this area.  Furthermore, I read your paper carefully and independently, both before and after looking at the reviews.

I found the topic of your work to be important and timely—indeed, I read the current paper with great interest. Disentangling in-group and out-group racial biases, among both White and Black participants (within the broader context of exploring System Justification Theory) is a compelling goal. Further, I strongly agree with you that exploring whether Black participants’ in-group attitudes have been systematically misrepresented by the (majority White) scientific community is of critical importance.

Unfortunately, as you will see, both reviewers have significant, well-articulated concerns that prevent them from supporting publication of the manuscript. For example, reviewer 1 stated that “Overall, I found this article to be of low quality. It argues against an argument that researchers haven’t made and landed on conclusions that their data doesn’t support.” Further, reviewer 2 (whose review is appropriately signed) wrote clearly that, “The purpose of this submission, it seems to me, is not to illuminate anything, really, and indeed very little, if anything, is illuminated. The purpose of the paper, it seems, is to create the appearance of something scandalous and awful and perhaps even racist in the research literature when, in fact, the substantive results obtained here are very similar to what has been found before. And if the authors really want to declare that the race-based IAT is a completely useless measure, they have a lot more work to do than re-analyzing previously published data from one relatively small study.”

See Reviewer 2’s comments and my response here

My own reading of your paper places me in strong agreement with the reviewer’s evaluations. I am sorry to report that I will not be able to accept your paper for publication in PSPB.

The reviewers’ comments are, in my several years of experience as an editor, unusually thorough and detailed. Thus, I will not reiterate them here.  Nevertheless, issues of primary concern involved both conceptual and empirical aspects of the manuscript. Although some of these issues might be addressed, to some degree, with some considerable re-thinking and re-writing, many cannot be addressed without more data and theoretical overhaul.

I was struck by the degree to which claims appear to stray quite far from both the published literature and the data at hand. As just one example, the section on “African American’s Resilience in a Culture of Oppression” (pp. 5-6) cites no published work whatsoever. Rather, you note that your skepticism regarding key components of SJT is based on “the lived experience of the second author,” which you then summarize. While individual case studies such as this can certainly be compelling, there are clear questions pertaining to generalizability and scientific merit, and the inability to independently validate or confirm this anecdotal evidence. While you do briefly acknowledge this, you proceed to make broad claims—such as “No one in her family or among her Black friends showed signs that they preferred to be White or like White people more than Black people. In small towns, the lives of Black and White people are more similar than in big cities. Therefore, the White out-group was not all that different from the Black in-group,” again without citing any evidence. I found it problematic to ground these bold claims and critiques largely in anecdote. Further, this raises serious concerns—as reviewer 2 articulates in some detail—that the current work may distort the current state of the science by exaggerating or mischaracterizing the nature of existing claims.

Let me say this clearly: I am strongly in favor of work that attempts to refine existing theoretical perspectives, and/or critique established methods, measures, and paradigms. I am not an IAT “purist” by any stretch, nor has my own recent work consistently included implicit measures. Indeed, as noted above, I read the current work with great interest and openness. Unfortunately, like both reviewers, I cannot support its publication in the current form.

I would sincerely encourage you to consider whether the future of this line of work could involve 1. Additional experiments, 2. Larger and more diverse samples, 3. True and transparent collaboration (whether “adversarial” or not) with colleagues from different ideological/empirical perspectives, and 4. Ensuring that claims align much more closely to what is narrowly warranted by the data at hand. Unfortunately, as it stands, the potential contributions of this work appear to be far overshadowed by its problematic elements.

I understand that you will likely be disappointed by my decision, but I urge you to pay careful attention to the reviewers’ constructive comments, as they may help you revise this manuscript or design further research.  Please understand that my decision was rendered with the recognition that the page limitations of the journal dictate that only a small percentage of submitted manuscripts can be accepted.  PSPB receives more than 700 submissions per year, but only publishes approximately 125 papers each year.  Papers without major flaws are often not accepted by PSPB because the magnitude of the contribution is not sufficient to warrant publication.  With careful revision, I think this paper might be appropriate for a more specialized journal, and I wish you success in finding an appropriate outlet for your work.

I am sorry that I cannot provide a more favorable response to your submission.  However, I do hope that you will again consider PSPB as your research progresses.

P.S. I asked the acting editor to clarify her comments and her views about the validity of the race IAT as a measure of African Americans’ unconscious preferences. They declined to comment further.

Inside Anonymous Peer Review

After a desk-rejection from JPSP, my co-author and I submitted our ms. to PSPB (see blog https://replicationindex.com/2021/07/28/the-race-implicit-association-test-is-biased/). After several months, we received the expected rejection. But it was not all in vain. We received a detailed review that shows how little social psychologists really care about African Americans even when they claim to study racism and discrimination.

As peer-reviews are considered copyrighted material belonging to the reviewer, I cannot share the review in full. Rather I will highlight important sections that show how little authors with the authority of an expert reviewer pay attention to inconvenient scientific criticism of their work.

Here is the key issue. Our paper provides new evidence that the race IAT is an invalid measure of African Americans’ attitudes towards their own group and the White out-group. This new evidence is based on a reanalysis of the data that were used by Bar-Anan and Nosek (2014) to claim that the race IAT is the most valid measure to study African Americans’ implicit attitudes. Here is what the reviewer had to say about this.

(6) It has been a while since I read the Bar-Anan and Nosek (2014) article, but my memory for it is incompatible with the claim that those authors were foolish enough to simply assume that the most valid implicit measures was the one that produced the biggest difference between Whites and Blacks in terms of in-group bias, as the present authors claim (pp. 7-8).


So, the reviewer relies on his foggy memory to question our claim instead of retrieving a pdf file and checking for himself. New York University should be proud of this display of scholarship. I hope Jost made sure to get his Publons credit. Here is the relevant section from Bar-Anan and Nosek (2014, p. 675; https://link.springer.com/article/10.3758/s13428-013-0410-6).

A lazy recollection is used to dismiss the results of a new statistical analysis. This is how closed, confidential, back-room peer review works, which means it does not work. It does not serve the purpose of presenting all scientific arguments in the open and letting data decide between opposing views. Pre-publication peer review is not a reliable and credible mechanism to advance science. For this reason, I will publish as much as possible in journals with open peer review (e.g., Meta-Psychology). Open science without open exchange of ideas and conflicts is not open, trustworthy, or credible.

Psychology Intelligence Agency

I always wanted to be James Bond, but being 55 now it is clear that I will never get a license to kill or work for a government intelligence agency. However, the world has changed and there are other ways to spy on dirty secrets of evil villains.

I have started to focus on the world of psychological science, which I know fairly well because I was a psychological scientist for many years. During my time as a psychologist, I learned about many of the dirty tricks that psychologists use to publish articles to further their careers without advancing understanding of human behavior, thoughts, and feelings.

However, so far the general public, government agencies, or government funding agencies that hand out taxpayers’ money to psychological scientists have not bothered to monitor the practices of psychological scientists. They still believe that psychological scientists can control themselves (e.g., peer review). As a result, bad practices persist because the incentives favor behaviors that lead to publication of many articles even if these articles make no real contribution to science. I therefore decided to create my own Psychological Intelligence Agency (PIA). Of course, I cannot give myself a license to kill, and I have no legal authority to enforce laws that do not exist. However, I can gather intelligence (information) and share this information with the general public. This is less James Bond and more CIA that also shares some of its intelligence with the public (CIA factbook), or the website Retraction Watch that keeps track of article retractions.

Some of the projects that I have started are:

Replicability Rankings of Psychology Journals
Keeping track of the power (expected discovery rate, expected replication rate) and the false discovery risk of test results published in over 100 psychology journals from 2010 to 2020.

Personalized Criteria of Statistical Significance
It is problematic to use the standard criterion of significance (alpha = .05) when this criterion leads to few discoveries because researchers test many false hypotheses or test true hypotheses with low power. When discovery rates are low, alpha should be set to a lower value (e.g., .01, .005, .001). Here I used estimates of authors’ discovery rate to recommend an appropriate alpha level to interpret their results.

Quantitative Book Reviews
Popular psychology books written by psychological scientists (e.g., Nobel Laureate Daniel Kahneman) reach a wide audience and are assumed to be based on solid scientific evidence. Using statistical examinations of the sources cited in these books, I provide information about the robustness of the scientific evidence to the general public (see also “Before you know it”).

Citation Watch
Science is supposed to be self-correcting. However, psychological scientists often cite outdated references that fit their theory without citing newer evidence that their claims may be false (a practice known as cherry-picking citations). Citation Watch reveals these bad practices by linking articles with misleading citations to articles that question the claims supported by cherry-picked citations.

Whether all of this intelligence gathering will have a positive effect depends on how many people actually care about the scientific integrity of psychological science and the credibility of empirical claims. Fortunately, some psychologists are willing to learn from past mistakes and are improving their research practices (Bill von Hippel).


What would Cohen say to 184 Significance Tests in 1 Article

I was fortunate enough to read Jacob Cohen’s articles early on in my career to avoid many of the issues that plague psychological science. One of his important lessons was that it is better to test a few hypotheses (or better, just one) in one large sample (Cohen, 1990) than to conduct many tests in small samples.

The reason is simple. Even if a theory makes a correct prediction, sampling error may produce a non-significant result, especially in small samples where sampling error is large. This type of error is known as a type-II error, beta, or a false negative. The probability of obtaining the desired and correct outcome of a significant result when a hypothesis is true is called power. The problem of testing multiple hypotheses is that the cumulative or total power of finding evidence for all correct hypotheses decreases with the number of tests. Even if a single test has 80% power (i.e., the probability of a significant result for a correct hypothesis is 80 percent), the probability of providing evidence for all 10 of 10 correct hypotheses is only .8^10 = .11 (11%). The expected value is that 2 of the 10 tests produce a type-II error (Schimmack, 2012).
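The arithmetic above can be checked with a short Python sketch. It assumes the tests are independent, which is what the .8^10 calculation implies; the function names are made up for illustration.

```python
def prob_all_significant(power: float, n_tests: int) -> float:
    """Probability that all n independent tests of true hypotheses are significant."""
    return power ** n_tests


def expected_type2_errors(power: float, n_tests: int) -> float:
    """Expected number of non-significant results among n tests of true hypotheses."""
    return (1 - power) * n_tests


print(round(prob_all_significant(0.80, 10), 2))   # 0.11
print(round(expected_type2_errors(0.80, 10), 1))  # 2.0
```

The same two functions can be reused for any other combination of power and number of tests.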

Cohen (1962) also noted that the average power of statistical tests is well below 80%. For a medium/average effect size, power was around 50%. Now imagine that a researcher tests 10 true hypotheses with 50% power. The expected value is that 5 tests produce a significant result (p < .05) and 5 tests produce a type-II error (p > .05). The interpretation of the article will focus on the significant results, but they were selected basically by a coin flip. The next study will produce a different set of 5 significant results.

To avoid type-II errors, researchers could conduct a priori power analysis to ensure that they have enough power. However, this is rarely done, with the explanation that a priori power analysis requires knowledge about the population effect size, which is unknown. However, it is possible to estimate the typical power of studies by keeping track of the percentage of significant results. Because power determines the rate of significant results, the rate of significant results is an estimate of average power. The main problem with this simple method of estimating power is that researchers often do not report all of their results. Especially before the replication crisis became apparent, psychologists tended to publish only significant results. As a result, it is largely unknown how much power actual studies in psychology have and whether power has increased since Cohen (1962) estimated power to be around 50%.

Here I illustrate a simple way to estimate actual power of studies with a recent multi-study article that reported a total of 184 significance tests (more were reported in a supplement, but were not coded)! Evidently, Cohen’s important insights remain neglected, especially in journals that pride themselves on rigorous examination of hypotheses (Kardas, Kumar, & Epley, 2021).

Figure 2 shows the first rows of the coding spreadsheet (Spreadsheet).

Each row shows one specific statistical test. The column “HO rejected” reflects how authors interpreted a result. Broadly, this decision is based on the p < .05 rule, but sometimes authors are willing to treat values just above .05 as sufficient evidence, which is often called marginal significance. The column p < .05 strictly follows the p < .05 rule. The averages in the top row show that there are 77% significant results using authors’ rules and 71% using the p < .05 rule. This shows that 6% of the p-values were interpreted as marginally significant.

All test-values or point estimates with confidence intervals are converted into exact two-sided p-values. The two-sided p-values are then converted into z-scores using the inverse normal formula; z = -qnorm(p/2). Observed power is then estimated for the standard criterion of significance; alpha = .05, which corresponds to a z-score of 1.96. The formula for observed power is pnorm(z, 1.96). The top row shows that mean observed power is 69%. This is close to the 71% percentage with the strict p < .05 rule, but a bit lower than the 77% when marginally significant results are included. This simple comparison shows that marginally significant results inflate the percentage of significant results.
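For readers who do not use R, the two conversions can be sketched in Python with the standard library. `NormalDist().inv_cdf` plays the role of `qnorm` and `NormalDist().cdf` the role of `pnorm`; the function names and the three example p-values are illustrative assumptions, not values from the coded article.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def p_to_z(p_two_sided: float) -> float:
    """Absolute z-score for a two-sided p-value (R: -qnorm(p / 2))."""
    return -STD_NORMAL.inv_cdf(p_two_sided / 2)


def observed_power(z: float, alpha: float = 0.05) -> float:
    """Observed power for a z-score (R: pnorm(z, 1.96) when alpha = .05).

    Like the R formula, this ignores the negligible opposite tail.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    return STD_NORMAL.cdf(z - z_crit)


# Hypothetical p-values, chosen only to show the conversion.
for p in (0.05, 0.01, 0.001):
    z = p_to_z(p)
    print(f"p = {p}: z = {z:.2f}, observed power = {observed_power(z):.2f}")
```

A result that is just significant (p = .05) has an observed power of 50%, which is why a set of results with many p-values close to .05 produces a low mean observed power.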

The inflation column keeps track of the consistency between the outcome of a significance test and the power estimate. When power is practically 1, a significant result is expected and inflation is zero. However, when power is only 60%, there is a 40% chance of a type-II error and authors were lucky if they got a significant result. This can happen in a single test, but not in the long run. Average inflation is a measure of how lucky authors were if they got more significant results than the power of their studies allows. Using the authors’ 77% success rate and estimated power of 69%, we have an inflation of 8%. This is a small bias, and we already saw that interpretation of marginal results accounts for most of it.

The last column is called the Replication Index (R-Index). It simply subtracts the inflation from the observed power estimate. The reason is that observed power is an inflated estimate of power when there are too many significant results. The R-Index is called an index because the formula is just an approximate correction for selection for significance. Later I show the results with a better method. However, the Index can clearly distinguish between junk science (R-Index below 50) and credible evidence. Based on the present results, the R-Index of 62 shows that the article reported some credible findings. Moreover, the R-Index now underestimates power because the rate of p-values below .05 is consistent with observed power. The inflation is just due to the interpretation of marginal results as significant. In short, the main conclusion from this simple analysis of test statistics in a single article is that the authors conducted studies with an average power of about 70%. This is expected to produce type-II errors, sometimes with p-values close to .05 and sometimes with p-values well above .1. This could mean that nearly a quarter of the published results are type-II errors.
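The inflation and R-Index columns follow directly from the two averages reported above. A minimal sketch, using the rounded percentages from the text as inputs (the function name is hypothetical):

```python
def r_index(success_rate: float, mean_observed_power: float) -> tuple[float, float]:
    """Return (inflation, R-Index) from a success rate and mean observed power."""
    inflation = success_rate - mean_observed_power
    return inflation, mean_observed_power - inflation


# Rounded averages quoted in the text: 77% success rate, 69% observed power.
inflation, r = r_index(0.77, 0.69)
print(f"inflation = {inflation:.0%}, R-Index = {r:.0%}")
```

With these rounded inputs the sketch gives an R-Index of 61 rather than the reported 62; the one-point difference presumably comes from the spreadsheet working with unrounded averages.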

But what about type-I errors?

Cohen was concerned about the problem that many underpowered studies fail to provide evidence for true hypotheses. However, the replication crisis shifted the focus from false negative results to false positive results. An influential article by Simmons et al. (2011) suggested that many if not most published results might be false positive results. The authors also developed a statistical tool called p-curve that examines whether a set of significant results is entirely based on false positive results. The next figure shows the output of the p-curve app for the 130 significant results (only significant results are considered because p-values greater than .05 cannot be false positives).

The graph shows that there are a lot more p-values below .01 (78%) than p-values between .04 and .05 (2%). This distribution of p-values is inconsistent with the hypothesis that all significant results are false positives. In addition, the program estimates that the average power of the 130 studies with significant results is 99%! As a result, there can be no false positives that would produce an estimate of 5% power. It is noteworthy that the p-curve analysis did not spot the inflation of significant results by interpreting marginally significant results because these results are omitted from the p-curve analysis. It is rather unlikely that the average power of studies is 99%. In fact, simulation studies have shown that the power estimates of p-curve are often inflated when studies are heterogeneous (Brunner, 2018; Brunner & Schimmack, 2020). The p-curve authors are aware of this bug, but have done nothing to fix it (Datacolada, 2018).
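The logic of the first point can be illustrated with a simple calculation. If every significant result were a false positive, p-values would be uniformly distributed, so only 20% of the p-values below .05 should fall below .01. The sketch below is not the test that the p-curve app runs; it is a plain binomial check, written under that uniformity assumption, of how surprising 78% out of 130 would be.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_sig = 130                      # significant results analyzed in the post
k_low = round(0.78 * n_sig)      # roughly 78% of them have p < .01
share_under_null = 0.01 / 0.05   # uniform p-values: 20% of p < .05 are below .01

# Probability of at least k_low such p-values if all results were false positives.
print(binom_upper_tail(k_low, n_sig, share_under_null))
```

The resulting probability is vanishingly small, which is the sense in which the observed distribution rules out the all-false-positives scenario.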

A better statistical method to analyze p-values is z-curve, which relies on the z-scores that were obtained from the p-values in the spreadsheet. However, the z-curve package for R can also read p-values. The next Figure shows a histogram of all 184 (significant and non-significant) values up to a value of 6. Values over 6 are not shown and are all treated as studies with perfect power.

The expected replication rate corresponds to the power estimate in p-curve. It is notably lower than 99% and the 95%CI excludes a value of 99%. This finding simply shows once again that p-curve estimates are inflated.

The observed discovery rate is simply the same percentage that was computed on the spreadsheet using a strict p < .05 rule. The expected discovery rate is an estimate of the average power for all studies, including those with non-significant results, that is corrected for any potential inflation. It is 62%, which matches the R-Index in the spreadsheet.

The comparison of the observed discovery rate of 71% and the expected discovery rate of 62% suggests that there is some overreporting of significant results. However, the 95%CI around the EDR estimate ranges from 27% to 88%. Thus, sampling error alone may explain this discrepancy.

An EDR of 62% implies that only a small number of significant results can be false positives. The point estimate is just 2%, but the 95%CI allows for up to 14% false positives. Thus, the reported results are unlikely to be false positives, but effect sizes could be inflated because selection for significance with modest power inflates effect size estimates.
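The text does not spell out how a discovery rate translates into a maximum false positive rate. A minimal sketch, assuming the bound proposed by Sorić (1989), which is the conversion I understand z-curve to use; the function name is made up for illustration.

```python
def soric_fdr(edr: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the false discovery rate for a given discovery rate."""
    return (1 / edr - 1) * (alpha / (1 - alpha))


# Lower limit of the 95%CI of the EDR reported above (27%).
print(f"{soric_fdr(0.27):.0%}")  # 14%
```

Under this assumption, the lower confidence limit of 27% reproduces the 14% upper limit for false positives quoted above. The bound rises quickly as the discovery rate falls, which is why low discovery rates call for a stricter alpha.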

There is also notable evidence of heterogeneity. The distribution of z-scores is much flatter than a standard normal distribution that is expected if all studies had the same power. This means that some results might be more credible than others. Therefore I conducted some moderator analyses.

One key hypothesis in the article was that shallow and deep conversations differ in important ways. Several studies tested this by comparing shallow and deep conversations. Fifty-four analyses included a contrast between shallow and deep conversations as a main effect or in an interaction. The expected replication rate is unchanged. The expected discovery rate is a bit higher, but surprisingly, the observed discovery rate is lower. Visual inspection of the z-curve plot shows an unusually high number of marginally significant results. This is further evidence to distrust marginally significant results. However, overall these results suggest that shallow and deep conversations differ.

Several analyses tested mediation, which can require large samples to have adequate power. Not surprisingly, the 39 mediation tests have only a replication rate of 53%. There is also some suggestion of bias, with an observed discovery rate of 51% and an expected discovery rate of only 25%, but the 95%CI around the point estimate is wide and includes 51%. The low expected discovery rate implies that the false discovery risk is 16%, which is unacceptably high.

One solution to the high false discovery risk is to lower the criterion for significance. The next conventional level is alpha = .01. The next figure shows the results for this criterion value (the red solid line has moved to z = 2.58).

Now the observed discovery rate is in line with the expected discovery rate (28% vs. 27%) and the false discovery risk has been lowered to 3%. However, the expected replication rate (for alpha = .01) is only 36%. Thus, follow-up studies need to increase sample sizes to replicate these mediation effects.

Conclusion

A post-hoc power-analysis of this recent article shows that psychologists still have not learned Cohen’s lesson that he shared in 1990 (more than 30 years ago). Conducting many significance tests with modest statistical power produces a confusing pattern of significant and non-significant results that is strongly influenced by sampling error. Rather than reporting results of individual studies, the authors should have reported meta-analytic results for tests of the same hypothesis. However, to end on a positive note, the studies are not p-hacked and the risk of false positives is low. Thus, the results provide some credible findings that can be used to conduct confirmatory tests of the hypothesis that deeper conversations are more awkward, but also more rewarding. I hope these analyses show that a deep dive into the statistical results reported in an article can also be rewarding.

Citation Watch

Good science requires not only open and objective reporting of new data; it also requires unbiased review of the literature. However, there are no rules and regulations regarding citations, and many authors cherry-pick citations that are consistent with their claims. Even when studies have failed to replicate, original studies are cited without citing the replication failures. In some cases, authors even cite original articles that have been retracted. Fortunately, it is easy to spot these acts of unscientific behavior. Here I am starting a project to list examples of bad scientific behaviors. Hopefully, more scientists will take the time to hold their colleagues accountable for ethical behavior in citations. They can even do so by posting anonymously on the PubPeer comment site.

Entry DateTable of Incorrect Citations
21/10/27Authors: Jürgen Kornmeier ,Kriti Bhatia,Ellen Joos
Year: 2021
Citation: In the present paradigm, it is of course not difficult to comprehend the direct influence from past percepts of a disambiguated lattice figure on the perception of a highly similar but ambiguous lattice variant. In other precognition paradigms, such as some of those used in the experiments of the seminal Bem paper [85], the potential role of the perceptual history is not as directly comprehensible as in the present study– which does not necessarily rule it out.
DOI: https://doi.org/10.1371/journal.pone.0258667
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/26Authors: B. Keith Payne1, Jason W. Hannay
Year: 2021
Citation: One of the most important contributions from psychological science is the concept of implicit bias. Implicit bias refers to positive or negative mental associations cued spontaneously by social groups. It is measured using cognitive tasks that test how those associations facilitate, interfere with, or otherwise bias task performance [5,6]. Many studies suggest that implicit bias is widespread, even among people who explicitly endorse egalitarian attitudes [7,8].
Others argue that implicit bias is a stable trait- like construct, and that context effects or temporal fluctuations reflect only measurement error [50,51].
DOI:
Correction: This quote and many other citations in this article fail to mention that the concept of implicit bias is controversial and lacks strong empirical support. There are many critical articles to cite, but my own criticism of the construct validity of implicit measures references most of them (https://doi.org/10.1177/1745691619863798). Another article directly criticizes Payne and is not cited (https://journals.sagepub.com/doi/abs/10.1177/1745691620931492). The authors cite my article [51], but fails to mention that it also contains evidence to support the claim that implicit racial bias measurs have only modest convergent validity with explicit racism measures and very little discriminant validity.
21/10/25Authors: Cassandra Baldwin, Katie E. Garrison, Roy F. Baumeister & Brandon J. Schmeichel
Year: 2021
Citation: Research has found that the capacity for executive control may work as if it depended on a limited resource. Effortful acts of control consume some of this resource, resulting in a state known as ego depletion (Baumeister et al., 1998; Muraven & Baumeister, 2000).
DOI: 10.1080/15298868.2021.1888787
Correction: does not cite meta-analysis that shows publication bias and no evidence for the effect (https://doi.org/10.3389/fpsyg.2014.00823). Also does not cite two failed replication attempts in major RRR (https://doi.org/10.1177/1745691616652873, https://doi.org/10.1177/0956797621989733)
21/10/25Authors: Liad Uziel, Roy F. Baumeister, and Jessica L. Alquist
Year: 2021
Citation: Furthermore, temporary reduction in selfcontrol (following laboratory manipulations or activities such as alcohol consumption) often causes an increase in careless and impulsive acts (Baumeister et al., 2007; Hagger et al., 2010).
DOI: https://doi.org/10.1037/mot0000213
Correction: cite an outdated meta-analysis that did not control for publication bias and fail to cite an updated meta-analysis that shows clear evidence of publication bias and no evidence for an effect (https://doi.org/10.3389/fpsyg.2014.00823), see also https://replicationindex.com/2016/04/18/is-replicability-report-ego-depletionreplicability-report-of-165-ego-depletion-articles/
21/10/20Authors: Nicole C. Nelson, Julie Chung, Kelsey Ichikawa, and Momin M. Malik
Year: 2021
Citation: The second event Earp and Tramifow point to is the publication of psychologist Daryl Bem’s (2011) paper "Feeling the future,” which presented evidence suggesting hat people could anticipate evocative stimuli before they actually happened (such as the ppearance of an erotic image)."
DOI: 10.1177/10892680211046508
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/20Authors: Gregory D. Webster, Val Wongsomboon, Elizabeth A. Mahar
Year: 2021
Citation: To be sure, quantity need not reflect quality in published articles (e.g., see Bem’s [2011] nine-study article purporting experimental evidence of precognition).
DOI: https://doi.org/10.1177/174569162199753
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/20Authors: Jason Chin, Justin T. Pickett, Simine Vazire, Alex O. Holcombe
Year: 2021
Citation: The threat posed by QRPs has been discussed most extensively in the field of psychology, arguably the eye of the storm of the “replication crisis.” In the wake of the “False Positive Psychology” paper (Simmons et al. 2011), Daryl Bem’s paper claiming to find evidence of Extra Sensory Perception (ESP; Bem 2011), and several cases of fraud, the field of psychology entered a period of intense self-examination.
DOI: https://doi.org/10.1007/s10940-021-09525-6
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/20Citation: Daryl Bem became notorious for publication of two articles in high-quality journals claiming the existence of ESP (Bem, 2011; Bem & Honorton, 1994). The experimental design and the statistical power looked persuasive enough to lead the editors and reviewers to a decision to publish despite the lack of a theory to explain the results.
DOI: https://doi.org/10.3758/s13420-021-00474-5
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/20Authors: T. D. Stanley, Hristos Doucouliagos, John P. A. Ioannidis, Evan C. Carter
Year: 2021
Citation: Bem conducted some dozen(s) of experiments that asked students to “feel the future” by responding in the present to random future stimulus that was unknown to both subjects and experimenters at the time.46–49 Even though Bem seemed to employ state-of-the-art methods, his findings that students could “feel the future” were implausible to most psychologists.
DOI: 10.1002/jrsm.1512
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/20Author: Guido W. Imbens
Year: 2021
Citation: In the Journal of Personality and Social Psychology, Bem (2011) studies whether precognition exists: that is, whether future events retroactively affect people’s responses. Reviewing nine experiments, he finds (from the abstract): “The mean effect size (d) in psi performance across all nine experiments was 0.22, and all but one of the experiments yielded statistically significant results.” This finding sparked considerable controversy, some of it methodological.
DOI: https://doi.org/10.1257/jep.35.3.157
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/20Authors: Mariella Paul, Gisela H. Govaart, Antonio Schettino
Year: 2021
Citation: "Over the last decade, findings from a number of research disciplines have been under careful scrutiny. Prominent examples of research supporting incredible conclusions (Bem, 2011),
DOI: https://doi.org/10.1016/j.ijpsycho.2021.02.016
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/20Authors: Andrew T. Little, Thomas B. Pepinsky
Year: 2021
Citation: A prominent example here is Bem (2011) on extrasensory perception, which played a central role in uncovering the problems of p-hacking in psychology.
DOI: https://doi.org/10.1086/710088
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/20Authors: Bruno Verschuere, Franziska M. Yasrebi-de Kom, Iza van Zelm, MSc, Scott O. Lilienfeld
Year: 2021
Citation: and the spurious “discovery” of precognition (Bem, 2011)
DOI: https://doi.org/10.1521/pedi_2019_33_426
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)

21/10/20Authors: Lincoln J. Colling, Dénes Szűcs
Year: 2021
Citation: A series of events in the early 2010s, including the publication of Bem’s (2011) infamous study on extrasensory perception (or PSI), and data fabrication by Diederik Stapel and others (Stroebe et al. 2012), led some prominent researchers to claim that psychological science was suffering a crisis of confidence (Pashler and Wagenmakers 2012).
DOI: https://doi.org/10.1007/s13164-018-0421-4
Correction: Does not cite evidence that Bem used questionable research practices to produce these results (https://psycnet.apa.org/record/2012-23130-001, https://doi.org/10.3758/s13423-012-0227-9, https://replicationindex.com/2018/01/05/bem-retraction/) and that the results failed to replicate (https://doi.org/10.1037/a0029709)
21/10/20Authors: Mónika Gergelyfi, Ernesto J. Sanz-Arigita, Oleg Solopchuk, Laurence Dricot, Benvenuto Jacob, Alexandre Zénon
Year: 2021
Citation: Theories of MF can be classified in two major groups that assume either: (a) alterations of motivational processes leading to restrictions on the recruitment of cognitive resources for the task at hand (…) or b) progressive functional alteration of cognitive processes through metabolic mechanisms (Gailliot and Baumeister, 2007; Christie and Schrater, 2015; Holroyd, 2015; Hopstaken et al., 2015; Blain et al., 2016; Gergelyfi et al., 2015).
DOI: https://doi.org/10.1016/j.neuroimage.2021.118532
Correction: do not cite meta-analysis that shows publication bias and no evidence for glucose effects on willpower (https://doi.org/10.1177/0956797616654911)
21/10/20Authors: Alexandra Touroutoglou, Joseph Andreano, Bradford C. Dickerson, Lisa Feldman Barrett
Year: 2020
Citation: Some accounts hold that effort serves to manage intrinsic costs to finite resources such as metabolic resources (Gailliot and Baumeister, 2007; Gailliot et al., 2007; Holroyd, 2016),
DOI: https://doi.org/10.1016/j.cortex.2019.09.011
Correction: do not cite meta-analysis that shows publication bias and no evidence for glucose effects on willpower (https://doi.org/10.1177/0956797616654911)

21/10/10Authors: Scott W. Phillips; Dae-Young Kim
Year: 2021
Citation: Johnson et al. (2019) found no evidence for disparity in the shooting deaths of Black or Hispanic people. Rather, their data indicated an anti-White disparity in OIS deaths.
DOI: https://journals.sagepub.com/doi/10.1177/0093854821997529
Correction: Retraction (https://www.pnas.org/content/117/30/18130)
21/10/10Authors: Richard Stansfield, Ethan Aaronson, Adam Okulicz-Kozaryn
Year: 2021
Citation: While recent studies increasingly control for officer and incident characteristics (e.g., Fridell & Lim, 2016; Johnson et al., 2019; Ridgeway et al., 2020)
DOI: https://doi.org/10.1016/j.jcrimjus.2021.101828
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
21/10/10Authors: P. A. Hancock; John D. Lee; John W. Senders
Citation: Misattributions involved in such processes of assessment can, as we have seen, lead to adverse consequences (e.g., Johnson et al., 2019).
DOI: 10.1177/00187208211036323
Correction: Retraction (https://www.pnas.org/content/117/30/18130)
21/10/10Authors: Desmond Ang
Citation: While empirical evidence of racial bias is mixed (Nix et al. 2017; Fryer 2019; Johnson et al. 2019; Knox, Lowe, and Mummolo 2020; Knox and Mummolo 2020)
DOI: 10.1093/qje/qjaa027
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
21/10/10Authors: Lara Vomfell; Neil Stewart
Year: 2021
Citation: Some studies have argued that the general population in an area is not the appropriate comparison: instead one should compare rates of use of force to how often Black and White people come into contact with police [59–61]
DOI: https://www.nature.com/articles/s41562-020-01029-w
Correction: [60] Johnson et al. Retracted (https://www.pnas.org/content/117/30/18130)
21/10/10Authors: Jordan R. Riddell; John L. Worrall
Year: 2021
Citation: Recent years have also seen improvements in benchmarking-related research, that is, in formulating methods to more accurately analyze whether bias (implicit or explicit) or racial disparities exist in both UoF and OIS. Recent examples include Cesario, Johnson, and Terrill (2019), Johnson, Tress, Burkel, Taylor, and Cesario (2019), Shjarback and Nix (2020), and Tregle, Nix, and Alpert (2019).
DOI: https://doi.org/10.1016/j.jcrimjus.2020.101775
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
21/10/10Authors: Dean Knox, Will Lowe, Jonathan Mummolo
Year: 2021
Citation: A related study, Johnson et al. (2019), attempts to estimate racial bias in police shootings. Examining only positive cases in which fatal shootings occurred, they find that the majority of shooting victims are white and conclude from this that no antiminority bias exists
DOI: https://doi.org/10.1017/S0003055420000039
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
21/10/10Authors: Ming-Hui Li, Pei-Wei Li, Li-Lin Rao
Year: 2021
Citation: The IAT has been utilized in diverse areas and has proven to have good construct validity and reliability (Gawronski et al., 2020).
DOI: https://doi.org/10.1016/j.paid.2021.111107
Correction: does not cite critique of the construct validity of IATs (https://doi.org/10.1177/1745691619863798)

21/10/10Authors: Chew Wei Ong, Kenichi Ito
Year: 2021
Citation: This penalty treatment of error trials has been shown to improve the correlations between the IAT and explicit measures, indicating a greater construct validity of the IAT.
DOI: 10.1111/bjso.12503
Correction: higher correlations do not imply higher construct validity of IATs as measures of implicit attitudes (https://doi.org/10.1177/1745691619863798)
21/10/10Authors: Sara Costa, Viviana Langher, Sabine Pirchio
Year: 2021
Citation: The most used method to assess implicit attitudes is the “Implicit Association Test” (IAT; Greenwald et al., 1998), which presents a good reliability (Schnabel et al., 2008) and validity (Nosek et al., 2005; Greenwald et al., 2009).
DOI: 10.3389/fpsyg.2021.712356
Correction: does not cite critique of the construct validity of IATs (https://doi.org/10.1177/1745691619863798)
21/10/10Authors: Christoph Bühren, Julija Michailova
Year: 2021
Citation: not available, behind paywall
DOI: 10.4018/IJABE.2021100105
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
21/10/10Authors: Yang, Gengfeng, Zhenzhen, Dongjing
Year: 2021
Citation: "Studies have found that merely activating the concept of money can increase egocentrism, which can further redirect people's attention toward their inner motivations and needs (Zaleskiewicz et al., 2018) and reduce their sense of connectedness with others (Caruso et al., 2013).
DOI: 10.1002/cb.1973
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
21/10/10Authors: Garriy Shteynberg, Theresa A. Kwon, Seong-Jae Yoo, Heather Smith, Jessica Apostle, Dipal Mistry, Kristin Houser
Year: 2021
Citation: Money is often described as profane, vulgar, and filthy (Belk & Wallendorf, 1990), yet incidental exposure to money increases the endorsement of the very social systems that render such money meaningful (Caruso et al., 2013).
DOI: 10.1002/jts5.95
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
21/10/10Author: Arden Rowell
Year: 2021
Citation: In particular, some studies show that encouraging people to think about things in terms of money may measurably change people's thoughts, feelings, motivations, and behaviors. See Eugene M. Caruso, Kathleen D. Vohs, Brittani Baxter & Adam Waytz, Exposure to Money Increases Endorsement of Free-Market Systems and Social Inequality, 142 J. EXPERIMENTAL PSYCH. 301, 301-02, 305 (2013)
DOI: https://scholarship.law.nd.edu/ndlr/vol96/iss4/9
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
21/10/10Authors: Anna Jasinenko, Fabian Christandl, Timo Meynhardt
Year: 2020
Citation: Caruso et al. (2013) find that exposure to money (which is prevalent in most shopping situations) activates personal tendencies to justify the market system. Furthermore, they find that money exposure also activates general system justification; however, the effect was far smaller than for the activation of MSJ.
DOI: https://doi.org/10.1016/j.jbusres.2020.04.006
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)

A one hour introduction to bias detection

This introduction to bias detection builds on introductions to test-statistics and statistical power (Intro to Statistics, Intro to Power).

It is well known that many psychology articles report too many significant results because researchers selectively publish results that support their predictions (Francis, 2014; Sterling, 1959; Sterling et al., 1995; Schimmack, 2021). This often leads to replication failures (Open Science Collaboration, 2015).

One way to examine whether a set of studies reported too many significant results is to compare the success rate (i.e., the percentage of significant results) with the mean observed power in studies (Schimmack, 2012). In this video, I illustrate this bias detection method using Vohs et al.’s (2006) Science article “The Psychological Consequences of Money.”
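As a rough illustration of this comparison, the sketch below converts two-sided p-values into observed power under a normal approximation and contrasts the mean with the success rate. The function names and the p-values are hypothetical and chosen only for illustration; they are not the values coded from Vohs et al. (2006), and the coding sheet used in the video may compute observed power differently.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal distribution

def observed_power(p_value, alpha=0.05):
    """Observed power of a two-sided z-test, derived from its p-value."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)   # absolute z-score implied by p
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)    # critical value for alpha
    # Chance that a replication lands beyond either critical value if the
    # true signal-to-noise ratio equals the observed z-score.
    return (1 - STD_NORMAL.cdf(z_crit - z_obs)) + STD_NORMAL.cdf(-z_crit - z_obs)

def bias_check(p_values, alpha=0.05):
    """Return success rate, mean observed power, and their difference."""
    success_rate = mean(1.0 if p < alpha else 0.0 for p in p_values)
    mean_power = mean(observed_power(p, alpha) for p in p_values)
    return success_rate, mean_power, success_rate - mean_power

# Hypothetical p-values for nine studies (NOT the published values).
example_p_values = [0.04, 0.03, 0.02, 0.045, 0.01, 0.035, 0.025, 0.015, 0.049]

success_rate, mean_power, inflation = bias_check(example_p_values)
print(f"success rate:        {success_rate:.2f}")
print(f"mean observed power: {mean_power:.2f}")
print(f"inflation:           {inflation:.2f}")
```

A success rate far above the mean observed power suggests that more significant results were reported than the studies' power can plausibly support.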

I use this article for training purposes because it reports 9 studies, and a reasonably large number of studies is needed to have good power to detect selection bias. Also, the article is short and the results are straightforward. Thus, students have no problem filling out the coding sheet that is needed to compute observed power (Coding Sheet).

The results show clear evidence of selection bias that undermines the credibility of the reported results (see also TIVA). Although bias tests are available, few researchers use them to protect themselves from junk science, and articles like this one continue to be cited at high rates (683 total, 67 in 2019). A simple way to protect yourself from junk science is to adjust the alpha level to .005 because many questionable practices produce p-values that are just below .05. For example, the lowest p-value in these 9 studies was p = .006. Thus, not a single study was statistically significant with alpha = .005.

Intro to Statistical Power in One Hour

Last week I posted a video that provided an introduction to the basic concepts of statistics, namely effect sizes and sampling error. A test statistic, like a t-value, is simply the ratio of the effect size over sampling error. This ratio is also known as a signal-to-noise ratio. The bigger the signal (effect size), the more likely it is that we will notice it in our study. Similarly, the less noise we have (sampling error), the easier it is to observe even small signals.
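The signal-to-noise idea can be sketched in a few lines of Python. This is a minimal example, assuming an independent-samples t-test with equal variances; the function name and the two small data sets are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def t_value(group_a, group_b):
    """Independent-samples t-value, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    # Signal: the raw mean difference between the two groups.
    effect = mean(group_a) - mean(group_b)
    # Noise: the standard error of that difference (pooled variance).
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    sampling_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return effect / sampling_error

# Made-up scores for illustration only.
treatment = [6, 7, 8, 9, 10]
control = [4, 5, 6, 7, 8]

print(t_value(treatment, control))
# Same mean difference, four times as many observations: less noise, larger t.
print(t_value(treatment * 4, control * 4))
```

Repeating the same scores four times leaves the mean difference unchanged but shrinks the sampling error, so the t-value grows.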

In this video, I use the basic concepts of effect sizes and sampling error to introduce the concept of statistical power. Statistical power is defined as the long-run percentage of studies that produce a statistically significant result. When alpha is set to .05, it is the expected percentage of p-values below .05.

Statistical power is important to avoid type-II errors; that is, cases in which there is a meaningful effect, but the study fails to provide evidence for it. While researchers cannot control the magnitude of effects, they can increase power by lowering sampling error. Thus, researchers should carefully think about the magnitude of the expected effect to plan how large their sample has to be to have a good chance to obtain a significant result. Cohen proposed that a study should have at least 80% power. The planning of sample sizes using power calculation is known as a priori power analysis.
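A minimal sketch of an a priori power analysis is shown below. It assumes a two-sided comparison of two equally sized groups, a standardized effect size d, and a normal approximation to the t-distribution; the function names are hypothetical, and dedicated power software may give slightly larger sample sizes because it uses the exact t-distribution.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected signal-to-noise ratio: effect size divided by sampling error.
    noncentrality = d * (n_per_group / 2) ** 0.5
    return ((1 - STD_NORMAL.cdf(z_crit - noncentrality))
            + STD_NORMAL.cdf(-z_crit - noncentrality))

def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Approximate sample size per group needed to reach the target power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)

# Plan a study for a medium effect (d = 0.5) with Cohen's 80% power target.
n_needed = n_per_group_for_power(0.5)
print(n_needed, round(power_two_groups(0.5, n_needed), 3))

# Power of the same design if only 20 participants per group are collected.
print(round(power_two_groups(0.5, 20), 3))
```

The second call shows why overly optimistic planning matters: with the same effect size but a much smaller sample, power falls well below the 80% target.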

The problem with a priori power analysis is that researchers may fool themselves about effect sizes and conduct studies with insufficient sample sizes. In this case, power will be less than 80%. It is therefore useful to estimate the actual power of studies that are being published. In this video, I show that actual power could be estimated by simply computing the percentage of significant results. However, in reality this approach would be misleading because psychology journals discriminate against non-significant results. This is known as publication bias. Empirical studies show that the percentage of significant results for theoretically important tests is over 90% (Sterling, 1959). This does not mean that mean power of psychological studies is over 90%. It merely suggests that publication bias is present. In a follow-up video, I will show how it is possible to estimate power when publication bias is present. This video is important to understand what statistical power is.
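A small simulation can illustrate why the published success rate is a poor estimate of power once non-significant results are filtered out. This is only a sketch under stated assumptions: all studies are z-tests with the same true power, and a hypothetical parameter, publish_prob_nonsig, sets the chance that a non-significant result is published. None of these numbers come from the literature.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def simulate_literature(true_power=0.50, publish_prob_nonsig=0.05,
                        n_studies=10_000, alpha=0.05, seed=1):
    """Return the success rate of all conducted studies and of published ones."""
    rng = random.Random(seed)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Signal-to-noise ratio that gives the requested power
    # (the tiny contribution of the opposite tail is ignored).
    noncentrality = z_crit + STD_NORMAL.inv_cdf(true_power)
    conducted, published = [], []
    for _ in range(n_studies):
        z = rng.gauss(noncentrality, 1.0)      # one study's test statistic
        significant = abs(z) > z_crit
        conducted.append(significant)
        # Significant results are always published; others only rarely.
        if significant or rng.random() < publish_prob_nonsig:
            published.append(significant)
    return (sum(conducted) / len(conducted),
            sum(published) / len(published))

all_rate, published_rate = simulate_literature()
print(f"success rate, all conducted studies: {all_rate:.2f}")
print(f"success rate, published studies:     {published_rate:.2f}")
```

Under these assumptions, the success rate of all conducted studies stays close to the true power, while the published record shows a much higher rate of significant results.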