
Bayes-Factors in Favor of the Nil-Hypothesis are Meaningless

Zoltan Dienes just published an article in the journal that is supposed to save psychological science, Advances in Methods and Practices in Psychological Science. It is a tutorial about Bayes-Factors, which are advocated by Dienes and others as a solution to alleged problems with null-hypothesis significance testing (NHST).

The advantage of Bayes-Factors is supposed to be their ability to provide evidence for the null-hypothesis, whereas NHST is one-sided and can only reject the null-hypothesis. The claim is that this one-sidedness led to the problem that authors only published articles with p-values less than .05.

“Significance testing is a tool that is commonly used for this purpose; however, nonsignificance is not itself evidence that something does not exist. On the other hand, a Bayes factor can provide a measure of evidence for a model of something existing versus a model of it not existing”

The problem with this attack on NHST is that it is false. The main reason why NHST is unable to provide evidence for the non-existence of an effect is that it is logically impossible to show with empirical data that something does not exist or that the difference between two populations is exactly zero. For this reason, it has been pointed out again and again that it is silly to test the nil-hypothesis that there is no effect or that a mean difference or correlation is exactly zero.

This does not mean that it is impossible to provide evidence for the absence of an effect. The solution is simply to specify a range of values that are too small to be considered meaningful. Once the null-hypothesis is specified as a region of values, it becomes empirically testable with NHST or with Bayesian methods. However, neither NHST nor Bayesian methods can provide evidence for a point hypothesis, and the idea that Bayes-Factors can be used to do so is an illusion.

The real problem for demonstrations of the absence of an effect is that small samples with between-subject designs produce large regions of plausible values because small samples have large sampling errors. Observed mean differences or correlations therefore move around considerably, and it is difficult to say much about the effect size in the population: it may be within a region around zero (H0) or outside this region (H1).

Let’s illustrate this with Dienes’ first example. “Theory A claims that autistic subjects will perform worse on a novel task than control subjects will. Theory B claims that the two groups will perform the same.” A researcher tests these two “theories” in a study with 30 participants in each group.

The statistical results that serve as the input for NHST or Bayesian statistics are the effect size, the sampling error, and the degrees of freedom.

The autistic group had a score of 8 percent with a sampling error of 6 percentage points. The 95%CI ranges from -4 to 20.

The control group had a score of 10 percent with a sampling error of 5 percentage points. The 95%CI ranges from 0 to 20.

Evidently, the confidence intervals overlap, but they also allow for large differences between the two populations from which these small samples were recruited.

A comparison of the two groups yields a standardized effect size of d = .05 with se = .20, t = .05/.20 = 0.25. The 95%CI for the standardized mean difference between the two groups ranges from d = -.35 to .45 and includes values for a small negative effect (d = -.2) as well as a small positive effect (d = .2).
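For readers who want to follow along, these numbers can be reproduced with a few lines of base R (the effect size and standard error are taken from the text; the interval uses the same ±2 SE approximation as the confidence intervals above).

d  <- .05                      # standardized mean difference
se <- .20                      # standard error of d
t  <- d / se                   # t = 0.25
d + c(-2, 2) * se              # 95%CI from -.35 to .45
2 * pt(abs(t), df = 30 + 30 - 2, lower.tail = FALSE)   # two-sided p of about .80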

Nevertheless, the default prior that is advocated by Wagenmakers and Rouder yields a Bayes-Factor of 0.27, which is below the arbitrary and low criterion of 1/3 that is used to claim that the data favor the model of absolutely no performance difference. It is hard to reconcile this claim with the 95%CI that allows for values as large as d = .4. However, to maintain the illusion that Bayes-Factors can miraculously provide evidence for the nil-hypothesis, Bayesian propaganda claims that confidence intervals are misleading. Even if we do not trust confidence intervals, we can ask how a study with four times as much sampling error (se = .20) as the effect size (d = .05) can assure us that the true population effect size is 0. It cannot.
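As a rough check, the default Bayes-Factor can be computed from the t-value and the group sizes. The sketch below assumes the BayesFactor package and its ttest.tstat helper; the exact value depends on the prior scale.

library(BayesFactor)
res  <- ttest.tstat(t = 0.25, n1 = 30, n2 = 30, rscale = sqrt(2) / 2)  # default "medium" prior
bf10 <- exp(res$bf)   # bf is returned on the log scale; this is evidence for a difference
1 / bf10              # evidence for the point null (BF01)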

A standard NHST analysis produces an unimpressive p-value of .80. Everybody knows that this p-value cannot be used to claim that there is no effect, but few people know why this p-value is uninformative. First, it is uninformative because it uses d = 0 as the null-hypothesis. We can never prove that this hypothesis is false. However, we could set d = .2 as the lowest effect size that we consider a meaningful difference. Thus, we can compute the t-value for a one-sided test of whether the observed value of d = .05 is significantly below d = .2. This is standard NHST. We may also recognize that the sample size is rather small and adjust our alpha criterion accordingly, allowing for a 20% chance of falsely rejecting the null-hypothesis that the effect size is d = .2 or larger. As we are only expecting worse performance, this is a one-sided test.

pt(.05/.20, 28, .20/.20) gives us a p-value of .226. This is still not good enough to reject the null-hypothesis that the true performance difference in the population is d = .2 or larger. The problem is that a study with 30 participants per group in a between-subject design simply has too much sampling error to draw inferences about the population.
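Spelled out with named arguments, the calculation from the text looks like this (values exactly as given above):

t_obs <- 0.05 / 0.20            # observed effect size divided by its standard error
ncp   <- 0.20 / 0.20            # smallest meaningful effect (d = .2) divided by the same standard error
pt(t_obs, df = 28, ncp = ncp)   # ~.23: the data cannot rule out d >= .2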

Thus, there are two reasons why psychologists rarely provide evidence for the absence of an effect. First, they always specify the null-hypothesis as a point value. This makes it impossible to provide evidence for the null-hypothesis. Second, the sampling error is typically too large to draw firm conclusions about the absence of an effect. What is the solution to improve psychological science? Theories need to be specified with some minimal effect size. For example, tests of ego-depletion, facial feedback, or terror management (to name just a few) need to make explicit predictions about effect sizes. If even small effects are considered theoretically meaningful, studies that aim to demonstrate these effects need to be powered accordingly. For example, to test an effect of d = .2 with an 80% chance of a successful outcome, if the theory is right, requires N = 788 participants. If this study were to produce a non-significant result, one would also be justified in inferring that the population effect size is trivial (d < .20) with an error probability of 20%. So, true tests of theories require specification of a boundary effect size that distinguishes meaningful effects from negligible ones. And theorists who claim that their theory is meaningful even if effect sizes are small (e.g., Greenwald’s predictive validity of IAT scores) have to pay the price and conduct studies that can detect these effects.
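The sample size of N = 788 can be verified with the power function in base R (two-sided alpha = .05; about 394 participants per group):

power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)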

In conclusion, how do we advance psychological science? With better data. Statisticians are getting paid for publishing statistics articles. They have been unhelpful in advancing statistics for the past one hundred years with their infighting over the right statistical tool for inconclusive data (between-subject, N = 30). Let them keep fighting, but let’s ignore them. We will only make progress by reducing sampling error so that we can see signals, or the absence of signals, clearly. And the only statistician you need to read is Jacob Cohen. The real enemy is not NHST or p-values, but sampling error.

Christopher J. Bryan claims replicators p-hack to get non-significant results. I claim he p-hacked his original results.

Open draft for a Response Article to be Submitted to PNAS. (1200 words, Commentary/Letters only allowed 500 words). Co-authors are welcome. Please indicate intent and make contributions in the comments section. This attack on valuable replication work needs a response.

Draft 12/2/19

Title?

Bryan, Walton, Rogers, and Dweck reported three studies that suggested a slight change in message wording can have dramatic effects on voter turnout (1). Gerber, Huber, Biggers, and Hendry reported a failure to replicate this result (2). Bryan, Yeager, and O’Brien reanalyzed Gerber et al.’s data and found a significant result consistent with their original results (3). Based on this finding, Bryan et al. (2019) make two claims that go beyond the question about ways to increase voter turnout. First, Bryan et al. accuse Gerber et al. (2016) of exploiting replicators’ degrees of freedom to produce a non-significant result. Others have called this practice reverse p-hacking (4). Second, they claim that many replicators may engage in deceptive practices to produce non-significant results because these results are deemed easier to publish. We take issue with these claims about the intentions and practices of researchers who conduct replication studies. Moreover, we present evidence that Bryan et al.’s (2011) results are likely to be biased by the exploitation of researchers’ degrees of freedom. This conclusion is consistent with widespread evidence that social psychologists in 2011 were abusing statistical methods to inflate effect sizes in order to publish eye-catching results that often do not replicate (5). We argue that only a pre-registered replication study with high precision will settle the dispute about the influence of subtle linguistic cues on voter turnout.

Bryan et al. (2011)

Study 1 used a very small sample size of n = 16 participants in each condition. After transforming the dependent variable, a t-test produced a just significant result (p < .05 & p > .005), p = .044. Study 2 had 88 participants, but power was reduced because the outcome variable was dichotomous. A chi-square test again produced a just significant result, p = .018. Study 3 increased the sample size considerably (N = 214), which should also increase power and produce a smaller p-value if the population effect size is the same as in Study 2. However, the observed effect size was weaker and the result was again just significant, p = .027. In the wake of the replication crisis, awareness has increased that sampling error produces large variability in p-values and that a string of just-significant p-values is unlikely to occur by chance. Thus, the results reported by Bryan et al. (2011) suggest that researchers’ degrees of freedom were used to produce significant results (6). For example, converted into observed power, the p-values imply 52%, 66%, and 60% power, respectively. It is unlikely that three studies with an average power of 60% produce three significant results; the expected value is only 1.8 significant results. These calculations are conservative because questionable research practices inflate estimates of observed power. The replication index (R-Index) corrects for this bias by subtracting the inflation rate from the estimate of observed power (7). With 60% mean observed power and a 100% success rate, the inflation rate is 40 percentage points, and the R-Index is 60% – 40% = 20%. Simulations show that an R-Index of 20% is obtained when the null-hypothesis is true. Thus, the published results provide no empirical evidence that subtle linguistic cues influence voter turnout, because they are incredible.
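As a rough sketch of these calculations in R (p-values of the three studies as reported above):

p <- c(.044, .018, .027)
z <- qnorm(1 - p / 2)                  # convert two-sided p-values to z-scores
obs_power <- pnorm(z - qnorm(.975))    # observed power: ~.52, .66, .60
sum(obs_power)                         # expected number of significant results, ~1.8
inflation <- 1 - mean(obs_power)       # success rate (3 of 3) minus mean observed power
mean(obs_power) - inflation            # R-Index, ~.19 (rounded to 20% in the text)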

Gerber et al. (2016)

Gerber et al. conducted a conceptual replication study with a much larger sample (n = 2,236 in the noun condition, n = 2,232 in the verb condition). Their effect was in the same direction, but much weaker and not statistically significant, 95%CI = -1.8 to 3.8. They also noted that the original studies were conducted on the day before elections or in the morning of election day, limited their analysis to the day of elections, and reported a non-significant result for this analysis as well. Gerber et al. discuss various reasons for their replication failure that assume the original results are credible (e.g., internet vs. phone contact). They even consider the possibility that their finding could be a type-II error, although this implies that the population effect size is much smaller than the estimates in Bryan et al.’s (2011) study.

Bryan et al. (2019)

Bryan et al. (2019) noted that Gerber et al. never reported the results of a simple comparison of the two linguistic conditions limited to participants who were contacted on the day before elections. When they conducted this analysis with a one-sided test and alpha = .05, they obtained a significant result, p = .036. They consider these results a successful replication, and they allege that Gerber et al. intentionally did not report this result. We do not know why Gerber et al. (2016) did not report this result, but we are skeptical that it can be considered a successful replication for several reasons. First, adding another just significant result to a series of just significant results makes the evidence weaker, not stronger (5). The reason is that a credible set of studies with modest power should contain some non-significant results. The absence of such non-significant results undermines the trustworthiness of the reported results. The maximum probability of obtaining a just significant result (.05 to .005) is 33%. The probability of this outcome in four out of four studies is just .33^4 = .012. Thus, even if we consider Gerber et al.’s study a successful replication, the results do not provide strong support for the hypothesis that subtle linguistic manipulations have a strong effect on voter turnout. Another problem with Bryan et al.’s conclusions is that they put too much weight on the point estimates of effect sizes. “In sum, the evidence across the expanded set of model specifications that includes the main analytical choices by Gerber et al. supports a substantial and robust effect consistent with the original finding by Bryan et al.” (p. 6). This claim ignores that just significant p-values imply that the corresponding confidence intervals barely exclude an effect size of zero (i.e., p = .05 implies that 0 is the lower bound of the 95%CI). Each result individually therefore cannot be used to claim that the population effect size is large. It is also not possible to use standard meta-analysis to reduce sampling error because there is evidence of selection bias. In short, the reanalysis found a significant result with a one-sided test for a subset of the data. This finding is noteworthy, but hardly a smoking gun that justifies claims that reverse p-hacking was used to hide a robust effect.

Broader Implications for the Replication Movement

Bryan et al. (2019) generalize from their finding of a one-sided significant p-value in a conceptual replication study to replication studies in general. Many of these generalizations are invalid because Bryan et al. do not differentiate between different types of replication studies. First, there are registered replication reports (6). Registered replication reports are approved before data are collected and are guaranteed publication regardless of the study outcome. Thus, Bryan et al.’s claim that replicators use researchers’ degrees of freedom to produce null-results because they are easier to publish does not apply to these replication studies. Nevertheless, registered replication reports have shaken the foundations of social psychology by failing to replicate ego depletion or facial feedback effects. Moreover, these replication failures were also predicted by incredible p-values in the original articles. In contrast, bias tests fail to show reverse p-hacking in replication studies. Readers of Bryan et al. (2019) should therefore simply ignore their speculations about the motives and practices of researchers who conduct replication studies. Our advice for Bryan et al. (2019) is to demonstrate that subtle linguistic cues can influence voter turnout with a preregistered replication report. The 2020 elections are just around the corner. Good luck, you guys.

References

(1) C. J. Bryan, G. M. Walton, T. Rogers, C. S. Dweck, Motivating voter turnout by invoking the self. Proc. Natl. Acad. Sci. U.S.A. 108, 12653–12656 (2011).

(2) A. S. Gerber, G. A. Huber, D. R. Biggers, D. J. Hendry, A field experiment shows that subtle linguistic cues might not affect voter behavior. Proc. Natl. Acad. Sci. U.S.A. 113, 7112–7117 (2016).

(3) C. J. Bryan, D. S. Yeager, J. M. O’Brien, Replicator degrees of freedom allow publication of misleading failures to replicate. Proc. Natl. Acad. Sci. U.S.A. 116 (2019).

(4) F. Strack, Reflection on the smiling registered replication report. Perspectives on Psychological Science, 11, 929-930 (2016).

(5) G. Francis, The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180-1187 (2014).

(6) U. Schimmack, The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566 (2012).

(7) U. Schimmack, A Revised Introduction to the R-Index.  https://replicationindex.com/2016/01/31/a-revised-introduction-to-the-r-index/

Personality, Partnership and Well-Being

Personality psychologists have been successful in promoting the Big Five personality factors as a scientific model of personality. Short scales have been developed that make it possible to include Big Five measures in studies with large, nationally representative samples. These data have been used to examine the influence of personality on well-being in married couples (Dyrenforth et al., 2010).

The inclusion of partners’ personality in studies of well-being has produced two findings. First, being married to somebody with a desirable personality (low neuroticism, high extraversion, openness, agreeableness, and conscientiousness) is associated with higher well-being. Second, similarity in personality is not a predictor of higher well-being.

A recent JPSP article mostly replicated these results (van Scheppingen, Chopik, Bleidorn, & Denissen, 2019). “Similar to previous studies using difference scores and profile correlations, results from response surface analyses indicated that personality similarity explained a small amount of variance in well-being as compared with the amount of variance explained by linear actor and partner effects” (e51)

Unfortunately, personality psychologists have made little progress in the measurement of the Big Five and continue to use observed scale scores as if they are nearly perfect measures of personality traits. This practice is problematic because it has been demonstrated in numerous studies that a large portion of the variance in Big Five scale scores is measurement error. Moreover, systematic rating biases have been shown to contaminate Big Five scale scores.

Anusic et al. (2009) showed how at least some of the systematic measurement errors can be removed from Big Five scale scores by means of structural equation modelling. In a structural equation model, the shared variance due to evaluative biases can be modelled with a halo factor, while the residual variance is treated as a more valid measure of the Big Five traits.

The availability of partner data makes it possible to examine whether the halo biases of husbands and wives are correlated. It is also possible to see whether the halo bias of a partner has a positive effect on well-being. As halo bias in self-ratings is often considered a measure of self-enhancement, it is possible that partners who enhance their own personality have a negative effect on well-being. Alternatively, partners who enhance themselves are also more likely to have a positive perception of their partner (Kim et al., 2012), which could increase well-being. An interesting question is how much the partner’s actual personality influences well-being after halo bias is removed from the partner’s ratings of personality.

It was easy to test these hypotheses with the correlations reported in Table 1 of van Scheppingen et al.’s article, which is based on N = 4,464 couples in the Health and Retirement Study. Because information about standard deviations was not provided, all SDs were set to 1. However, the actual SDs of Big Five traits tend to be similar, so this is a reasonable approximation.

I fitted the Halo-Alpha-Beta model to the data, but, as with other datasets, alpha could not be identified. Instead, a positive correlation between agreeableness and extraversion was present in this Big Five measure, which may reflect secondary loadings that could be modelled with items as indicators. I allowed the two halo factors to be correlated and allowed well-being to be predicted by actor-halo and partner-halo. I also allowed for spousal similarity for each Big Five dimension. Finally, well-being was predicted by self-neuroticism and partner-neuroticism because neuroticism is the strongest predictor of well-being. This model had acceptable fit, CFI = .981, RMSEA = .038.
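For readers who want to try a similar analysis, the sketch below shows how the core of such a model can be specified in lavaan. It is a simplified illustration, not the exact model in Figure 1, and the variable names (n_h, e_h, ..., wb_h for husbands; n_w, ..., wb_w for wives; cor_table1 for the input correlation matrix) are placeholders.

library(lavaan)

halo_model <- "
  # evaluative bias (halo) in each spouse's Big Five self-ratings
  halo_h =~ e_h + o_h + a_h + c_h + n_h
  halo_w =~ e_w + o_w + a_w + c_w + n_w
  halo_h ~~ halo_w              # spousal similarity in halo

  # true neuroticism = neuroticism rating with halo variance removed
  tn_h =~ 1 * n_h
  tn_w =~ 1 * n_w
  n_h ~~ 0 * n_h
  n_w ~~ 0 * n_w
  tn_h ~~ 0 * halo_h
  tn_h ~~ 0 * halo_w
  tn_w ~~ 0 * halo_h
  tn_w ~~ 0 * halo_w
  tn_h ~~ tn_w                  # spousal similarity in true neuroticism

  # well-being predicted by own and partner halo and true neuroticism
  wb_h ~ halo_h + halo_w + tn_h + tn_w
"

fit <- sem(halo_model, sample.cov = cor_table1, sample.nobs = 4464)
summary(fit, fit.measures = TRUE, standardized = TRUE)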

Figure 1 shows the model and the standardized parameter estimates.

The main finding is that self-halo is the strongest predictor of self-rated well-being. This finding replicates Kim et al.’s (2012) results. True neuroticism (tna; i.e., variance in neuroticism ratings without halo bias) is the second strongest predictor. The third strongest predictor is the partner’s true neuroticism, although it explains less than 1% of the variance in well-being. The model also shows a positive correlation between partners’ halo factors, r = .32. This is the first demonstration that spouses’ halos are positively correlated. More research is needed to examine whether this is a robust finding and what factors contribute to spousal similarity in halo. This correlation has implications for spousal similarity in actual personality traits. After removing shared halo variance, spousal similarity is only notable for openness, r = .19, and neuroticism, r = .13.

The key implication of this model is that actual personality traits, at least those measured with the Big Five, have a relatively small effect on well-being. The only trait with a notable contribution is neuroticism, but the partner’s neuroticism explains less than 1% of the variance in well-being. An open question is whether the effect of self-halo should be considered a true effect on well-being or whether it simply reflects shared method variance (Schimmack & Kim, in press).

It is well-known that well-being is relatively stable over extended periods of time (Anusic & Schimmack, 2016; Schimmack & Oishi, 2005) and that spouses have similar levels of well-being (Schimmack & Lucas, 2010). The present results suggest that the Big Five personality traits account for only a small portion of the stable variance that is shared between spouses. This finding should stimulate research that looks beyond the Big Five to study well-being in married couples. This blog post shows the utility of structural equation modelling for doing so.

The (not so great) Power of Situational Manipulations in the Laboratory

Social psychology is based on the fundamental assumption that brief situational manipulations can have dramatic effects on behavior. This assumption seemed to be justified by sixty years of research that often demonstrated large effects of subtle and sometimes even subliminal manipulations of the situation. However, since 2011 it has become clear that these impressive demonstrations were a sham. Rather than reporting the actual results of studies, social psychologists selectively reported results that were statistically significant, and because they used small samples, effect sizes were inflated dramatically to produce significant results. Thus, selective reporting of results from between-subject experiments with small samples ensured that studies could only provide evidence for the power of the situation.

Most eminent social psychologists who made a name for themselves using this flawed scientific method have been silent and have carefully avoided replicating their cherished findings from the past. In a stance of defiant silence, they pretend that their published results are credible and should be taken seriously.

A younger generation of social psychologists has responded to the criticism of old studies by improving their research practices. The problem for these social psychologists is that subtle manipulations of situations at best have subtle effects on behavior. Thus, the results are no longer very impressive, and even with larger samples, it is difficult to provide robust evidence for them. This is illustrated with an article by Van Dessel, Hughes, and De Houwer (2018) in Psychological Science.

The article has all the features of the new way of doing experimental social psychology. The article received badges for sharing materials, sharing data, and preregistration of hypotheses. The authors also mention an a-priori power analysis that assumed a small to medium effect size.

The OSF materials provide further information about the power calculations in the Design documents of each study (https://osf.io/wjz3u/). I compiled this information in the Appendix. It shows that the authors did not take attrition due to exclusion criteria into account and that they computed power for one-tailed significance tests. This leads to lower power for the two-tailed significance tests that are used in the article. The authors also assumed a stronger effect size for Studies 3 and 4, although these studies tested riskier hypotheses (actual behavior, one-day delay between manipulation and measurement of the dependent variables). Most importantly, the authors powered studies to have 80% power for each individual hypothesis test, which means that they can at best expect to find 80% significant results in their tests of multiple hypotheses (Schimmack, 2012).
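To illustrate the one-tailed issue (with hypothetical numbers, not the authors' planned values), power for the same effect size and sample size drops when a two-tailed test is used:

power.t.test(n = 100, delta = 0.35, sig.level = .05, alternative = "one.sided")$power  # ~.79
power.t.test(n = 100, delta = 0.35, sig.level = .05, alternative = "two.sided")$power  # ~.69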

Indeed, the authors found some non-significant results. For example, the IAT did not show the predicted effect in Study 1. Studies 1 and 2 mostly showed the predicted results, but they lack ecological validity because they examined responses to novel, fictional stimuli.

Studies 3 and 4 are more important for understanding actual human behaviors because they examined health behaviors with cookies and carrots as stimuli. The main hypothesis was that a novel Goal-Relevant Avatar-Consequences Task would shift health behaviors, intentions, and attitudes. Table 3 shows means for several conditions and dependent variables that produce 12 hypothesis tests.

The next table shows the t-values for the 12 hypothesis tests. These t-values can be converted into p-values, z-scores, and observed power. The last column records whether the result is significant with alpha = .05.
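One way to carry out these conversions is a short sketch using the noncentral t-distribution and a two-sided alpha of .05 (the t-value in the example is hypothetical):

obs_power <- function(t, df, alpha = .05) {
  crit <- qt(1 - alpha / 2, df)
  # probability of a significant result if the observed t were the true noncentrality
  pt(crit, df, ncp = abs(t), lower.tail = FALSE) + pt(-crit, df, ncp = abs(t))
}
p_value <- function(t, df) 2 * pt(abs(t), df, lower.tail = FALSE)
obs_power(t = 2.0, df = 190)   # ~.51
p_value(t = 2.0, df = 190)     # ~.047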

The most important result is that the median observed power is 64% and matches the success rate of 67%. Thus, the results are credible and there is no evidence to suggest that QRPs were used to produce significant results. However, the consistent estimates also suggest that the studies did not have 80% power, as the authors intended based on their a priori assumption that effect sizes would be small to moderate. In fact, the average effect size is d = .29. An a priori power analysis with this effect size shows that n = 188 participants per cell (total N = 376) are needed to achieve 80% power. Thus, all studies were underpowered.

Power can be improved by combining theoretically equivalent cells. This produces significant results for consumer choice, d = .36, t(571) = 4.21, the explicit attitude measure, d = .24, t(571) = 2.83, and the IAT, d = .33, t(571) = 3.81.

Thus, the results show that the goal-relevant avatar-consequence task can shift momentary behavioral intentions and attitudes. However, it is not clear whether it actually changes behavior. The reason is that Study 4 was underpowered with only 92 participants in each cell and the snack eating effect was just significant, p = .018. This finding first needs to be replicated with an adequate sample.

Study 3 aimed to demonstrate that a brief situational manipulation can have lasting effects. To this end, participants completed a brief survey on the next day. The results are reported in Table 3.

This table allows for 10 hypothesis tests. The results are shown in the next table.

I did not include the question about difficulty because it is difficult to say how the situational manipulation should affect it; this item also produced the weakest evidence. The remaining 8 tests showed three significant results. The success rate of 38% is matched by the average observed power, 35%. Thus, once more there is no evidence that QRPs were used to produce significant results. At the same time, the power estimate shows that the study did not have 80% power. One reason is that the average effect size is weaker, d = .22. An a priori power analysis shows that n = 326 participants per cell would be needed to have 80% power. Thus, the actual cell frequencies of n = 99 to 108 were too small to expect consistent results.

The inconsistent results make it difficult to interpret the results. It is possible that the manipulation had a stronger effect on ratings of unhealthy behaviors than on healthy behaviors, but it is also possible that the pattern of means changes in a replication study.

The authors’ conclusion, however, highlights statistically significant results as if non-significant results were theoretically irrelevant.

“Compared with a control training, consequence-based approach-avoidance training (but not typical approach-avoidance training) reduced self-reported unhealthy eating behaviors and increased healthy eating intentions 24 hr after training” (p. 1907).

This conclusion is problematic because the pattern of significant results was not predicted a priori and is strongly influenced by random sampling error. Selecting significant results from a larger set of statistical tests creates selection bias, and the results are unlikely to replicate when studies have low power. This does not mean that the conclusions are false. It only means that the results need to be replicated in a study with adequate power (N = 326 x 3 = 978).

Conclusion

Social psychologists have a long tradition of experimental research that aims to change individuals’ behaviors with situational manipulations in laboratories. In 2011, it became apparent that most results in this tradition lack credibility because researchers used small samples with between-subject designs and reported only significant results. As a result, reported effect sizes are vastly inflated and give a wrong impression of the power of situations. In response to this realization, some social psychologists have embraced open science practices and report all of their results honestly. Bias tests confirmed that this article reported as many significant results as the power of the studies justifies. However, the observed power was lower than the a priori power that the researchers assumed they had when they planned their sample sizes. This is particularly problematic for Studies 3 and 4, which aimed to show that the results last and influence actual behavior.

My recommendation for social psychologists is to take advantage of better designs (within-subject), conduct fewer studies, and to include real behavioral measures in these studies. The problem for social psychologists is that it is now easy to collect data with online samples, but these studies do not include measures of real behavior. The study of real behavior was done with a student sample, but it only had 92 participants per cell, which is a small sample size to detect the small to moderate effects of brief situational manipulations on actual behavior.

APPENDIX

A Comparison of Scientific Doping Tests

Psychological research is often underpowered; that is, studies have a low probability of producing significant results even if the hypothesis is correct, measures are valid, and manipulations are successful. The problem with underpowered studies is that they have too much sampling error to produce a statistically significant signal-to-noise ratio (i.e., effect size relative to sampling error). The problem of low power was first documented by Cohen in 1962 and has persisted to this day.

Researchers continue to conduct underpowered studies because they have found a number of statistical tricks that make it easier to obtain significant results. The problem with these tricks is that they produce significant results that are difficult to replicate and that have a much higher risk of being false positives than the claim p < .05 implies. These statistical tricks are known as questionable research practices (QRPs). John et al. (2012) referred to the use of these QRPs as scientific doping.

Since 2011 it has become apparent that many published results cannot be replicated because they were produced with the help of questionable research practices. This has created a crisis of confidence or a replication crisis in psychology.

In response to the replication crisis, I have developed several methods that make it possible to detect the use of QRPs. These tests can be compared to doping tests in sports. The problem with statistical doping tests is that they require more than one study to detect the use of doping. The more studies are available, the easier it is to detect scientific doping, but often the set of studies is small. Here I examine the performance of several doping tests for a set of six studies.

The Woman in Red: Examining the Effect of Ovulatory Cycle on Women’s Perceptions of and Behaviors Toward Other Women

In 2018, the journal Personality and Social Psychology Bulletin published an article that examined the influence of women’s ovulatory cycle on responses to a woman in a red dress. There are many reasons to suspect that there are no meaningful effects in this line of research. First, it has been shown that the seminal studies on red and attractiveness used QRPs to produce significant results (Francis, 2013). Second, research on women’s cycle has been difficult to replicate (Peperkoorn, Roberts, & Pollet, 2016).

The article reported six studies that measured women’s cycle and manipulated the color of a woman’s dress between subjects. The key hypothesis was an attenuated interaction effect: ovulating women should rate the woman in the red dress more negatively than women who were not ovulating. Table 1 shows the results for the first dependent variable that was reported.

result           DF1  DF2  test.statistic  pval  zval  obs.power  SIG
F(1,62)=3.234      1   62            3.23  0.08  1.77       0.55    1
F(1,205)=3.68      1  205            3.68  0.06  1.91       0.60    1
F(1,125)=0.01      1  125            0.01  0.92  0.10       0.06    0
F(1,125)=3.86      1  125            3.86  0.05  1.95       0.62    1
F(1,188)=3.17      1  188            3.17  0.08  1.77       0.55    1
F(1,533)=3.15      1  533            3.15  0.08  1.77       0.55    1
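The p-values, z-scores, and observed power in this table can be reproduced from the reported F-values; a sketch in R (observed power is computed against the marginal significance criterion, z > 1.645, i.e., two-sided p < .10):

F_val <- c(3.234, 3.68, 0.01, 3.86, 3.17, 3.15)
df2   <- c(62, 205, 125, 125, 188, 533)
pval  <- pf(F_val, 1, df2, lower.tail = FALSE)   # two-sided p-values
zval  <- qnorm(1 - pval / 2)                     # p-values converted to z-scores
obs_power <- pnorm(zval - qnorm(.95))            # power to obtain p < .10
round(cbind(pval, zval, obs_power), 2)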

The pattern of results is peculiar because five of the six results are marginally significant; that is, the p-value is greater than .05 but smaller than .10. This is strange because sampling error should produce more variability in p-values across studies. Why would the p-values always be greater than .05 and never be less than .05? It is also not clear why p-values did not decrease when the researchers started to increase sample sizes from N = 62 in Study 1 to N = 533 in Study 6. As increasing sample sizes decreases sampling error, we would expect test statistics (the ratio of effect size over sampling error) to become stronger and p-values to become smaller. Finally, the observed power of the six studies tends to be around 50%, except for Study 3 with a clearly non-significant result. How is it possible that 5 studies with about a 50% chance of obtaining marginally significant results produced marginally significant results in all 5 cases? Thus, a simple glance at the pattern of results raises several red flags about the statistical integrity of the results. But do formal doping tests confirm this impression?

Incredibility Index

Without the clearly non-significant result in Study 3, we would have 5 significant results with an average observed power of 57%. The incredibility index simply computes the binomial probability of obtaining 5 significant results in 5 attempts when each attempt has a 57% chance of success (Schimmack, 2012). This probability is 6%. Using the median power (55%) produces essentially the same result. This would suggest that QRPs were used. However, the set of studies does include a non-significant result, which reflects a change in publishing norms. Results like these would not have been reported before the replication crisis. And reporting a non-significant result makes the results more credible (Schimmack, 2012).
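The binomial calculation behind this index is straightforward; a sketch in R, using the success probability given above:

pbinom(4, size = 5, prob = 0.57, lower.tail = FALSE)   # probability of 5 out of 5 significant results, ~.06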

With the non-significant result, the average power is 49% and there are now only 5 out of 6 successes. Although there is still a discrepancy (49% power vs. 83% success rate), the probability of this happening by chance is 17%. Thus, there is no strong evidence that QRPs were used.

The problem here is that the incredibility index has low power to detect doping in small sets of studies, unless all results are significant. Even a single non-significant result makes the observed pattern of results a lot more credible. However, absence of evidence does not mean evidence of absence. It is still possible that QRPs were used, but that the incredibility index failed to detect this.

Test of Insufficient Variance

The test of insufficient variance (TIVA) converts the p-values into z-scores and makes the simplifying assumption that the p-values were obtained from a series of z-tests. This makes it possible to use the standard normal distribution as a model of the sampling error in each study. For a set of independent test statistics that are sampled from a standard normal distribution, the expected variance is 1. However, if QRPs are used to produce significant results, test statistics cluster just above the significance criterion (which is z = 1.65 when marginally significant results, p < .10, are counted). This clustering can be detected by comparing the observed variance in z-scores to the expected variance of 1, using a chi-square test for the comparison of two variances.

Again, it is instructive to focus first on the set of 5 studies with marginally significant results. The variance of z-scores is very low, Var.Z = 0.008, because p-values are confined to the tight range from .05 to .10. The probability of observing this clustering in five studies is p = .0001 or 1 out of 8,892 times. Thus, we would have strong evidence of scientific doping.

However, when we include the non-significant result, the variance increases to Var.Z = 0.507, which is no longer statistically significant in a set of six studies, p = .23. This shows again that a single clearly non-significant result makes the reported results a lot more credible. It also shows that one large outlier makes TIVA insensitive to QRPs, even when they are present.
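A sketch of the TIVA calculation in R, using the rounded z-values from the table above (small discrepancies from the reported probabilities reflect rounding):

tiva <- function(z) pchisq((length(z) - 1) * var(z), df = length(z) - 1)   # left-tail probability of a variance this small
z5 <- c(1.77, 1.91, 1.95, 1.77, 1.77)   # the five marginally significant studies
z6 <- c(z5, 0.10)                       # adding the clearly non-significant study
tiva(z5)   # ~.0001
tiva(z6)   # ~.23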

The Robust Test of Insufficient Variance (formerly known as the Lucky Bounce Test)

The Robust Test of Insufficient Variance (ROTIVA) is less sensitive to outliers than TIVA. It works by defining a region of p-values (or z-scores, or observed powers) that are considered lucky. That is, the result is significant, but not highly convincing. A useful area of lucky outcomes is p-values between .05 and .005, which correspond to power between 50% and 80%. We might say that studies with 80% power are reasonably powered and produce significant results most of the time. However, studies with 50% power are risky because they produce a significant result only in every other study. Thus, getting a significant result is lucky. With two-sided p-values, the interval ranges from z = 1.96 to 2.8. However, when marginal significance is used, the interval ranges from z = 1.65 to 2.49, with a center at 2.07.

Once the area of lucky outcomes is defined, it is possible to specify the maximum probability of observing a lucky outcome; it is obtained by centering the sampling distribution in the middle of the lucky interval and equals 34%.

Thus, the maximum probability of obtaining a lucky significant result in a single study is 34%. This value can be used to compute the probability of obtaining a given number of lucky results in a set of studies using binomial probabilities. With 5 out of 5 studies, the probability is very small, p = .005, but we see that the robust test is not as powerful as TIVA in this situation without outliers. This reverses when we include the outlier. ROTIVA still shows significant evidence of QRPs with 5 out of 6 lucky results, p = .020, whereas TIVA is no longer significant.
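The two binomial probabilities, using the maximum probability of .34 given above:

pbinom(4, size = 5, prob = 0.34, lower.tail = FALSE)   # 5 out of 5 lucky results, ~.005
pbinom(4, size = 6, prob = 0.34, lower.tail = FALSE)   # 5 out of 6 lucky results, ~.020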

Z-Curve 2.0

Z-curve was developed to estimate the replication rate for a set of studies with significant results (Brunner & Schimmack, 2019). As z-curve only selects significant results, it assumes rather than tests the presence of QRPs. The details of z-curve are too complex to discuss here. It is only important to know that z-curve allows for heterogeneity in power and approximates the distribution of significant p-values, converted into z-scores, with a mixture model of folded standard normal distributions. The model parameters are weights for components with low to high power. Although the model is fitted only to significant results, the weights can also be used to make predictions about the distribution of z-scores in the range of non-significant results. It is then possible to examine whether the predicted number of non-significant results matches the observed number of non-significant results.

To use z-curve for sets of studies with marginally significant results, one only needs to adjust the significance criterion from p = .05 (two-tailed) to p = .10 (two-tailed) or from z = 1.96 to z = 1.65. Figure 2 shows the results, including bootstrapped confidence intervals.

The most relevant statistic for the detection of QRPs is the comparison of the observed discovery rate with the expected discovery rate. As for the incredibility index, the observed discovery rate is simply the percentage of studies with significant results (5 out of 6). The expected discovery rate is the area under the gray curve that falls in the range of significant results with z > 1.65. As can be seen, this area is very small, given the estimated sampling distribution from which significant results were selected. The 95%CI for the observed discovery rate has a lower limit of 54%, while the upper limit for the expected discovery rate is 15%. Thus, these intervals do not overlap and are very far from each other, which provides strong evidence that QRPs were used.

Conclusion

Before the replication crisis, it was pretty much certain that articles would only report significant results that support hypotheses (Sterling, 1959). This selection of confirmatory evidence was considered an acceptable practice, although it undermines the purpose of significance testing. In the wake of the replication crisis, I developed tests that can examine whether QRPs were used to produce significant results. These tests work well even in small sets of studies as long as all results are significant.

In response to the replication crisis, it has become more acceptable to publish non-significant results. The presence of clearly non-significant results makes a published article more credible, but it doesn’t automatically mean that QRPs were not used. A new deceptive practice would be to include just one non-significant result to avoid detection by scientific doping tests like the incredibility index or TIVA. Here I show that a second generation of doping tests is able to detect QRPs in small sets of studies even when non-significant results are present. This is bad news for p-hackers and good news for science.

I suggest that journal editors and reviewers make use of these tools to ensure that journals publish only credible scientific evidence. Articles like the one examined here should not be published because they do not report credible scientific evidence. Not publishing such articles is even beneficial for the authors because it spares them damage to their reputation when post-publication peer review reveals the use of QRPs that are no longer acceptable.

References

Francis, G. (2013). Publication bias in “Red, Rank, and Romance in Women Viewing Men” by Elliot et al. (2010). Journal of Experimental Psychology: General, 142, 292-296.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. doi:10.1177/0956797611430953

Peperkoorn, L. S., Roberts, S. C., & Pollet, T. V. (2016). Revisiting the red effect on attractiveness and sexual receptivity: No effect of the color red on human mate preferences. Evolutionary Psychology, 14(4). http://dx.doi.org/10.1177/1474704916673841

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. https://replicationindex.com/2018/02/18/why-most-multiple-study-articles-are-false-an-introduction-to-the-magic-index/

Schimmack, U. (2015). The Test of Insufficient Variance. https://replicationindex.com/2015/05/13/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices-2/

Schimmack, U. (2015). The Lucky Bounce Test. https://replicationindex.com/2015/05/27/when-exact-replications-are-too-exact-the-lucky-bounce-test-for-pairs-of-exact-replication-studies/

Hidden Evidence in Racial Bias Research by Cesario and Johnson

In a couple of articles, Cesario and Johnson have claimed that police officers show a racial bias in the use of force with deadly consequences (Cesario, Johnson, & Terrill, 2019; Johnson, Tress, Burkel, Taylor, & Cesario, 2019). Surprisingly, they claim that police officers in the United States are MORE likely to shoot White civilians than Black civilians. And the differences are not small either. According to their PNAS article, “a person fatally shot by police was 6.67 times less likely (OR = 0.15 [0.09, 0.27]) to be Black than White” (p. 15880). In their SPPS article, they write, “The odds were 2.7 times higher for Whites to be killed by police gunfire relative to Blacks given each group’s SRS homicide reports, 2.6 times higher for Whites given each group’s SRS homicide arrests, 2.9 times higher for Whites given each group’s NIBRS homicide reports, 3.9 times higher for Whites given each group’s NIBRS homicide arrests, and 2.5 times higher for Whites given each group’s CDC death by assault data.” Thus, the authors claim that for every Black civilian killed by police, there are 2 to 6 White civilians killed by police under similar circumstances.

The main problem with Cesario and Johnson’s conclusions is that they rest entirely on the assumption that violent crime statistics are a reasonable estimate of the frequency of encounters with police that may result in the fatal use of force.

“One cannot experience a policing outcome without exposure to police, and if exposure rates differ across groups, then the correct benchmark is on those exposure rates.” (Cesario, Johnson, & Terrill, 2019, p. 587).

“In the context of police shootings, exposure would be reasonably approximated by rates of criminal involvement for Blacks and Whites; the more group members are involved in criminal activity, the more exposure they have to situations in which police shootings would be likely to occur” (p. 587).

The quotes make it clear that Cesario and Johnson use crime statistics as a proxy for encounters with police that sometimes result in the fatal use of force.

What Cesario and Johnson are not telling their readers is that there are much better statistics to estimate how frequently civilians encounter police. I don’t know why Cesario and Johnson did not use this information or share it with their readers. I only know that they are aware that this information exists because they cite an article that made use of it in their PNAS article (Tregle, Nix, & Alpert, 2019). Although Tregle et al. (2019) use exactly the same benchmarking approach as Cesario and Johnson, the results are not mentioned in the SPPS article.

The Police-Public Contact Survey

The Bureau of Justice Statistics has collected data from over 100,000 US citizens about encounters with police. The Police-Public Contact Survey has been conducted in 2002, 2005, 2008, 2011, and 2015. Tregle et al. (2019) used the freely available data to create three benchmarks for fatal police shootings.

First, they estimated that there are 2.5 million police-initiated contacts a year with Black civilians and 16.6 million police-initiated contacts a year with White civilians. This is a ratio of 1:6.5, which is slightly bigger than the ratio of Black to White citizens in the population (39.9 million vs. 232.9 million), 1:5.8. Thus, there is no evidence that Black civilians have disproportionally more encounters with police than White civilians. Using either one of these benchmarks still suggests that Black civilians are more likely to be shot than White civilians by a ratio of 3:1.
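The two ratios can be checked directly from the figures given above (in millions):

contacts   <- c(black = 2.5,  white = 16.6)    # police-initiated contacts per year
population <- c(black = 39.9, white = 232.9)   # number of citizens
contacts["white"] / contacts["black"]          # ~6.6, i.e., roughly 1:6.5
population["white"] / population["black"]      # ~5.8, i.e., roughly 1:5.8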

One reason for the proportionally higher rate of police encounters for White civilians is that they drive more than Blacks, which leads to more traffic stops for Whites. Here the ratio is 2.0 million to 14.0 million, or 1:7. The picture changes for street stops, with a ratio of 0.5 million to 2.6 million, 1:4.9. But even this ratio still implies that Black civilians are at greater risk of being fatally shot during a street stop, with an odds-ratio of 2.55:1.

It is telling that Cesario and Johnson are aware of an article that came to the opposite conclusion based on a different approach to estimating police encounters and do not mention this finding in their article. Apparently it was more convenient to ignore this inconsistent evidence and to tell their readers that the data consistently show no anti-Black bias. While readers who are not scientists may be shocked by this omission of inconvenient evidence, scientists are all too familiar with this deceptive practice of cherry-picking, which is eroding trust in science.

Encounters with Threats and Use of Force

Cesario and Johnson are likely to argue that it is wrong to use police encounters as a benchmark and that violent crime statistics are more appropriate because police officers mostly use force in encounters with violent criminals. However, this is simply an assumption that is not supported by evidence. For example, it is questionable to use homicide statistics because homicide arrests account for only a small portion of incidents of fatal use of force.

A more reasonable benchmark is incidents of non-fatal use of force. The PPCS data make it possible to use this benchmark because respondents also report on the nature of their contact with police, including the use of force. It is not even necessary to download and analyze the data because Hyland et al. (2015) already reported on racial disparities in incidents that involved threats or non-fatal use of force (see Table 2; Table 1 in Hyland et al., 2015).

The crucial statistic is that there are 159,100 encounters with Black civilians and 445,500 encounters with White civilians that involve threats or use of force; a ratio of 1:2.8. Using non-fatal encounters as a benchmark for fatal encounters still results in a greater probability of a Black civilian being killed than a White civilian, although the disparity is now down to a more modest ratio of 1.4:1.

It is not clear why Cesario and Johnson did not make use of a survey that was designed to measure police encounters when they are trying to estimate racial disparities in police encounters. What is clear is that these data exist and that they lead to a dramatically different conclusion than the surprising results in Cesario and Johnson’s analyses that rely on violent crime statistics to estimate police encounters.

Implications

It is important to keep in mind that the racial disparity in the fatal use of force in the population is 3:1 (Tregle et al., 2019, Table 1). The evidence from the PPCS only helps to shed light on the factors that contribute to this disparity. First, Black civilians are not considerably more likely to have contact with police than White civilians. Thus, it is simply wrong to claim that different rates of contact with police explain racial disparities in fatal use of force. There is also no evidence that Black civilians are disproportionally more likely to be stopped by police while driving. The caveat here is that Whites might drive more and that there could be a racial bias in traffic stops after taking the amount of driving into account. This simply shows how difficult it is to draw conclusions about racial bias based on these kinds of data. However, the data do show that the racial disparity in fatal use of force cannot be attributed to more traffic stops of Black drivers. Even the ratio of street stops is not notably different from the population ratio.

The picture changes when threats and use of force are taken into account. Black civilians are 2.5 times more likely than White civilians to have an encounter that involves threats or use of force (3.5% vs. 1.4%; see Table 2; Table 1 from Hyland et al., 2015).

These results shed some light on an important social issue, but these numbers also fail to answer important questions. First of all, they do not answer questions about the reasons why officers use threats and force more often with Black civilians. Sometimes the use of force is justified, and some respondents of the PPCS even acknowledged that the use of force in their encounter was justified. However, at other times the use of force is excessive. The incidence rates in the PPCS are too small to draw firm conclusions about this important question.

Unfortunately, social scientists are under pressure to publish to build their careers, and they are under pressure to present strong conclusions to get their manuscripts accepted. This pressure can lead researchers to make bigger claims than their data justify. This is the case with Cesario and Johnson’s claim that officers have a strong bias to use deadly force more frequently with White civilians than with Black civilians. This claim is not supported by strong data. Rather, it rests entirely on the use of violent crime statistics to estimate police encounters. Here I show that this approach is questionable and that different results are obtained with other reasonable approaches to estimating racial differences in police encounters.

Unfortunately, Cesario and Johnson are not able to see how offensive their claims are to family members of innocent victims of deadly use of force when they attribute the use of force to violent crime, which implies that the use of force was justified and that the victims were all criminals who threatened police with a weapon. Even if the vast majority of cases are justified and fatal use of force was unavoidable, it is well known that this is not always the case. Research on fatal use of force would be less important if police officers never made mistakes in the use of force. Cesario and Johnson receive taxpayer money to fund their research because fatal use of force is sometimes unnecessary and unjustified. It is those cases that require an explanation and interventions that minimize the unnecessary use of force. To use taxpayers’ money to create the false impression that fatal use of force is always justified and that police officers are more afraid of using force with Black civilians than they are afraid of Black civilians is not helpful and is offensive to the families of innocent Black victims who are grieving a loved one. The families of Tamir Rice, Atatiana Jefferson, Eric Garner, and Philando Castile, to name a few, deserve better.

Police Officers are not Six Times more Likely to Shoot White Civilians than Black Civilians: A Coding Error in Johnson et al. (2019)

Rickard Carlsson and I submitted a letter to the Proceedings of the National Academy of Sciences. The format allows only 500 words (PDF). Here is the long version of our concerns about Johnson et al.’s PNAS article on racial disparities in police shootings. An interesting question for meta-psychologists is why the authors and reviewers did not catch an error that led to the implausible result that police officers are six times more likely to shoot White civilians than Black civilians when they feel threatened by a civilian.

Police Officers are not Six Times more Likely to Shoot White Civilians than Black Civilians: A Coding Error in Johnson et al. (2019)

Ulrich Schimmack, University of Toronto Mississauga
Rickard Carlsson, Linnaeus University

The National Academy of Sciences (NAS) was founded in 1863 by Abraham Lincoln to provide independent, objective advice to the nation on matters related to science and technology (1).  In 1914, NAS established the Proceedings of the National Academy of Sciences (PNAS) to publish scientific findings of high significance.  In 2019, Johnson, Tress, Burkel, Taylor, and Cesario published an article on racial disparities in fatal shootings by police officers in PNAS (2).  Their publication became the topic of a heated exchange in the Oversight Hearing on Policing Practices in the House Committee on the Judiciary on September 19, 2019. Heather Mac Donald cited the article as evidence that there is no racial disparity in fatal police shootings. Based on the article, she also claimed, “In fact, black civilians are shot less, compared with whites, than their rates of violent crime would predict” (3). Immediately after her testimony, Phillip Atiba Goff challenged her claims and pointed out that the article had been criticized (4). In a rebuttal, Heather Mac Donald cited Johnson’s statement from the authors’ response that they stand by their finding (5).  Here we show that the authors’ conclusions are based on a statistical error in their analyses.

The authors relied on the Guardian’s online database of fatal use of force (7). The database covers 1,146 incidents in 2015. One aim of the authors’ research was to examine the influence of officers’ race on the use of force. However, because most officers are White, they found only 12 incidents (N = 12, 5%) in which a Black citizen was fatally shot by a Black officer. This makes it impossible to obtain statistically reliable estimates of the effect of officers’ race. In addition, the authors examined racial disparities in fatal shootings with regression models that related victims’ race to characteristics of victims, officers, and counties. The results showed that “a person fatally shot by police was 6.67 times less [italics added] likely (OR = 0.15 [0.09, 0.27]) to be Black than White” (p. 15880). This finding would imply that for every case of fatal use of force involving a Black citizen like Eric Garner or Tamir Rice, there should be six similar cases involving White citizens. The authors explain this finding with depolicing; that is, officers may be “less likely to fatally shoot Black civilians for fear of public and legal reprisal” (p. 15880). The authors also conducted several additional analyses that are reported in their supplementary materials. However, they claim that their results are robust and “do not depend on which predictors are used” (p. 15881). We show that all of these statements are invalidated by a coding mistake in their statistical model.
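To see where the factor of 6.67 comes from, note that an odds ratio below 1 can simply be inverted to express the same disparity in the other direction; this is just the reciprocal of the published estimate and its confidence limits, not an additional analysis:

OR(White/Black) = 1 / OR(Black/White) = 1 / 0.15 = 6.67, with a 95%CI of 1/0.27 to 1/0.09, that is, roughly 3.7 to 11.1.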

Table 1
Racial Disparity in Race of Fatally Shot Civilians

Model   County predictor(s)             Odds ratio (Black/White), 95% CI
M1      Homicide rates                  0.31 (0.23, 0.42)
M2      Population rates                2.03 (1.21, 3.41)
M3      Population & homicide rates     0.89 (0.44, 1.80)

The authors did not properly code categorical predictor variables. In a reply, the authors acknowledged this mistake and redid the analyses with proper weighted effect coding of the categorical variables (6). Their new results are reported in Table 1. The corrected results show that the choice of predictor variables does have a strong influence on the conclusions. In the model that uses only homicide rates as a predictor (M1), the intercept still shows a strong anti-White bias, with about 3 White civilians killed for every Black civilian in a county with equal proportions of Black and White citizens. In the second model, with population proportions as the predictor (M2), the data show anti-Black bias. When both predictors are used (M3), the data show parity, but with a wide margin of error that ranges from a ratio of 2 White civilians for every Black civilian to 2 Black civilians for every White civilian. Thus, after correcting the statistical mistake, the results are no longer consistent across models, and it becomes important to examine which of these models should be used to make claims about racial disparities.
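To illustrate why the coding of predictors matters, here is a minimal sketch with simulated data (not the authors’ dataset; the variable names and numbers are hypothetical). In a logistic regression, exp(intercept) gives the odds that a fatally shot civilian is Black rather than White at the point where all predictors equal zero; with uncentered predictors, that point can be a county that does not exist in the data, whereas mean-centered (or, for categorical variables, weighted effect-coded) predictors make the intercept reflect an average county.

```python
# Minimal sketch with simulated data (not the authors' dataset): the coding of a
# predictor changes what the intercept of a logistic regression refers to,
# and hence the "baseline" odds of a Black vs. White victim that it implies.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=1)
n = 5000
homicide_rate = rng.gamma(shape=2.0, scale=3.0, size=n)        # hypothetical county-level predictor
p_black = 1 / (1 + np.exp(-(-1.5 + 0.15 * homicide_rate)))     # hypothetical data-generating model
victim_black = rng.binomial(1, p_black)                        # 1 = Black victim, 0 = White victim

df = pd.DataFrame({
    "victim_black": victim_black,
    "homicide_rate": homicide_rate,
    "homicide_rate_c": homicide_rate - homicide_rate.mean(),   # centered version of the same predictor
})

raw = smf.logit("victim_black ~ homicide_rate", data=df).fit(disp=0)
centered = smf.logit("victim_black ~ homicide_rate_c", data=df).fit(disp=0)

# Same model fit and same slope, but the intercepts answer different questions:
# raw:      odds of a Black victim in a county with a homicide rate of exactly 0
# centered: odds of a Black victim in a county with an average homicide rate
print("Odds(Black vs. White) at homicide rate = 0:     ", np.exp(raw.params["Intercept"]))
print("Odds(Black vs. White) at average homicide rate: ", np.exp(centered.params["Intercept"]))
```

The point of the sketch is only that the two intercepts refer to very different hypothetical counties, which is why improper coding can make a model appear to show a large disparity at an uninterpretable baseline.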

We argue that it is necessary to include population proportions in the model. After all, there are many counties in the dataset with predominantly White populations and no shootings of Black civilians. This is not surprising: for officers to encounter and fatally shoot a Black resident, there have to be Black residents in the county. To ignore the demographics would be a classic statistical mistake that can lead to false conclusions, akin to the famous example used to teach the difference between correlation and causation. In this example, it appears as if Christians commit more homicides because homicide rates are positively correlated with the number of churches. This inference is wrong because the correlation between churches and homicides simply reflects the fact that counties with larger populations have more churches and more homicides. Thus, the model that uses only population ratios as a predictor is useful because it tells us whether White or Black people are shot more often than we would expect if race were unrelated to police shootings. Consistent with other studies, including an article by the same authors, we see that Black citizens are shot disproportionately more often than White citizens (8, 9).
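The churches-and-homicides confound is easy to reproduce with a toy simulation (our own illustrative example with made-up numbers, not data from the article): when both counts scale with county population, they correlate strongly, and the correlation disappears once population is taken into account.

```python
# Toy simulation of the churches-and-homicides confound (made-up numbers).
import numpy as np

rng = np.random.default_rng(seed=2)
population = rng.lognormal(mean=10, sigma=1, size=1000)   # hypothetical county populations
churches = rng.poisson(population / 1000)                 # churches roughly proportional to population
homicides = rng.poisson(population / 5000)                # homicides roughly proportional to population

# Raw counts correlate strongly because both scale with population size.
print(np.corrcoef(churches, homicides)[0, 1])

# Expressed as per-capita rates, the spurious association disappears.
print(np.corrcoef(churches / population, homicides / population)[0, 1])
```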

The next question that a scientific study of police shootings can examine is why racial disparities in police shootings exist. Importantly, answering this question does not make racial disparities disappear. Even if Black citizens are shot more often because they are more often involved in crime, as the authors claim, the racial disparity still exists. It does not disappear, nor does this explanation account for incidents like the deaths of Eric Garner or Tamir Rice. However, the authors’ conclusion that “racial disparity in fatal shootings is explained by non-Whites’ greater exposure to the police through crime” (p. 15881) is invalid for several reasons.

First of all, the corrected results for the model that takes homicide rates and population rates into account no longer provide conclusive evidence about racial disparities. The data still allow for a racial disparity in which Black civilians are shot at twice the rate of White civilians. Moreover, this model ignores the authors’ own finding that victims’ age is a significant predictor of victims’ race. Parity is obtained at the average age of 37, but the age effect implies that 20-year-old victims are significantly more likely to be Black, OR(B/W) = 3.26, 95%CI = 1.26 to 8.43, while 55-year-old victims are significantly more likely to be White, OR(B/W) = 0.24, 95%CI = 0.08 to 0.71. Thus, even when homicide rates are included in the model, the authors’ data are consistent with the public perception that officers are more likely to use force with young Black men than with young White men.
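As a rough back-of-the-envelope check (our own illustration under the simplifying assumption of a linear age effect on the log-odds scale; this is not an analysis reported in the article), parity at age 37 and an odds ratio of 3.26 at age 20 imply an odds ratio at age 55 in the same ballpark as the reported 0.24:

```python
# Back-of-the-envelope sketch assuming a linear age effect on the log-odds scale.
import numpy as np

parity_age = 37
slope = np.log(3.26) / (parity_age - 20)   # change in log-odds per year implied by OR = 3.26 at age 20

def implied_or(age):
    """Implied odds that a fatally shot civilian is Black (vs. White) at a given age."""
    return np.exp(slope * (parity_age - age))

print(implied_or(20))   # 3.26 by construction
print(implied_or(37))   # 1.0 (parity at the average age)
print(implied_or(55))   # ~0.29, close to the reported 0.24
```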

The second problem is that the model does not include other potentially relevant predictor variables, such as poverty rates, and that an analysis across counties is unable to distinguish between actual and spurious predictors because all statistics are highly correlated with counties’ demographics (r > .9).

A third problem is that it is questionable to rely on statistics about homicide victims as a proxy for police encounters. The use of homicide rates implies that most victims of fatal use of force were involved in homicides. However, the incidents in the Guardian database show that many victims were involved in much less severe crimes.

Finally, it is still possible that there is racial disparity in the unnecessary use of force even if fatal incidents are proportional to violent crime. If police encounter more Black people in ambiguous situations because Black people are disproportionately more involved in violent crime, officers would still accidentally shoot more Black citizens than White citizens, even in the absence of racial bias. It is therefore important to distinguish between racial bias of officers and racial disparities in fatal incidents of use of force. Racial bias is only one of several factors that can produce racial disparities in the use of excessive force.

Conclusion

During a hearing on policing practices in the House Committee on the Judiciary, Heather Mac Donald cited Johnson et al.’s (2019) article as evidence that crime accounts for racial disparities in the use of lethal force by police officers and that “black civilians are shot less, compared with whites, than their rates of violent crime would predict.” Our analysis of Johnson et al.’s (2019) article shows that these statements are to a large extent based on a statistical error. Thus, the article cannot be used as evidence that there are no racial disparities in policing or that police officers are even more reluctant to use excessive force with Black civilians than with White civilians. The only lesson we can learn from this article is that social scientists make mistakes and that pre-publication peer review alone does not ensure that these mistakes are caught and corrected. It is puzzling that the authors and reviewers did not detect a statistical mistake when the results implied that police officers fatally shoot six White civilians for every Black civilian. It was this glaring finding that led us to conduct our own analyses and to detect the mistake. This shows the importance of post-publication peer review in ensuring that scientific information that informs public policy is as objective and informative as it can be.

References

1. National Academy of Sciences. Mission statement. http://www.nasonline.org/about-nas/mission/

2. Johnson, D. J., Tress, T., Burke, N., Taylor, C., & Cesario, J. (2019). Officer characteristics and racial disparities in fatal officer-involved shootings. Proceedings of the National Academy of Sciences, 116(32), 15877–15882.

3. Mac Donald, H. (2019). False testimony. https://www.city-journal.org/police-shootings-racial-bias

4. Knox, D., & Mummolo, J. (2019). Making inferences about racial disparities in police violence. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3431132

5. Johnson, D. J., & Cesario, J. (2019). Reply to Knox and Mummolo: Critique of Johnson et al. (2019). https://psyarxiv.com/dmhpu/

6. Johnson, D. J., & Cesario, J. (2019). Reply to Schimmack: Critique of Johnson et al. (2019).

7. “The Counted.” The Guardian. https://www.theguardian.com/us-news/ng-interactive/2015/jun/01/the-counted-police-killings-us-database#

8. Cesario, J., Johnson, D. J., & Terrill, W. (2018). Is there evidence of racial disparity in police use of deadly force? Analyses of officer-involved fatal shootings in 2015–2016. Social Psychological and Personality Science, 10, 586–595.

9. Edwards, F., Lee, H., & Esposito, M. (2019). Risk of being killed by police use of force in the United States by age, race-ethnicity, and sex. Proceedings of the National Academy of Sciences, 116(34), 16793–16798. https://doi.org/10.1073/pnas.1821204116