
Z-Curve: An even better p-curve

Abstract

P-Curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis (evidential value). Here I show that p-curve has serious limitations and provides misleading information about the strength of evidence against the null-hypothesis.

I show that all of the information provided by a p-curve analysis is also provided by a z-curve analysis. Moreover, z-curve provides additional information about the presence of selection bias and the risk of false positive results. I also show how alpha levels can be adjusted to separate significant results with weak evidence from those with strong evidence and to identify credible findings even when selection for significance is present.

As z-curve does everything that p-curve does and more, the rational choice is to use z-curve for the meta-analysis of p-values.

Introduction

In 2011, it dawned on psychologists that something was wrong with their science. Daryl Bem had just published an article with nine studies that showed an incredible finding. Participants’ responses were influenced by random events that had not yet occurred. Since then, the flaws in research practices have become clear and it has been shown that they are not limited to mental time travel (Schimmack, 2020). For decades, psychologists assumed that statistically significant results reveal true effects and reported only statistically significant results (Sterling, 1959). However, selective reporting of significant results undermines the purpose of significance testing to distinguish true and false hypotheses. If only significant results are reported, most published results could be false positive results like those reported by Bem (2011).

Selective reporting of significant results also undermines the credibility of meta-analyses (Rosenthal, 1979), which explains why meta-analyses also suggest humans possess psychic abilities (Bem & Honorton, 1994). This sad state of affairs stimulated renewed interest in methods that detect selection for significance (Schimmack, 2012) and methods that correct for publication bias in meta-analyses. Here I focus on a comparison of p-curve (Simonsohn et al., 2014a, Simonsohn et al., 2014b) and z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020).

P-Curve

P-curve is a name for a family of statistical tests that have been combined into the p-curve app, which researchers can use to conduct p-curve analyses, henceforth called p-curve. The latest version of p-curve is 4.06, which was last updated on November 30, 2017 (p-curve.com).

The first part of a p-curve analysis is a p-curve plot. A p-curve plot is a histogram of all significant p-values where p-values are placed into five bins, namely p-values ranging from 0 to .01, .01 to .02, .02 to .03, .03 to .04, and .04 to .05. If the set of studies contains mostly studies with true effects that have been tested with moderate to high power, the plot shows decreasing frequencies as p-values increase (more p-values between 0 and .01 than between .04 and .05). This pattern has been called a right-skewed distribution by the p-curve authors. If the distribution is flat or reversed (more p-values between .04 and .05 than between 0 and .01), most p-values may be false positive results.
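To make the binning concrete, here is a minimal R sketch (not the code of the p-curve app) that sorts a set of hypothetical significant p-values into the five bins of a p-curve plot:

# Hypothetical p-values used only for illustration
p_values <- c(.003, .008, .012, .024, .031, .041, .049, .002)

# Keep the significant results and assign them to the five p-curve bins
sig  <- p_values[p_values < .05]
bins <- cut(sig, breaks = c(0, .01, .02, .03, .04, .05),
            labels = c("0-.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05"))

# Percentage of significant p-values in each bin (the bars of a p-curve plot)
round(100 * table(bins) / length(sig))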

The main limitation of p-curve plots is that it is difficult to evaluate ambiguous cases. To aid in the interpretation of p-curve plots, p-curve also provides statistical tests of evidential value. One test is a significance test against the null-hypothesis that all significant p-values are false positive results. If this null-hypothesis can be rejected with the traditional alpha criterion of .05, it is possible to conclude that at least some of the significant results are not false positives.

The main problem with significance tests is that they do not provide information about effect sizes. A right-skewed p-curve with a significant test result may be due to weak evidence with many false positive results or to strong evidence with few false positives.

To address this concern, the p-curve app also provides an estimate of statistical power. This estimate assumes that the studies in the meta-analysis are homogeneous because power is a conditional probability under the assumption that an effect is present. Thus, power does not apply to a meta-analysis of studies that contain true positive and false positive results because power is not defined for false positive results.

To illustrate the interpretation of a p-curve analysis, I conducted a meta-analysis of all studies published by Leif D. Nelson, one of the co-authors of p-curve. I found 119 studies with codable data and coded the most focal hypothesis for each of these studies. I then submitted the data to the online p-curve app. Figure 1 shows the output.

Visual inspection of the p-curve plot shows a right-skewed distribution with 57% of the p-values between 0 and .01 and only 6% of p-values between .04 and .05. The statistical test against the null-hypothesis that all of the significant p-values are false positives is highly significant. Thus, at least some of the p-values are likely to be true positives. Finally, the power estimate is very high, 97%, with a tight confidence interval ranging from 96% to 98%. Somewhat redundant with this information, the p-curve app also provides a significance test of the hypothesis that power is less than 33%. This test is not significant, which is not surprising given the estimated power of 97%.

The next part of a p-curve output provides more details about the significance tests, but does not add more information.

The next part provides users with an interpretation of the results.

The interpretation informs readers that this set of p-values provides evidential value. Somewhat surprisingly, this automated interpretation does not mention the power estimate to quantify the strength of evidence. The focus on p-values is problematic because p-values are influenced by the number of tests. The p-value could be lower with 100 studies with 40% power than with 10 studies with 99% power. As significance tests are redundant with confidence intervals, it is sufficient to focus on the confidence interval of the power estimate. With a 90% confidence interval ranging from 96% to 98%, we would be justified in concluding that this set of p-values provides strong support for the hypotheses tested in Nelson’s articles.

Z-Curve

Like p-curve, a z-curve analysis also starts with a plot of the p-values. The main difference is that p-values are converted into z-scores using the inverse normal distribution: z = qnorm(1 - p/2). The second difference is that both significant and non-significant p-values are plotted. The third difference is that z-curve plots have a much finer resolution than p-curve plots. Whereas p-curve bins all z-scores from 2.58 to infinity into one bin (p < .01), z-curve uses the information about the distribution of z-scores all the way up to z = 6 (p = .000000002, or about 1/500,000,000).
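In R, this conversion is a one-liner. The following sketch uses made-up p-values to illustrate the transformation and the two reference points mentioned above (the z-curve model itself is fit with dedicated software, such as the zcurve R package, and is not shown here):

# Convert two-sided p-values to absolute z-scores (made-up example values)
p <- c(.05, .01, .005, .0001)
z <- qnorm(1 - p / 2)
round(z, 2)          # 1.96 2.58 2.81 3.89

# Reference points used in the z-curve plot
qnorm(1 - .05 / 2)   # significance threshold: z = 1.96
2 * (1 - pnorm(6))   # z = 6 corresponds to p of about 2e-9, i.e., ~1/500,000,000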

Visual inspection of the z-curve plot reveals something that the p-curve plot does not show, namely clear evidence for the presence of selection bias. Whereas p-curve suggests that “highly” significant results (0 to .01) are much more common than “just” significant results (.04 to .05), z-curve shows that just significant results (p between .05 and .005) are much more frequent than highly significant (p < .005) results. The difference is due to the implicit definition of “high” and “low” in the two plots. The high frequency of highly significant (p < .01) results in the p-curve plot is due to the wide range of values that are lumped together into this bin. Once it is clear that many p-values are clustered just below .05 (z > 1.96, the vertical red line), it is immediately notable that there are too few just non-significant (z < 1.96) values. This steep drop is not consistent with random sampling error. To summarize, z-curve plots provide more information than p-curve plots. Whereas z-curve plots make the presence of selection for significance visible, p-curve plots provide no means to evaluate selection bias. Even worse, right-skewed distributions are often falsely interpreted as evidence that there is no selection for significance. This example shows that notably right-skewed distributions can be found even when selection bias is present.

The second part of a z-curve analysis uses a finite mixture model to estimate two statistical parameters of the data. These parameters are called the estimated discovery rate and the estimated replication rate (Bartos & Schimmack, 2021). Another term for these parameters is mean power before selection and mean power after selection for significance (Brunner & Schimmack, 2020). The meaning of these terms is best understood with a simple example in which a researcher tests 100 false hypotheses and 100 true hypotheses with 100% power. These tests produce a mix of significant and non-significant p-values. The expected frequency of significant p-values is 100 for the 100 true hypotheses tested with 100% power and 5 for the 100 false hypotheses, which produce significant results at the alpha rate of 5%. Thus, we expect 105 significant results and 95 non-significant results. Although we know the percentages of true and false hypotheses in this example, this information is not available with real data. Thus, any estimate of average power changes the meaning of power: it now includes false hypotheses with a power equal to alpha. We call this unconditional power to distinguish it from the typical meaning of power, which is conditional on a true hypothesis.

It is now possible to compute mean unconditional power for two populations of studies. The first population consists of all studies that were conducted. In this example, these are all 200 studies (100 true, 100 false hypotheses). The average power for these 200 studies is easy to compute as (.05*100 + 1*100)/200 = 52.5%. The second population consists only of the studies with significant results. After selecting only significant studies, mean unconditional power is (.05*5 + 1*100)/105 = 95.5%. The reason why power is so much higher after selection for significance is that the significance filter keeps most false hypotheses out of the population of studies with a significant result (95 of the 100 false hypotheses, to be exact). Thus, power is mostly determined by the true hypotheses that were tested with perfect power. Of course, real data are not as clean as this simple example, but the same logic applies to all sets of studies with a diverse range of power values for individual studies (Brunner & Schimmack, 2020).
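The two averages can be verified with a few lines of R. The sketch below simply encodes the assumptions of the example (100 false hypotheses with unconditional power equal to alpha, 100 true hypotheses with 100% power); the weighting in the second step reflects that each study enters the set of significant results with a probability equal to its power:

alpha <- .05
power <- c(rep(alpha, 100), rep(1, 100))   # unconditional power of all 200 studies

mean(power)                        # before selection: (.05*100 + 1*100)/200 = 0.525
weighted.mean(power, w = power)    # after selection for significance: ~0.955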

Mean power before selection for significance determines the percentage of significant results for a given number of tests. With 50% mean power before selection, 100 tests are expected to produce 50 significant results (Brunner & Schimmack, 2020). It is common to refer to statistically significant results as discoveries (Soric, 1989). Importantly, discoveries can be true or false, just like a significant result can reflect a true effect or a type-I error. In our example, there were 105 discoveries. Normally we would not know that 100 of these discoveries are true discoveries. All we know is the percentage of significant results. I use the term estimated discovery rate (EDR) because “mean unconditional power before selection” is a mouthful. In short, the EDR is an estimate of the percentage of significant results in a series of statistical tests.

Mean power after selection for significance is relevant because the power of significant results determines the probability that a significant result can be successfully replicated in a direct replication study with the same sample size (Brunner & Schimmack, 2020). Using the EDR for this purpose would be misleading. In the present example, the EDR of 52.5% would dramatically underestimate the replicability of the significant results, which is actually 95.5%. Using the EDR would also punish researchers who conduct high-powered tests of true and false hypotheses. To assess replicability, it is necessary to compute power only for the studies that produced significant results. However, conditioning the data on significance inflates estimates even if the researcher reported all non-significant results, just as selection for significance inflates effect size estimates in traditional meta-analyses. Z-curve models this selection process and corrects for regression to the mean in the estimation of mean unconditional power after selection for significance. I call this statistic the estimated replication rate (ERR) because mean unconditional power after selection for significance determines the percentage of significant results that is expected in direct replications of studies with a significant result. In short, the ERR is the probability that a direct replication study with the same sample size produces a significant result.

I start the discussion of the z-curve results for Nelson’s data with the estimated replication rate because this estimate is conceptually similar to the power estimate in the p-curve analysis. Both estimates focus on the population of studies with significant results and correct for selection for significance. Thus, one would expect similar results. However, the p-curve estimate of 97%, 95%CI = 96% to 98%, is very different from the z-curve estimate of 52%, 95%CI = 40% to 68%. The confidence intervals do not overlap, showing that the difference between these estimates is itself statistically significant.

The explanation for this discrepancy is that p-curve estimates are inflated estimates of the ERR when power is heterogeneous (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). This is true even if effect sizes are homogeneous and studies vary only in sample sizes (Brunner, 2018). The p-curve authors have been aware of this problem since 2018 (Datacolada), but have not updated the p-curve app in response to this criticism. The present example shows that using the p-curve app can lead to extremely misleading conclusions. Whereas p-curve suggests that nearly every study by Nelson would produce a significant result again in a direct replication attempt, the correct z-curve estimate suggests that only every other result would replicate successfully. This difference is not only statistically significant, but also practically significant for the evaluation of Nelson’s work.

In sum, p-curve is not only redundant with z-curve; it also produces false information about the strength of evidence in a set of p-values.

Unlike p-curve, z-curve 2.0 also estimates the discovery rate based on the distribution of the significant p-values. The results are shown in Figure 2 as the grey curve in the range of non-significant results. As can be seen, z-curve predicts a large number of non-significant results, whereas the actual studies reported very few non-significant results. This discrepancy suggests selection for significance. To quantify the amount of selection bias, it is possible to compare the observed discovery rate (i.e., the actual percentage of significant results), 87%, to the estimated discovery rate, EDR = 27%. The 95% confidence interval around the EDR can be used for a significance test. As 87% is well outside the 95%CI of the EDR, 5% to 51%, the results provide strong evidence that the reported results were selected from a larger set of tests with non-significant results that were not reported. In this specific case, this inference is consistent with the authors’ admission that questionable research practices were used (Simmons, Nelson, & Simonsohn, 2011).

“Our best guess was that so many published findings were false because researchers were conducting many analyses on the same data set and just reporting those that were statistically significant, a behavior that we later labeled “p-hacking” (Simonsohn, Nelson, & Simmons, 2014). We knew many researchers—including ourselves—who readily admitted to dropping dependent variables, conditions, or participants to achieve significance.” (Simmons, Nelson, & Simonsohn, 2018, p. 255)

The p-curve authors also popularized the idea that selection for significance may have produced many false positive results (Simmons et al., 2011). However, p-curve does not provide an estimate of the false positive risk. In contrast, z-curve provides this information because the false discovery risk is a direct function of the discovery rate. Using the EDR with Soric’s formula shows that the false discovery risk for Nelson’s studies is 14%, but due to the small number of tests, the 95%CI around this estimate ranges from 5% to 100%. Thus, even though the ERR suggests that half of the studies can be replicated, it is possible that the other half of the studies contains a fairly large number of false positive results. Without the identification of moderator variables, it would be impossible to say whether a result is a true or a false discovery.
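The calculation uses Soric’s (1989) formula for the maximum false discovery rate implied by a given discovery rate, FDR = (1/DR - 1) * alpha / (1 - alpha). A minimal R version (the function name soric_fdr is just an illustrative label, not part of any package), applied to the EDR of 27% reported above:

# Soric's upper bound on the false discovery rate for a given discovery rate
soric_fdr <- function(edr, alpha = .05) {
  (1 / edr - 1) * alpha / (1 - alpha)
}

round(soric_fdr(.27), 2)   # 0.14, i.e., a false discovery risk of about 14%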

The ability to estimate the false positive risk makes it possible to identify a subset of studies with a low false positive risk by lowering alpha. Lowering alpha reduces the false positive risk for two reasons. First, a lower alpha directly produces a lower false positive rate. In the prior example with 100 true and 100 false hypotheses, an alpha of 5% produced 105 significant results that included 5 false positive results, so the false positive rate was 5/105 = 4.76%. Lowering alpha to 1% produces only 101 significant results, and the false positive rate is 1/101 ≈ 0.99%. Second, questionable research practices are much more likely to produce false positive results with alpha = .05 than with alpha = .01.
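The first point can be verified with a short R sketch that recomputes the false positive rate among discoveries for the 100-true/100-false example at both alpha levels. It keeps the example’s assumption that the true hypotheses are tested with 100% power, so the number of true discoveries does not change when alpha is lowered (the function name fp_rate is just an illustrative label):

# Expected share of false positives among discoveries
fp_rate <- function(alpha, n_true = 100, n_false = 100, pwr = 1) {
  false_pos <- alpha * n_false   # expected false positive results
  true_pos  <- pwr * n_true      # expected true positive results
  false_pos / (false_pos + true_pos)
}

round(fp_rate(.05), 3)   # 0.048: 5 of 105 discoveries are false positives
round(fp_rate(.01), 3)   # 0.010: 1 of 101 discoveries is a false positive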

In a z-curve analysis, alpha can be set to different values to examine the false positive risk. A reasonable criterion is to aim for a false discovery risk of 5%, which many psychologists falsely assume is the goal of setting alpha to 5%. For Nelson’s 109 publications, alpha can be lowered to .01 to achieve a false discovery risk of 5%.

With alpha = .01, there are still 60 out of 119 (50%) significant results. It is therefore not necessary to dismiss all of the published results just because some of them were obtained with questionable research practices.

For Nelson’s studies, a plausible moderator is timing. As Nelson and colleagues reported, he used QRPs before he himself drew attention to the problems with these practices. In response, he may have changed his research practices. To test this hypothesis, it is possible to fit separate z-curve models to articles published up to 2012 and after 2012 (due to publication lag, articles published in 2012 are likely to still contain QRPs).

Consistent with this hypothesis, the EDR for 2012 and earlier is only 11%, 95%CI = 5% to 31%, and the false discovery risk increases to 42%, 95%CI = 12% to 100%. Even with alpha = .01, the false discovery risk is still 11%, and with alpha = .005 it is still 10%. With alpha = .001, it is reduced to 2%, and 18 results remain significant. Thus, most of the published results from this period lack credible evidence against the null-hypothesis.

Results look very different after 2012. The EDR is 83% and does not differ significantly from the ODR, providing no evidence that selection for significance occurred. The high EDR implies a low false discovery risk even with the conventional alpha criterion of 5%. Thus, all 40 results with p < .05 provide credible evidence against the null-hypothesis.

To see how misleading p-curve can be, I also conducted a p-curve analysis of the studies published in the years up to 2012. The p-curve analysis merely shows that the studies have evidential value and provides a dramatically inflated estimate of power (84% vs. 35%). It does not show that p-values were selected for significance, and it does not provide the information needed to distinguish p-hacked studies from studies with evidential value.

Conclusion

P-Curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis (evidential value). Here I showed that p-curve has serious limitations and provides misleading information about the strength of evidence against the null-hypothesis.

I showed that all of the information provided by a p-curve analysis is also provided by a z-curve analysis. Moreover, z-curve provides additional information about the presence of selection bias and the risk of false positive results. I also showed how alpha levels can be adjusted to separate significant results with weak evidence from those with strong evidence and to identify credible findings even when selection for significance is present.

As z-curve does everything that p-curve does and more, the rational choice is to use z-curve for the meta-analysis of p-values.

When DataColada kissed Fiske’s ass to publish in Annual Review of Psychology

One of the worst articles about the decade of replication failures is the “Psychology’s Renaissance” article by the datacolada team (Leif Nelson, Joseph Simmons, & Uri Simonsohn).

This is not your typical Annual Review article that aims to give a review of developments in the field. It is an opinion piece filled with bold claims that lack empirical evidence.

The worst claim is that p-hacking is so powerful that pretty much every study can be made to work.

Experiments that work are sent to a journal, whereas experiments that fail are sent to the file drawer (Rosenthal 1979). We believe that this “file-drawer explanation” is incorrect. Most failed studies are not missing. They are published in our journals, masquerading as successes.

We can all see that not publishing failed studies is a bit problematic. Even Bem’s famous manual for p-hackers warned that it is unethical to hide contradictory evidence: “The integrity of the scientific enterprise requires the reporting of disconfirming results” (Bem). Thus, the idea that researchers are sitting on a pile of failed studies that they failed to disclose makes psychologists look bad, and we can’t have that in Fiske’s Annual Review of Psychology journal. So psychologists must have been doing something that is not dishonest and can be sold as normal science.

“P-hacking is the only honest and practical way to consistently get underpowered studies to be statistically significant. Researchers did not learn from experience to increase their sample sizes precisely because their underpowered studies were not failing.” (p. 515).

This is utter nonsense. First, researchers have file-drawers of studies that did not work. Just ask them and they may tell you that they do.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

Leading social psychologists Gilbert and Wilson provide an even more detailed account of their research practices, which produced many non-significant results that were not reported (a.k.a. a file drawer). Their account has been preserved thanks to Greg Francis.

First, it’s important to be clear about what “publication bias” means. It doesn’t mean that anyone did anything wrong, improper, misleading, unethical, inappropriate, or illegal. Rather it refers to the well known fact that scientists in every field publish studies whose results tell them something interesting about the world, and don’t publish studies whose results tell them nothing. Francis uses sophisticated statistical tools to discover what everyone already knew—and what he could easily have discovered simply by asking us. Yes, of course we ran some studies on “consuming experience” that failed to show interesting effects and are not reported in our JESP paper. Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know. Failed studies are often (though not always) inconclusive, which is why they are often (but not always) unpublishable. So yes, we had to mess around for a while to establish a paradigm that was sensitive and powerful enough to observe the effects that we had hypothesized. In one study we might have used foods that didn’t differ sufficiently in quality, in another we might have made the metronome tick too fast for people to chew along. Exactly how good a potato chip should be and exactly how fast a person can chew it are the kinds of mundane things that scientists have to figure out in preliminary testing, and they are the kinds of mundane things that scientists do not normally report in journals (but that they informally share with other scientists who work on similar phenomenon). Looking back at our old data files, it appears that in some cases we went hunting for potentially interesting mediators of our effect (i.e., variables that might make it larger or smaller) and although we replicated the effect, we didn’t succeed in making it larger or smaller. We don’t know why, which is why we don’t describe these blind alleys in our paper. All of this is the hum-drum ordinary stuff of day-to-day science.

Aside from this anecdotal evidence, the datacolada crew actually had access to empirical evidence in an article that they cite, but maybe never read. An important article in the 2010s reported a survey of research practices (John, Loewenstein, & Prelec, 2012). The survey asked about several questionable research practices, including not reporting entire studies that failed to support the main hypothesis.

Not reporting studies that “did not work” was the third most frequently used QRP. Unfortunately, this result contradicts datacolada’s claim that there are no studies in file-drawers, so they ignore this inconvenient empirical fact to tell their fairy tale of honest p-hackers who didn’t know better until 2011, when they published their famous “False Positive Psychology” article.

This is a cute story that isn’t supported by evidence, but that has never stopped psychologists from writing articles that advance their own careers. The beauty of review articles is that you don’t even have to p-hack data. You just pick and choose citations or make claims without evidence. As long as the editor (Fiske) likes what you have to say, it will be published. Welcome to psychology’s renaissance; same bullshit as always.