A Comparison of Scientific Doping Tests

Psychological research is often underpowered; that is, studies have a low probability of producing significant results even if the hypothesis is correct, measures are valid, and manipulations are successful. The problem with underpowered studies is that they have too much sampling error to produce a statistically significant signal-to-noise ratio (i.e., effect size relative to sampling error). The problem of low power was first observed in 1962 by Cohen and has persisted to this day.

Researchers continue to conduct underpowered studies because they have found a number of statistical tricks to increase power. The problem with these tricks is that they produce significant results that are difficult to replicate and that have a much higher risk of being false positives than the claim p < .05 implies. These statistical tricks are known as questionable research practices (QRPs). John et al. (2012) referred to the use of these QRPs as scientific doping.

Since 2011 it has become apparent that many published results cannot be replicated because they were produced with the help of questionable research practices. This has created a crisis of confidence or a replication crisis in psychology.

In response to the replication crisis, I have developed several methods that make it possible to detect the use of QRPs. It is possible to compare these tests to doping tests in sports. The problem with statistical doping tests is that they require more than one sample to detect the use of doping. The more studies are available, the easier it is to detect scientific doping, but often the set of studies is small. Here I examine the performance of several doping tests for a set of six studies.

The Woman in Red: Examining the Effect of Ovulatory Cycle on Women’s Perceptions of and Behaviors Toward Other Women

In 2018, the journal Personality and Social Psychology Bulletin published an article that examined the influence of women’s cycle on responses to a woman in a red dress. There are many reasons to suspect that there are no meaningful effects in this line of research. First, it has been shown that the seminal studies on red and attractiveness used QRPs to produce significant results (Francis, 2013). Second, research on women’s cycle has been difficult to replicate (Peperkoorn, Roberts, & Pollet, 2016).

The article reported six studies that measured women’s cycle and manipulated the color of a woman’s dress between subjects. The key hypothesis was an attenuated interaction effect. That is, ovulating women should rate the woman in the red dress more negatively than women who were not ovulating. Table 1 shows the results for the first dependent variable that was reported.

result          DF1  DF2  test.statistic  pval  zval  obs.power  SIG
F(1,62)=3.23    1    62   3.23            0.08  1.77  0.55       1
F(1,205)=3.68   1    205  3.68            0.06  1.91  0.60       1
F(1,125)=0.01   1    125  0.01            0.92  0.10  0.06       0
F(1,125)=3.86   1    125  3.86            0.05  1.95  0.62       1
F(1,188)=3.17   1    188  3.17            0.08  1.77  0.55       1
F(1,533)=3.15   1    533  3.15            0.08  1.77  0.55       1
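The obs.power column can be reproduced from the zval column under one assumption that is not spelled out next to the table: observed power is taken to be the probability of exceeding the marginal-significance criterion (p < .10 two-tailed, z of about 1.65) when the true mean of z equals the observed z. A minimal sketch in Python, with `observed_power` as a hypothetical helper name:

```python
from statistics import NormalDist

# z-scores copied from the zval column of Table 1
Z_VALUES = [1.77, 1.91, 0.10, 1.95, 1.77, 1.77]

# Assumption: marginal significance, p < .10 two-tailed (z of about 1.65)
Z_CRIT = NormalDist().inv_cdf(0.95)

def observed_power(z, z_crit=Z_CRIT):
    """Probability that a new z-score exceeds z_crit if the true mean equals z."""
    return NormalDist().cdf(z - z_crit)

powers = [round(observed_power(z), 2) for z in Z_VALUES]
print(powers)
```

Under this assumption the rounded values agree with the obs.power column, including the 0.06 for the clearly non-significant Study 3.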

The pattern of results is peculiar because five of the six results are marginally significant; that is, the p-value is greater than .05, but smaller than .10. This is strange because sampling error should produce more variability in p-values across studies. Why would the p-values always be greater than .05 and never be less than .05? It is also not clear why p-values did not decrease when researchers started to increase sample sizes from N = 62 in Study 1 to N = 533 in Study 6. As increasing sample sizes decrease sampling error, we would expect test statistics (ratio of effect size over sampling error) to become stronger and p-values to become smaller. Finally, the observed power of the six studies tends to be around 50%, except for Study 3 with a clear non-significant result. How is it possible that 5 studies with about a 50% chance to get marginally significant results produced marginally significant results in all 5 studies? Thus, a simple glance at the pattern of results raises several red flags about the statistical integrity of the results. However, do doping tests confirm this impression?

Incredibility Index

Without the clearly non-significant result in Study 3, we would have 5 significant results with an average observed power of 57%. The incredibility index simply computes the binomial probability of obtaining 5 significant results in 5 attempts with a 57% probability of doing so (Schimmack, 2012). The probability of doing so is 6%. Using the median power (55%) produces essentially the same result. This would suggest that QRPs were used. However, the set of studies does include a non-significant result, which reflects a change in publishing norms. Results like these would not have been reported before the replication crisis. And reporting a non-significant result makes the results more credible (Schimmack, 2012).
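The arithmetic behind the 6% figure is a plain binomial probability. The sketch below is illustrative only; `prob_at_least` is a hypothetical helper name, and the power values are the rounded figures quoted in the text.

```python
from math import comb

def prob_at_least(successes, attempts, power):
    """Binomial probability of at least `successes` significant results."""
    return sum(
        comb(attempts, k) * power**k * (1 - power) ** (attempts - k)
        for k in range(successes, attempts + 1)
    )

# Five significant results in five attempts
ic_mean = prob_at_least(5, 5, 0.57)    # average observed power
ic_median = prob_at_least(5, 5, 0.55)  # median observed power
print(round(ic_mean, 2), round(ic_median, 2))
```

With all attempts successful, the sum collapses to power raised to the number of studies, which is why the index is most sensitive when every reported result is significant.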

With the non-significant result, average power is 49% and there are now only 5 out of 6 successes. Although there is still a discrepancy (49% power vs. 83% success rate), the probability of this happening by chance is 17%. Thus, there is no strong evidence that QRPs were used.

The problem here is that the incredibility index has low power to detect doping in small sets of studies, unless all results are significant. Even a single non-significant result makes the observed pattern of results a lot more credible. However, absence of evidence does not mean evidence of absence. It is still possible that QRPs were used, but that the incredibility index failed to detect this.

Test of Insufficient Variance

The test of insufficient variance (TIVA) converts the p-values into z-scores and makes the simplifying assumption that p-values were obtained from a series of z-tests. This makes it possible to use the standard normal distribution as a model of the sampling error in each study. For a set of independent test statistics that are sampled from a standard normal distribution, the expected variance is 1. However, if QRPs are used to produce significant results, test statistics cluster just above the significance criterion (which is 1.65 for p < .10, when marginally significant results are present). This clustering can be detected by comparing the observed variance in z-scores to the expected variance of 1, using the chi-square test for the comparison of two variances.

Again, it is instructive to focus first on the set of 5 studies with marginally significant results. The variance of z-scores is very low, Var.Z = 0.008, because p-values are confined to the tight range from .05 to .10. The probability of observing this clustering in five studies is p = .0001 or 1 out of 8,892 times. Thus, we would have strong evidence of scientific doping.

However, when we include the non-significant result, variance increases to Var.Z = 0.507, which is no longer statistically significant in a set of six studies, p = .23. This shows again that a single clearly non-significant result makes the reported results a lot more credible. It also shows that one large outlier makes TIVA insensitive to detecting QRPs, even when they are present.
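The two TIVA results above can be approximated from the rounded z-scores in Table 1. The following sketch is a standard-library illustration, not the original implementation: `chi2_cdf` and `tiva` are hypothetical helper names, and the left-tail chi-square probability is computed with a textbook series for the incomplete gamma function. Because the inputs are rounded to two decimals, the p-values may differ slightly from those reported.

```python
import math
from statistics import variance

def chi2_cdf(x, df):
    """Lower-tail chi-square probability via a series for the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, y = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= y / (a + n)
        total += term
        n += 1
    return total * math.exp(-y + a * math.log(y) - math.lgamma(a))

def tiva(z_values, expected_var=1.0):
    """Return the sample variance of z-scores and the left-tail p-value
    for the hypothesis that the true variance equals expected_var."""
    df = len(z_values) - 1
    var_z = variance(z_values)          # sample variance (n - 1 denominator)
    chi2 = df * var_z / expected_var    # chi-square statistic
    return var_z, chi2_cdf(chi2, df)

five = [1.77, 1.91, 1.95, 1.77, 1.77]   # marginally significant results only
six = five + [0.10]                     # adding the Study 3 outlier

var5, p5 = tiva(five)
var6, p6 = tiva(six)
print(f"5 studies: Var.Z = {var5:.3f}, p = {p5:.5f}")
print(f"6 studies: Var.Z = {var6:.3f}, p = {p6:.2f}")
```

The test is one-sided on the left tail because the concern is too little variance, not too much.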

The Robust Test of Insufficient Variance (formerly known as the Lucky Bounce Test)

The Robust Test of Insufficient Variance (ROTIVA) is less sensitive to outliers than TIVA. It works by creating a region of p-values (or z-scores, or observed powers) that are considered to be lucky. That is, the result is significant, but not highly convincing. A useful area of lucky outcomes is the range of p-values between .05 and .005, which corresponds to power of 50% to 80%. We might say that studies with 80% power are reasonably powered and produce significant results most of the time. However, studies with 50% power are risky because they produce a significant result only in every other study. Thus, getting a significant result is lucky. With two-sided p-values, the interval ranges from z = 1.96 to 2.8. However, when marginal significance is used, the interval ranges from z = 1.65 to 2.49 with a center at 2.07.

Once the area of lucky outcomes is defined, it is possible to specify the maximum probability of observing a lucky outcome. This maximum is obtained by centering the sampling distribution in the middle of the lucky interval, and it is 34%.

Thus, the maximum probability of obtaining a lucky significant result in a single study is 34%. This value can be used to compute the probability of obtaining a given number of lucky results in a set of studies using binomial probabilities. With 5 out of 5 studies, the probability is very small, p = .005, but we see that the robust test is not as powerful as TIVA in this situation without outliers. This reverses when we include the outlier. ROTIVA still shows significant evidence of QRPs with 5 out of 6 lucky results, p = .020, when TIVA was no longer significant.
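Both ROTIVA p-values follow from the binomial distribution once the 34% ceiling is taken as given. The sketch below assumes that the 5-out-of-6 result is an upper-tail probability (at least 5 lucky outcomes); `MAX_LUCKY` and `prob_at_least` are hypothetical names used only for this illustration.

```python
from math import comb

# Maximum probability of a lucky outcome in a single study, as stated in the text
MAX_LUCKY = 0.34

def prob_at_least(lucky, studies, p=MAX_LUCKY):
    """Binomial probability of at least `lucky` lucky outcomes in `studies` studies."""
    return sum(
        comb(studies, k) * p**k * (1 - p) ** (studies - k)
        for k in range(lucky, studies + 1)
    )

p_5_of_5 = prob_at_least(5, 5)  # without the outlier
p_5_of_6 = prob_at_least(5, 6)  # with the non-significant Study 3
print(f"5 of 5: p = {p_5_of_5:.3f}")
print(f"5 of 6: p = {p_5_of_6:.3f}")
```

Because 34% is the largest possible probability of a lucky outcome, these p-values are upper bounds, which makes the test conservative.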

Z-Curve 2.0

Z-curve was developed to estimate the replication rate for a set of studies with significant results (Brunner & Schimmack, 2019). As z-curve only selects significant results, it assumes rather than tests the presence of QRPs. The details of z-curve are too complex to discuss here. It is only important to know that z-curve allows for heterogeneity in power and approximates the distribution of significant p-values, converted into z-scores, with a mixture model of folded standard normal distributions. The model parameters are weights for components with low to high power. Although the model is fitted only to significant results, the weights can also be used to make predictions about the distribution of z-scores in the range of non-significant results. It is then possible to examine whether the predicted number of non-significant results matches the observed number of non-significant results.

To use z-curve for sets of studies with marginally significant results, one only needs to adjust the significance criterion from p = .05 (two-tailed) to p = .10 (two-tailed) or from z = 1.96 to z = 1.65. Figure 2 shows the results, including bootstrapped confidence intervals.

The most relevant statistic for the detection of QRPs is the comparison of the observed discovery rate and the estimated discovery rate. As for the incredibility index, the observed discovery rate is simply the percentage of studies with significant results (5 out of 6). The estimated discovery rate is the area under the gray curve that is in the range of significant results with z > 1.65. As can be seen, this area is very small, given the estimated sampling distribution from which significant results were selected. The 95% CI for the observed discovery rate has a lower limit of 54%, while the upper limit for the estimated discovery rate is 15%. Thus, these intervals do not overlap and are very far from each other, which provides strong evidence that QRPs were used.

Conclusion

Before the replication crisis it was pretty much certain that articles would only report significant results that support hypotheses (Sterling, 1959). This selection of confirmatory evidence was considered an acceptable practice, although it undermines the purpose of significance testing. In the wake of the replication crisis, I developed tests that can examine whether QRPs were used to produce significant results. These tests work well even in small sets of studies as long as all results are significant.

In response to the replication crisis, it has become more acceptable to publish non-significant results. The presence of clearly non-significant results makes a published article more credible, but it doesn’t automatically mean that QRPs were not used. A new deceptive practice would be to include just one non-significant result to avoid detection by scientific doping tests like the incredibility index or TIVA. Here I show that a second generation of doping tests is able to detect QRPs in small sets of studies even when non-significant results are present. This is bad news for p-hackers and good news for science.

I suggest that journal editors and reviewers make use of these tools to ensure that journals publish only credible scientific evidence. Articles like this one should not be published because they do not report credible scientific evidence. Not publishing articles like this is even beneficial for authors because they avoid damage to their reputation when post-publication peer-reviews reveal the use of QRPs that are no longer acceptable.

References

Francis, G. (2013). Publication bias in “Red, Rank, and Romance in Women Viewing Men” by Elliot et al. (2010). Journal of Experimental Psychology: General, 142, 292-296.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. doi:10.1177/0956797611430953

Peperkoorn, L. S., Roberts, S. C., & Pollet, T. V. (2016). Revisiting the red effect on attractiveness and sexual receptivity: No effect of the color red on human mate preferences. Evolutionary Psychology, 14(4). http://dx.doi.org/10.1177/1474704916673841

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. https://replicationindex.com/2018/02/18/why-most-multiple-study-articles-are-false-an-introduction-to-the-magic-index/

Schimmack, U. (2015). The Test of Insufficient Variance. https://replicationindex.com/2015/05/13/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices-2/

Schimmack, U. (2015). The Lucky Bounce Test. https://replicationindex.com/2015/05/27/when-exact-replications-are-too-exact-the-lucky-bounce-test-for-pairs-of-exact-replication-studies/
