An article in Psychological Science titled “Women Are More Likely to Wear Red or Pink at Peak Fertility” reported two studies that related women’s menstrual cycle to the color of their shirts. Study 1 (N = 100) found that women were more likely to wear red or pink shirts around the time of ovulation. Study 2 (N = 25) replicated this finding. An article in Slate magazine, “Too good to be true,” questioned the credibility of the reported results. The critique led to a lively discussion about research practices, statistics, and psychological science in general.
The R-Index provides useful information about some of the unresolved issues in the debate.
The main finding in Study 1 was a significant chi-square test, chi-square (1, N = 100) = 5.32, p = .021, z = 2.31, observed power 64%.
The main finding in Study 2 was a chi-square test, chi-square (1, N = 25) = 3.82, p = .051, z = 1.95, observed power 50%.
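The conversion from the reported chi-square values to z-scores, p-values, and observed power can be sketched as follows. This is a minimal illustration, not the original analysis script: the function names are made up for this example, and observed power is computed as the power of a two-tailed test at alpha = .05 under the assumption that the true effect equals the observed one (the negligible opposite tail is ignored).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()                  # standard normal: mean 0, sd 1
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.05 / 2)  # about 1.96 for alpha = .05, two-tailed


def chi2_to_z(chi2: float) -> float:
    """With 1 degree of freedom, chi-square is the square of a z-score."""
    return sqrt(chi2)


def two_tailed_p(z: float) -> float:
    """Two-tailed p-value for a z-score."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def observed_power(z: float) -> float:
    """Power if the true effect equaled the observed one (opposite tail ignored)."""
    return STD_NORMAL.cdf(abs(z) - Z_CRIT)


for label, chi2 in [("Study 1", 5.32), ("Study 2", 3.82)]:
    z = chi2_to_z(chi2)
    print(f"{label}: z = {z:.2f}, p = {two_tailed_p(z):.3f}, "
          f"observed power = {observed_power(z):.0%}")
```

Run on the two chi-square values above, this should come out close to the reported z-scores, p-values, and observed power estimates.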
One way to look at these results is to assume that the authors planned the two studies, including sample sizes, conducted two statistical significance tests and reported the results of their planned analysis. Both tests have to produce significant results in the predicted direction at p < .05 (two-tailed) to be published in Psychological Science. The authors claim that the probability of this event occurring by chance is only 0.25% (5% * 5%). In fact, the probability is even lower because a two-tailed test can be significant when the effect is opposite to the hypothesis (i.e., women are less likely to wear red at peak fertility, p < .05, two-tailed). The probability of getting a significant result in a theoretically predicted direction with p < .05 (two-tailed) is equivalent to a one-tailed test with p < .025 as the significance criterion. The probability of this happening twice in a row is only 0.06%. According to this scenario, the significant results in the two studies are very unlikely to be a chance finding. Thus, they provide evidence that women are more likely to wear red at peak fertility.
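The arithmetic behind the two chance probabilities is short enough to write out. The variable names below are chosen for this illustration only.

```python
alpha_two_tailed = 0.05                    # conventional two-tailed criterion
alpha_directional = alpha_two_tailed / 2   # only the predicted direction counts

# Two independent studies, both significant by chance alone.
p_both_any_direction = alpha_two_tailed ** 2
p_both_predicted_direction = alpha_directional ** 2

print(f"Either direction:    {p_both_any_direction:.2%}")
print(f"Predicted direction: {p_both_predicted_direction:.4%}")
```

The second value is 0.0625%, which the text rounds to 0.06%.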
The R-Index takes a different perspective. The focus is on replicability of the results reported in the two studies. Replicability is defined as the long-run probability to produce significant results in exact replication studies; everything but random sampling error is constant.
The first step is to estimate replicability of each study. Replicability is estimated by converting p-values into observed power estimates. As shown above, observed power is estimated to be 64% in Study 1 and 50% in Study 2. If these estimates were correct, the probability to replicate significant results in two exact replication studies would be 32%. This also implies that the chance of obtaining significant results in both original studies was only 32%. This raises the question of what researchers would do when a non-significant result is obtained. If reporting or publication bias prevents these results from being published, published results provide an inflated estimate of replicability (100% success rate with 32% probability to be successful).
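As a rough sketch, the 32% figure is simply the product of the two observed power estimates, treating the studies as independent. The variable names are illustrative.

```python
from math import prod

observed_powers = [0.64, 0.50]   # Study 1 and Study 2, as reported above

# Probability that both (independent) studies produce a significant result.
p_all_significant = prod(observed_powers)

# Probability that at least one study fails to reach significance.
p_at_least_one_failure = 1 - p_all_significant

print(f"Both significant:          {p_all_significant:.0%}")
print(f"At least one non-significant: {p_at_least_one_failure:.0%}")
```

In other words, if the power estimates were accurate, a set of two studies like this one would more often than not contain at least one non-significant result.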
The R-Index uses the median as the best estimate of the typical power in a set of studies. Median observed power is 57%. However, the success rate is 100% (two significant results in two reported attempts). The discrepancy between the success rate (100%) and the expected rate of significant results (57%) is the inflation rate: the excess of reported significant results over what a long-run success rate of 57% would produce (100% – 57% = 43%). This is akin to getting red twice in a roulette game with a 50% chance of red or black (ignoring 0 here). In the long run, an unbiased roulette table would also produce black outcomes, yielding the expected rate of 50% red and 50% black numbers.
The R-Index corrects for this inflation by subtracting the inflation rate from observed power.
The R-Index is 57% – 43% = 14%.
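The calculation described above (median observed power minus the inflation rate) can be written as a small function. This is a minimal sketch based only on the steps in this post; the function name and signature are assumptions made for the example.

```python
from statistics import median


def r_index(observed_powers: list[float], n_significant: int) -> float:
    """R-Index = median observed power minus the inflation rate."""
    success_rate = n_significant / len(observed_powers)
    median_power = median(observed_powers)
    inflation = success_rate - median_power   # e.g. 1.00 - 0.57 = 0.43
    return median_power - inflation


# Two reported studies, both significant, with observed power of 64% and 50%.
result = r_index([0.64, 0.50], n_significant=2)
print(f"R-Index = {result:.0%}")
```

With a 100% success rate the formula reduces to twice the median observed power minus one, which is why an R-Index can fall far below the observed power itself.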
To interpret an R-Index of 14%, the following scenarios are helpful.
When the null-hypothesis is true and non-significant results are not reported, the R-Index is 22%. Thus, the R-Index for this pair of studies is lower than the R-Index for the null-hypothesis.
With just two studies, it is possible that researchers were just lucky to get two significant results despite the low probability of this event.
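One way to arrive at a value close to the 22% benchmark for the null-hypothesis scenario is sketched below. It rests on assumptions that are not spelled out in this post: tests are two-tailed at alpha = .05, only significant results are reported, and the R-Index is computed from the median absolute z-score among those significant results.

```python
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05
z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96

# Under the null, |z| exceeds z_crit with probability alpha. The median of the
# significant |z| values is the point that |z| exceeds with probability alpha / 2,
# i.e. the (1 - alpha / 4) quantile of the standard normal.
median_significant_z = std_normal.inv_cdf(1 - alpha / 4)

# Observed power at that median z-score (opposite tail ignored).
median_observed_power = std_normal.cdf(median_significant_z - z_crit)

# Only significant results are reported, so the success rate is 100%.
inflation = 1.0 - median_observed_power
r_index_null = median_observed_power - inflation

print(f"Median observed power: {median_observed_power:.0%}")
print(f"R-Index under the null: {r_index_null:.0%}")
```

Under these assumptions, selection for significance alone produces a median observed power of roughly 61%, and the R-Index lands near 22%.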
For other researchers it is not important why reported results are likely to be too good to be true. For science, it is more important that the reported results can be generalized to future studies and real world situations. The main reason to publish studies in scientific journals is to provide evidence that can be replicated even in studies that are not exact replication studies, but provide sufficient opportunity for the same causal process (peak fertility influences women’s clothing choices) to be observed. With this goal in mind, a low R-Index reveals that the two studies provide rather weak evidence for the hypothesis and that the generalizability to future studies and real world scenarios is uncertain.
In fact, only 28% of studies with an average R-Index of 43% replicated in a test of the R-Index (reference!). Failed replication studies consistently tend to have an R-Index below 50%.
For this reason, Psychological Science should have rejected the article and asked the authors to provide stronger evidence for their hypothesis.
Psychological Science should also have rejected the article because the second study had only a quarter of the sample size of Study 1 (N = 25 vs. 100). Given the effect size in Study 1 and observed power of only 64% in Study 1, cutting the sample size by 75% reduces the probability to obtain a significant effect in Study 2 to about 20%. Thus, the authors were extremely lucky to produce a significant result in Study 2. It would have been better to conduct the replication study with a sample of 150 participants to have 80% power to replicate the effect in Study 1.
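The power figures for the replication study can be approximated with a normal-theory sketch. It assumes that the true effect size equals the one observed in Study 1 and that the expected z-score scales with the square root of the sample size; the helper names are made up for this illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)   # about 1.96, two-tailed alpha = .05

z_study1 = sqrt(5.32)                # z-score of Study 1 (chi-square with 1 df)
n_study1 = 100


def expected_power(n: int) -> float:
    """Power at sample size n if the Study 1 effect size were the true effect."""
    expected_z = z_study1 * sqrt(n / n_study1)
    return std_normal.cdf(expected_z - z_crit)


def required_n(target_power: float) -> int:
    """Smallest n that reaches the target power under the same assumption."""
    z_needed = z_crit + std_normal.inv_cdf(target_power)
    return ceil(n_study1 * (z_needed / z_study1) ** 2)


print(f"Power with N = 25: {expected_power(25):.0%}")
print(f"N needed for 80% power: {required_n(0.80)}")
```

This approximation gives a power of roughly one in five for N = 25 and a required sample size just under 150, in line with the figures above.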
The R-Index of “Women Are More Likely to Wear Red or Pink at Peak Fertility” is 14. This is a low value and suggests that the results will not replicate in an exact replication study. It is possible that the authors were just lucky to get two significant results. However, lucky results distort the scientific evidence and these results should not be published without a powerful replication study that does not rely on luck to produce significant results. To avoid controversies like these and to increase the credibility of published results, researchers should conduct more powerful tests of hypotheses and scientific journals should favor studies that have a high R-Index.
Since this article was published, concerns about studies that relate women’s hormone levels to their behaviors have increased. Leading evolutionary psychologist Gangestad declared most findings in this literature to be garbage (Engber, 2018). This literature is yet another example of the bad practices in psychology. Conducting low-powered studies and then publishing only results that became significant with the help of inflated effect sizes is not science and does not generate useful knowledge. It is sad that 10 years after the replication crisis and six years after I posted this blog post, many psychologists still continue to pursue research following this flawed model.