An article in *Psychological Science* titled “Women Are More Likely to Wear Red or Pink at Peak Fertility” reported two studies that related women’s menstrual cycles to the color of their shirts. Study 1 (N = 100) found that women were more likely to wear red or pink shirts around the time of ovulation. Study 2 (N = 25) replicated this finding. An article in *Slate* magazine, “Too Good to Be True,” questioned the credibility of the reported results. The critique led to a lively discussion about research practices, statistics, and psychological science in general.

The R-Index provides some useful information about some unresolved issues in the debate.

The main finding in Study 1 was a significant chi-square test, chi-square (1, N = 100) = 5.32, p = .021, z = 2.31, observed power 64%.

The main finding in Study 2 was a chi-square test, chi-square (1, N = 25) = 3.82, p = .051, z = 1.95, observed power 50%.

One way to look at these results is to assume that the authors planned the two studies, including sample sizes, conducted two statistical significance tests, and reported the results of their planned analyses. Both tests have to produce significant results in the predicted direction at p = .05 (two-tailed) to be published in *Psychological Science*. The authors claim that the probability of this event occurring by chance is only 0.25% (5% * 5%). In fact, the probability is even lower because a two-tailed test can be significant when the effect is opposite to the hypothesis (i.e., women are less likely to wear red at peak fertility, p < .05, two-tailed). The probability of obtaining a significant result in the theoretically predicted direction with p = .05 (two-tailed) is equivalent to a one-tailed test with p = .025 as the significance criterion. The probability of this happening twice in a row is only 0.06%. According to this scenario, the significant results in the two studies are very unlikely to be a chance finding. Thus, they provide evidence that women are more likely to wear red at peak fertility.
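The arithmetic of this scenario can be checked in a few lines (a minimal sketch of the calculation above, not code from the original article):

```python
# Probability that two independent studies both reach p < .05 (two-tailed)
# in the predicted direction when the null hypothesis is true.
p_two_tailed = 0.05
p_directional = p_two_tailed / 2   # .025 in the predicted tail
both = p_directional ** 2          # two such results in a row
print(f"{both:.4%}")               # 0.0625% -> the ~0.06% cited in the text
```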

The R-Index takes a different perspective. The focus is on the replicability of the results reported in the two studies. Replicability is defined as the long-run probability of producing significant results in exact replication studies, that is, studies in which everything except random sampling error is held constant.

The first step is to estimate the replicability of each study. Replicability is estimated by converting p-values into observed power estimates. As shown above, observed power is estimated to be 64% in Study 1 and 50% in Study 2. If these estimates were correct, the probability of obtaining significant results in two exact replication studies would be 32% (64% * 50%). This also implies that the chance of obtaining two significant results in the original studies was only 32%. This raises the question of what researchers do when a non-significant result is obtained. If reporting or publication biases prevent these results from being published, the published results provide an inflated estimate of replicability (a 100% success rate despite a 32% probability of success).
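The conversion from test statistics to observed power can be sketched as follows. This uses the standard normal approximation (p-value to z-score to power) and ignores the negligible probability of significance in the wrong tail:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def observed_power(z):
    """Post-hoc power: probability that an exact replication with the same
    true effect (taken to equal the observed z) is again significant at
    p < .05 two-tailed (critical z = 1.96)."""
    return 1.0 - phi(1.96 - z)

p1 = observed_power(2.31)  # Study 1: ~64%
p2 = observed_power(1.95)  # Study 2: ~50%
print(round(p1, 2), round(p2, 2), round(p1 * p2, 2))  # 0.64 0.5 0.32
```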

The R-Index uses the median as the best estimate of the typical power in a set of studies. Median observed power is 57%. However, the success rate is 100% (two significant results in two reported attempts). The discrepancy between the success rate (100%) and the expected rate of significant results (57%) shows how much the reported success rate is inflated relative to the expected long-run rate (100% – 57% = 43%). This would be equivalent to getting red twice in a row at a roulette table with a 50% chance of red or black (ignoring the 0 here). In the long run, an unbiased roulette wheel would also produce black outcomes, converging on the expected rate of 50% red and 50% black.

The R-Index corrects for this inflation by subtracting the inflation rate from observed power.

**The R-Index is 57% – 43% = 14%.**
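Putting the reported numbers together, the R-Index calculation is (a sketch using only Python's standard library):

```python
from statistics import median

observed_power = [0.64, 0.50]          # Studies 1 and 2
success_rate = 2 / 2                   # both reported tests were significant

median_power = median(observed_power)  # 0.57
inflation = success_rate - median_power        # 1.00 - 0.57 = 0.43
r_index = median_power - inflation             # 0.57 - 0.43 = 0.14
print(round(r_index, 2))               # 0.14
```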

To interpret an R-Index of 14%, the following scenarios are helpful.

When the null hypothesis is true and non-significant results are not reported, the R-Index is 22% (under the null hypothesis, the median significant z-value is 2.24, which corresponds to 61% observed power and an R-Index of 61% – 39% = 22%). Thus, the R-Index for this pair of studies is lower than the R-Index for the null hypothesis.
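The 22% benchmark can be verified with a small simulation: generate z-values under the null hypothesis, keep only the "publishable" significant ones, and compute the R-Index from their observed power. This is a sketch; the sample count and seed are arbitrary choices, and the limiting value depends only on the normal distribution:

```python
import random
from math import erf, sqrt
from statistics import median

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

random.seed(1)
# Simulate many studies of a true null effect.
z_values = [abs(random.gauss(0.0, 1.0)) for _ in range(200_000)]
# Only significant results (|z| > 1.96) get published.
published = [z for z in z_values if z > 1.96]

powers = [1.0 - phi(1.96 - z) for z in published]
median_power = median(powers)                  # ~0.61
r_index = median_power - (1.0 - median_power)  # ~0.22
print(round(r_index, 2))
```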

With just two studies, it is possible that the researchers were simply lucky to obtain two significant results despite the low probability of this event.

For readers it is not important why reported results are likely to be too good to be true. For science, it is more important whether the reported results can be generalized to future studies and real-world situations. The main reason to publish studies in scientific journals is to provide evidence that can be replicated, even in studies that are not exact replications but provide sufficient opportunity for the same causal process (peak fertility influences women’s clothing choices) to be observed. With this goal in mind, a low R-Index reveals that the two studies provide rather weak evidence for the hypothesis and that generalizability to future studies and real-world scenarios is uncertain.

In fact, only 28% of studies with an average R-Index of 43% replicated in a test of the R-Index (reference!). Failed replication studies consistently tend to have an R-Index below 50%.

For this reason, Psychological Science should have rejected the article and asked the authors to provide stronger evidence for their hypothesis.

*Psychological Science* should also have rejected the article because the second study had only a quarter of the sample size of Study 1 (N = 25 vs. 100). Given the effect size in Study 1 and observed power of only 64% in Study 1, cutting the sample size by 75% reduces the probability of obtaining a significant effect in Study 2 to about 20%. Thus, the authors were extremely lucky to produce a significant result in Study 2. It would have been better to conduct the replication study with a sample of about 150 participants to have 80% power to replicate the effect in Study 1.
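This power projection can be sketched by estimating Cohen's w from Study 1 and scaling it to Study 2's sample size (an approximation using the same normal-distribution shortcut as the observed-power estimates earlier in the post):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

chi2_study1, n1, n2 = 5.32, 100, 25
w = sqrt(chi2_study1 / n1)            # Cohen's w from Study 1, ~0.23
z_expected = w * sqrt(n2)             # expected z in Study 2, ~1.15
power = 1.0 - phi(1.96 - z_expected)  # ~0.21, i.e. the ~20% in the text
n_80 = ((1.96 + 0.84) / w) ** 2       # N needed for 80% power, ~147
print(round(power, 2), round(n_80))
```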

**Conclusion**

The R-Index of “Women Are More Likely to Wear Red or Pink at Peak Fertility” is 14. This is a low value and suggests that the results will not replicate in an exact replication study. It is possible that the authors were simply lucky to obtain two significant results. However, lucky results distort the scientific evidence, and such results should not be published without a powerful replication study that does not rely on luck to produce significant results. To avoid controversies like these and to increase the credibility of published results, researchers should conduct more powerful tests of hypotheses, and scientific journals should favor studies that have a high R-Index.

## Addendum 12/20/2020

Since this article was published, concerns about studies that relate women’s hormone levels to their behaviors have increased. The leading evolutionary psychologist Steven Gangestad declared most findings in this literature to be garbage (Engber, 2018). This literature is yet another example of bad practices in psychology. Conducting low-powered studies and then publishing only the results that became significant with the help of inflated effect sizes is not science and does not generate useful knowledge. It is sad that ten years after the start of the replication crisis and six years after I posted this blog post, many psychologists still continue to pursue research following this flawed model.

This post is a good example of why peer review is essential to our science. Unlike published articles, un-reviewed blog posts do not need to include appropriate reviews of the prior literature, yet a publication on this issue would not have made it past review without acknowledging our published direct replication of these studies (which included a reported failure to replicate; Tracy & Beall, 2014); our finding of a moderator of the previously published effect—which provides some indication as to why the effect emerged so strongly in those first two studies (also see Tracy & Beall, 2014); or any of our more recent blog posts on this issue, where we report results from every data point collected (total N = 633, odds ratio = 1.67, p = .032), and explain in detail how we have responded openly to all questions raised about this research. Although these more recent reports may not be relevant to the question of whether our initial paper should have been published in Psych Science, they are certainly relevant to the central issue of this post: how replicable the reported effect is.

-Jess Tracy

Both the original post and the comment by Jess Tracy are correct and make valid points. However, by focusing exclusively on p-values, they both miss the point. Scientists should care about measuring and explaining the relationships between natural phenomena. In this case, the constructs of interest are wearing red/pink and female fertility. The only appropriate question concerns the relationship between the two. Taking Jess Tracy at her word, we can convert p = .032 into a Z-value (which is 1.85 if that p-value is one-tailed). We can then convert this Z to a t, and with N = 633 the value is essentially unchanged. Then we can convert the Z to an r to see that the relationship between wearing red/pink and female fertility is r = .07. (Of note, if the p-value reported was two-tailed and should have been one-tailed for theoretical reasons, then r = .08 instead.)

If that is all of the data collected on this question, then r = .07 is our best estimate of the relationship. As scientists we should also be interested in the precision of that estimate. We can estimate that with a 95% confidence interval. In this case it is about [.00, .15]. (Note: The lower end may be more like .01, but it is hard to be that precise without knowing more about the data.)

So that is it. If this is all the data we have, then this is our best estimate of the relationship r = .07 +/- .078. No concerns about p-values. No concerns about replicability.
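The conversion in the comment above can be reproduced as follows. This is a sketch that treats the reported p = .032 as a directional (one-tailed) value so that z ≈ 1.85, and uses the Fisher r-to-z standard error 1/sqrt(N − 3) for the confidence interval, which for an r this small is essentially symmetric around r:

```python
from math import sqrt
from statistics import NormalDist

n = 633
z = NormalDist().inv_cdf(1 - 0.032)  # directional p = .032 -> z ~ 1.85
r = z / sqrt(n)                      # ~ .07

# 95% CI half-width via Fisher's r-to-z transform, SE = 1/sqrt(N - 3)
half_width = 1.96 / sqrt(n - 3)      # ~ .078, so CI ~ [.00, .15]
print(round(r, 2), round(half_width, 3))
```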

Once we have established such a relationship, scientists should care about explanations for that relationship. But that is beyond the scope of this discussion.

Here is a link to Tracy & Beall (2014):

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0088852;jsessionid=E9020495AB48730766CC41B30620BA9B

First of all I love the fact that this is published in an open access journal, so everyone can read the article and make up their own minds.

I am just wondering whether conclusions can be drawn from studies using sample sizes like those in Tracy & Beall (2014). I am not smart enough to do any power analyses, but I read the following: “Next, we again classified women according to whether they reported wearing a red or pink shirt (n = 22) or an other-colored shirt (n = 187)”. To me that seems like a tiny sample (only 22 women wore a red or pink shirt), and I wonder if that could have caused a fluke result.

Dear Dr. Tracy,

Thank you for your comments. I hope we can have a constructive discussion about my blog post.

1. You believe in peer-review as a way to ensure quality control and suggest that peer-review would have prevented the publication of my post. I believe that comments on blogs can act as a public and transparent peer-review process. Your response essentially constitutes a peer-review and I am happy to share your comments with everybody who is interested in my blog. If I made some mistakes in my blog, I am happy to correct them and everybody can witness how peer-review increases the quality of science.

2. You suggest that my blog post should not have been published because I do not cite Tracy and Beall (2014). I am grateful to you for informing me about this related study, but my post was based on the replicability of the studies published in the *Psychological Science* article.

3. You suggest that a moderator variable explains why the previous two studies produced a stronger effect than other studies. This is possible, but the R-Index provides another explanation. The low R-Index of the two studies in Psychological Science suggests that the reported effect sizes were inflated by sampling error and that exact replication studies would produce results with lower effect sizes. A moderator effect implies that a study varied some factors (the replication study is not exact) and that effect sizes change systematically as a function of variation in the experiment. The R-Index is not concerned with moderator variables. It predicts replicability for studies with the exact same conditions, including sample size. Thus, a moderator effect is irrelevant for my conclusion that exact replication studies are likely to produce weaker effect sizes.

4. You mention that you now have data from N = 633 participants and that the odds ratio is 1.67, as compared to the odds ratios reported in Psychological Science (3.85, 8.67). This new estimate of the true effect size confirms the prediction of the R-Index that the published effect sizes were inflated and that exact replication studies with Ns of 100 or 25 participants and a true odds ratio of 1.67 are likely to produce a much lower success rate than your success rate of 100% implies. In other words, you were lucky to get two significant results in studies with low power, but in the long run you or other researchers will not be lucky all the time, and non-significant results will emerge.

5. I think the final point of your comment reveals a misunderstanding. The Replicability-Index is concerned with the replicability of published results. The R-Index does not say anything about the probability that an effect exists at all (odds ratio > 1) or not. The main point of the R-Index is that underpowered studies with inflated effect sizes provide insufficient empirical evidence to decide whether an effect does not exist or whether a small effect exists. The main conclusion is that it would have been better to conduct a more powerful study in the first place and to publish those results, rather than publishing preliminary results that created an unnecessary controversy and then publishing a follow-up study.

6. The main aim of the R-Index is to increase the statistical power of psychological studies. Jacob Cohen made this point over 50 years ago, but psychological science has ignored Cohen’s important contribution and continues to conduct and report underpowered studies that require favorable sampling error to be significant. A high R-Index suggests that a researcher invested time, effort, and resources into testing a hypothesis. Thus, the R-Index can complement other indices of research productivity, like the H-index, that favor publishing many studies that require less effort.

In conclusion, I consider our open exchange about my blog post a positive example of open scientific dialogue. I think the following conclusions are consistent with my post and your response: (a) the combined evidence of all your studies suggests that women at peak fertility are more likely to wear red/pink than other colors, (b) the effect size is weak (odds ratio = 1.67), (c) the moderate to strong effect sizes reported in Psychological Science were inflated estimates of the true effect size, and (d) the R-Index correctly predicted that the true effect size is weaker than the effect sizes published in Psychological Science.
