Why You Should Not Trust P-Curve

In 2011, Simmons, Nelson, and Simonsohn published an article that used simulation studies to show how researchers can obtain significant results without a real effect. This practice has become widely known as p-hacking. In 2014, the same authors presented a statistical method that uses the distribution of significant p-values to determine whether significant results have evidential value or whether they were p-hacked. This method is called p-curve, and it has been used in numerous meta-analyses.
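
To give a flavor of the kind of simulation Simmons et al. reported, the following sketch (my own illustration, not their exact code) shows how one common questionable practice, optional stopping, inflates the rate of significant results when the true effect is zero.

import numpy as np
from scipy import stats

# Optional stopping: test after every batch of new observations and stop at p < .05.
# Even with a true effect of zero, the false-positive rate rises well above 5%.
rng = np.random.default_rng(1)
n_sims, hits = 5000, 0
for _ in range(n_sims):
    x = list(rng.normal(0, 1, 20))   # start with n = 20 per cell, true effect = 0
    y = list(rng.normal(0, 1, 20))
    for _ in range(5):               # peek up to five times, adding 10 cases per cell each time
        if stats.ttest_ind(x, y).pvalue < .05:
            hits += 1
            break
        x.extend(rng.normal(0, 1, 10))
        y.extend(rng.normal(0, 1, 10))
print(hits / n_sims)                 # well above the nominal .05 in repeated runs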

While p-hackers like Norbert Schwarz were initially afraid that p-curve could reveal their questionable research practices, the past 10 years have shown that p-curve is often used to sell p-hacked results as credible or even robust evidence.

In this blog post, I focus on the article “Meta-Analyses and P-Curves Support Robust Cycle Shifts in Women’s Mate Preferences: Reply to Wood and Carden (2014) and Harris, Pashler, and Mickes (2014)” by Kelly Gildersleeve, Martie G. Haselton, and Melissa R. Fales, published in the prestigious journal Psychological Bulletin.

The article reports a p-curve analysis of studies that examined the influence of women’s menstrual cycle on their mate preferences. A meta-analysis by the same authors in 2014 appeared to support the ovulatory shift hypothesis. Articles by Wood et al. and Harris et al. questioned these results and suggested that the evidence might be inflated by selective reporting of confirmatory evidence.

To respond to this criticism, Gildersleeve et al. conducted a p-curve analysis. A p-curve analysis has two parts. The first part is a histogram of significant p-values with five bins: .05 to .04, .04 to .03, .03 to .02, .02 to .01, and .01 to .00. The second part is a significance test of the distribution of these p-values against the null hypothesis that they are uniformly distributed; a uniform distribution of significant p-values is expected when the null hypothesis is true.
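
The following sketch illustrates this logic with made-up p-values (not the p-values from the meta-analysis). It bins the significant p-values as a p-curve plot does and computes a Stouffer-style right-skew test; this mirrors one version of the p-curve test, but it is only a simplified illustration.

import numpy as np
from scipy import stats

p = np.array([.001, .004, .011, .018, .024, .032, .041, .049])  # illustrative significant p-values

# Histogram with the five p-curve bins (.00-.01, .01-.02, .02-.03, .03-.04, .04-.05)
counts = np.histogram(p, bins=[0, .01, .02, .03, .04, .05])[0]
print(counts)

# Under the null hypothesis, significant p-values are uniform on (0, .05),
# so pp = p/.05 is uniform on (0, 1). A strongly negative pooled z-score
# (and a small one-sided p-value) indicates right skew, i.e., evidential value.
pp = p / .05
z_pooled = stats.norm.ppf(pp).sum() / np.sqrt(len(pp))
print(stats.norm.cdf(z_pooled))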

They report two p-curve analyses. Their Figure 3 shows a so-called right-skewed distribution, and the corresponding significance test produces a p-value of .0005.

Another analysis supports this conclusion, but with a less impressive p-value.

Based on these results, the authors were allowed to conclude that there is strong evidence in support of the cycle-shift hypothesis and to claim in the title that this evidence is robust.

As we previously reported, our meta-analysis revealed strong support for the ovulatory shift hypothesis. New analyses using the p-curve method again revealed strong support for genuine cycle shifts as predicted by the ovulatory shift hypothesis. Claims by Wood et al. (2014), Wood and Carden (2014), and Harris et al. (2014) that the abundance of positive findings in the cycle shifts literature merely reflects publication bias, p-hacking, or other research artifacts did not anticipate and cannot explain these new findings.

The authors go even further and suggest that their results contradict claims that p-hacking is a major problem in this line of research.

Given recent doubts about the evidential value of published research findings, many researchers have called for “cleaning up” psychological science. We fully support this effort. However, just as claims regarding the existence of hypothesized effects should be supported with strong empirical evidence, so should claims regarding whether p-hacking or other practices have produced the illusion of positive evidence. In the case of the literature on cycle shifts in women’s mate preferences, speculations about a widespread “false positive problem” are unwarranted.

So far, this article has been cited 50 times. I could not find articles criticizing these conclusions. Instead, some articles cited it as evidence for the ovulatory shift hypothesis. For example, Lewis (2020) wrote:

Gildersleeve, Haselton, and Fales (2014b) provide a robust defence of their meta-analysis suggesting that the p-hacking could not be the sole reason for the observed shifts in masculinity preferences (p. 2).

Since p-curve was published in 2014, new methods have been developed to examine publication bias and evidential value in meta-analyses. My colleagues and I have developed z-curve for this purpose (Bartos & Schimmack, 2022; Brunner & Schimmack, 2021). I used the data in Gildersleeve et al.’s supplement to conduct a z-curve analysis.

A z-curve analysis converts two-sided p-values into absolute z-scores. A z-curve plot shows the distribution of these z-scores, but unlike a p-curve plot, it does not truncate p-values / z-scores at the level of significance (z = 1.96, red dashed line). Thus, visual inspection of a z-curve plot makes it easy to spot p-hacking and other practices that lead to an overrepresentation of significant results. The z-curve plot for Gildersleeve et al.’s data makes it obvious that the published results were selected for significance. The only reason there is a non-significant result at all is that one study used a one-sided test to obtain significance, while its two-sided p-value falls short of significance.

Z-curve also provides a statistical test of selection for significance. For this purpose, z-curve uses a mixture model to predict the distribution of non-significant results (the dotted blue curve from 0 to 1.96). Based on this model, the reported significant results are just 23% of all the results that the selection model implies were obtained (the expected discovery rate). Due to the small number of studies, the 95% confidence interval around this point estimate ranges from 5% to 51%, but even the upper bound is well below the observed rate of 92% significant results. In short, z-curve makes it clear that selection bias is present, whereas p-curve provides no information about the presence of selection bias. Z-curve does not show how much p-hacking or the failure to report non-significant results contributes to the discrepancy between the observed and expected discovery rates, but that is not important. What matters is that questionable practices contributed to the evidence for the ovulatory cycle hypothesis.
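
As a minimal illustration of the first step, the conversion from two-sided p-values to absolute z-scores looks like this (the p-values below are made up, not taken from the supplement):

import numpy as np
from scipy import stats

p = np.array([.001, .02, .04, .06])          # illustrative two-sided p-values
z = stats.norm.ppf(1 - p / 2)                # absolute z-scores
print(z)                                     # p = .06 gives z = 1.88, below the 1.96 threshold
print(np.mean(z > stats.norm.ppf(.975)))     # observed rate of significant results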

Z-curve also provides a different answer to the question of whether the data provide evidence for the hypothesis after taking selection bias into account. One way to address this question is the expected replication rate (ERR). The ERR is a measure of the average power of the studies with significant results. It predicts the outcome of exact replication studies with the same sample sizes because the long-run rate of significant results is determined by the average power of these studies (Brunner & Schimmack, 2021). The point estimate of the ERR is 23%. This is a bit lower than the implied power in p-curve plots, which show the predicted line for 33% power. More importantly, this point estimate also comes with a wide confidence interval due to the small number of studies, and the 95% confidence interval includes a value of 5%, which is expected when the null hypothesis is true and all studies are false positives. Thus, there is insufficient evidence to rule out the possibility that all studies are false positives. Moreover, even if some of these studies are not false positives, replication failures with the same sample sizes are likely, and replication studies would need larger samples to provide evidence for the effect, despite the fact that two studies had sample sizes over 7,000 participants. This implies that the true effect sizes are very small even if they are not zero.
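
The principle that the average power of significant studies predicts the outcome of exact replications can be checked with a small simulation; this is only an illustration of the logic in Brunner and Schimmack (2021), not the z-curve estimator itself.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
crit = stats.norm.ppf(.975)                      # two-sided alpha = .05
mu = rng.uniform(0, 3, 100_000)                  # heterogeneous true (noncentral) means
z_original = rng.normal(mu, 1)                   # original studies
sig = np.abs(z_original) > crit                  # selection for significance

# Average true power of the studies that produced significant results ...
power = stats.norm.sf(crit - mu) + stats.norm.cdf(-crit - mu)
print(power[sig].mean())

# ... matches the long-run success rate of exact replications of those studies.
z_replication = rng.normal(mu[sig], 1)
print((np.abs(z_replication) > crit).mean())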

Finally, z-curve 2.0 makes it possible to estimate the false positive risk, that is, the percentage of significant results that are false positives. This risk can be estimated from the expected discovery rate using a formula by Soric (1989). The point estimate is 22%, suggesting that some of the results may not be false positives, but the 95% confidence interval around this estimate ranges from 5% to 100%. Thus, the evidence is insufficient to rule out the possibility that all of the results are false positives at the typical level of certainty that is used to draw scientific conclusions (alpha = 5%).
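
Soric’s bound ties the maximum false discovery rate to the expected discovery rate (EDR): the lower the EDR, the larger the share of significant results that can be false positives. A minimal sketch, using a round illustrative EDR rather than the exact estimate from this analysis:

def soric_fdr(edr, alpha=.05):
    # Soric's (1989) upper bound on the false discovery rate, given the
    # expected discovery rate (edr) and the significance criterion (alpha).
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_fdr(.20), 2))  # an EDR of 20% allows a false positive risk of up to ~21%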

Importantly, the absence of evidence is not the same as evidence of the absence of an effect. The data are simply uninformative. Based on the 24 significant results in Gildersleeve et al.’s meta-analysis, we have no scientific evidence in support of the ovulatory shift hypothesis. The z-curve analysis makes it obvious that the data lack evidential value. In contrast, a p-curve analysis of these data allowed the authors to claim robust evidence for the hypothesis. Not surprisingly, p-curve results are often used to claim that effects are real, and p-curve is rarely used to claim that data have no evidential value. This problem is even worse when data are more heterogeneous than the present data, because p-curve assumes a single parameter whereas z-curve is a mixture model that allows for heterogeneity in population z-scores.

If you find the z-curve results convincing, you may not want to trust p-curve results and may prefer to subject the data to a z-curve analysis to ensure that you are not fooled by p-hacked data.

If you trust p-curve and evolutionary theory, you might change your mind after reading the reflections of a leading researcher in this area, whose lab contributed 12 of the 25 results in this meta-analysis.

Gangestad, quoted in “The Wax and Wane of Ovulating-Woman Science” by Daniel Engber in Slate:
“When we wrote the book, we were drawing on a broad literature,” Gangestad told me, “but some of what we wrote was just garbage because we trusted all that work, including our own.”
