Smart P-Hackers Have File-Drawers and Are Not Detected by Left-Skewed P-Curves


In the early 2010s, two articles suggested that (a) p-hacking is common, (b) false positives are prevalent, and (c) left-skewed p-curves reveal p-hacking to produce false positive results (Simmons et al., 2011; Simonsohn, 2014a). However, empirical application of p-curve have produced few left-skewed p-curves. This raises question about the absence of left-skewed z-curves. One explanation is that some p-hacking strategies do not produce notable left skew and that these strategies may be used more often because they require fewer resources. Another explanation could be that file-drawering is much more common than p-packing. Finally, it could be that most of the time p-hacking is used to inflate true effect sizes rather than to chase false positive results. P-curve plots do not allow researchers to distinguish these alternative hypotheses. Thus, p-curve should be replaced by more powerful tools that detect publication bias or p-hacking and estimate the amount of evidence against the null-hypothesis. Fortunately, there is an app for this (zcurve package).


Simonsohn, Nelson, and Simmons (2014) coined the term p-hacking for a set of questionable research practices that increase the chances of obtaining a statistically significant result. In the worst case scenario, p-hacking can produce significant results without a real effect. In this case, the statistically significant result is entirely explained by p-hacking.

Simonsohn et al. (2014) make a clear distinction between p-hacking and publication bias. Publication bias is unlikely to produce a large number of false positive results because it requires 20 attempts to produce a single significant result in either direction or 40 attempts to get a significant result with a predicted direction. In contrast, “p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables (Simmons et al., 2011)” (p. 535).

There have been surprisingly few investigations of the best way to p-hack studies. Some p-hacking strategies may work in simulation studies that do not impose limits on resources, but they may not be practical in real applications of p-hacking. I postulate that the main goal of p-hacking is to get significant results with minimal resources rather than with a minimum number of studies and that p-hacking is more efficient with a file drawer of studies that are abandoned.

Simmons et al. (2011) and Simonsohn et al. (2014) suggest one especially dumb p-hacking strategy, namely simply collecting more data until a significant result emerges.

“For example, consider a researcher who p-hacks by analyzing data after every five per-condition participants and ceases upon obtaining significance.” (Simonsohn et al., 2014).

This strategy is known to produce more p-values close to .04 than .01.

The main problem with this strategy is that sample sizes can get very large before the significant result emerges. I limited the maximum sample size before a researcher would give up to N = 200. A limit of 20 makes sense because N = 200 would allow a researcher to run 20 studies with the starting sample size of N = 10 to get a significant result. The p-curve plot shows a similar distribution as the simulation in the p-curve article.

The success rate was 25%. This means, 75% of studies with N = 200 produced a non-significant result that had to be put in the file-drawer. Figure 2 shows the distribution of sample sizes for the significant results.

The key finding is that the chances of a significant results drop drastically after the first attempt. The reason is that the most favorable results in the first trial produce a significant result in the first trial. As a result, the non-significant ones are less favorable. It would be better to start a new study because the chances to get a significant result are higher than adding participants after an unsuccessful attempt. In short, just adding participants to get significant is a dumb p-hacking method.

Simonsohn et al. (2014) do not disclose the stopping rule, but they do show that they got only 5.6% significant results compared to the 25% with N = 200. This means they stopped much earlier. Simulation suggest that they stopped when N = 30 (n = 15 per cell) did not produce a significant result (1 million simulations, success rate = 5.547%). The success rates for N = 10, 20, and 30 were 2.5%, 1.8%, and 1.3%, respectively. These probabilities can be compared to a probability of 2.5 for each test with N = 10. It is clear that trying three studies is a more efficient strategy than to add participants until N reaches 30. Moreover, neither strategy avoids producing a file drawer. To avoid a file-drawer, researchers would need to combine several questionable research practices (Simmons et al., 2011).

Simmons et al. (2011) proposed that researchers can add covariates to increase the number of statistical tests and to increase the chances of producing a significant result. Another option is to include several dependent variables. To simplify the simulation, I am assuming that dependent variables and covariates are independent of each other. Sample size has no influence on these results. To make the simulation consistent with typical results in actual studies, I used n = 20 per cell. Adding covariates or additional dependent variables requires the same amount of resources. For example, participants make additional ratings for one more item and this item is either used as a covariate or as a dependent variable. Following Simmons et al. (2011), I first simulated a scenario with 10 covariates.

The p-curve plot is similar to the repeated peaking plot and is called left-skewed. The success rate, however, is disappointing. Only 4.48% of results were statistically significant. This suggests that collecting data to be used as covariates is another dumb p-hacking strategy.

Adding dependent variables is much more efficient. In the simple scenario, with independent DVs, the probability of obtaining a significant result equals 1-(1-.025)^11 = 24.31%. A simulation with 100,000 trials produced a percentage of 24.55%. More important, the p-curve is flat.

Correlation among the dependent variables produces a slight left-skewed distribution, but not as much as the other p-hacking methods. With a population correlation of r = .3, the percentages are 17% for p < .01 and 22% for p between .04 and .05.

These results provide three insights into p-hacking that have been overlooked. First, some p-hacking methods are more effective than others. Second, the amount of left-skewness varies across p-hacking methods. Third, efficient p-hacking produces a fairly large file-drawer of studies with non-significant results because it is inefficient to add participants to data that failed to produce a significant result.


False P-curve Citations

The p-curve authors made it fairly clear what p-curve does and what it does not do. The main point of a p-curve analysis is to examine whether a set of significant results was obtained at least partially with some true effects. That is, at least in a subset of the studies the null-hypothesis was false. The authors call this evidential value. A right-skewed p-curve suggests that a set of significant results have evidential value. This is the only valid inference that can be drawn from p-curve plots.

“We say that a set of significant findings contains evidential value when we can rule out selective reporting as the sole [italics added] explanation of those findings” (p. 535).

The emphasize on selective reporting as the sole explanation is important. A p-curve that shows evidential value can still be biased by p-hacking and publication bias, which can lead to inflated effect size estimates.

To make sure that I interpret the article correctly, I asked one of the authors on twitter and the reply confirmed that p-curve is not a bias test, but strictly a test that some real effects contributed to a right-skewed p-curve. The answer also explains why the p-curve authors did not care about testing for bias. They assume that bias is almost always present; which makes it unnecessary to test for it.

Although the authors stated the purpose of p-curve plots clearly, many meta-analysists have misunderstood the meaning of a p-curve analysis and have drawn false conclusions about right-skewed p-curves. For example, Rivers (2017) writes that a right-skewed p-curve suggests “that the WIT effect is a) likely to exist, and b) unlikely biased by extensive p-hacking.” The first inference is correct. The second one is incorrect because p-curve is not a bias detection method. A right-skewed p-curve could be a mixture of real effects and bias due to selective reporting.

Rivers also makes a misleading claim that a flat p-curve shows the lack of evidential value, whereas “a significantly left-skewed distribution indicates that the effect under consideration may be biased by p-hacking.” These statements are wrong because a flat p-curve can also be produced by p-hacking, especially when a real effect is also present.

Rivers is by no means the only one who misinterpreted p-curve results. Using the 10 most highly cited articles that applied p-curve analysis, we can see the same mistake in several articles. A tutorial for biologists claims “p-curve can, however, be used to identify p-hacking, by only considering significant findings” (Head, 2015, p. 3). Another tutorial for biologists repeats this false interpretation of p-curves. “One proposed method for identifying P-hacking is ‘P-curve’ analysis” (Parker et al., 2016, p. 714). A similar false claim is made by Polanin et al. (2016). “The p-curve is another method that attempts to uncover selective reporting, or “p-hacking,” in primary reports (Simonsohn, Nelson, Leif, & Simmons, 2014)” (p. 211). The authors of a meta-analysis of personality traits claim that they conduct p-curve analyses “to check whether this field suffers from publication bias” (Muris et al., 2017, 186). Another meta-analysis on coping also claims “p-curve analysis (Simonsohn, Nelson, & Simmons, 2014) allows the detection of selective reporting by researchers who “file-drawer” certain parts of their studies to reach statistical significance” (Cheng et al., 2014; p. 1594).

Shariff et al.’s (2016) article on religious priming effects provides a better explanation of p-curve, but their final conclusion is still misleading. “These results suggest that the body of studies reflects a true effect of religious priming, and not an artifact of publication bias and p-hacking.” (p. 38). The first part is correct, but the second part is misleading. The correct claim would be “not solely the result of publication bias and p-hacking”, but it is possible that publication bias and p-hacking inflate effect size estimates in this literature. The skew of p-curves simply does not tell us about this. The same mistake is made by Weingarten et al. (2016). “When we included all studies (published or unpublished) with clear hypotheses for behavioral measures (as outlined in our p-curve disclosure table), we found no evidence of p-hacking (no left-skew), but dual evidence of a right-skew and flatter than 33% power.” (p. 482). While a left-skewed p-curve does reveal p-hacking, the absence of left-skew does not ensure that p-hacking was absent. The same mistake is made by Steffens et al. (2017), who interpret a right-skewed p-curve as evidence “that the set of studies contains evidential value and that there is no evidence of p-hacking or ambitious p-hacking” (p. 303).

Although some articles correctly limit the interpretation of the p-curve to the claim that the data contain evidential value (Combs et al., 2015; Rand, 2016; Siks et al., 2018), the majority of applied p-curve articles falsely assume that p-curve can reveal the presence or absence of p-hacking or publication bias. This is incorrect. A left-skewed p-curve does provide evidence of p-hacking, but the absence of left-skew does not imply that p-hacking is absent.

How prevalent are left-skewed p-curves?

After 2011, psychologists were worried that many published results might be false positive results that were obtained with p-hacking (Simmons et al., 2011). As the combination of p-hacking in the absence of a real effect does produce left-skewed p-curves, one might expect that a large percentage of p-curve analyses revealed left-skewed distributions. However, empirical examples of left-skewed p-curves are extremely rare. Take, power-posing as an example. It is widely assumed these days that original evidence for power-posing was obtained with p-hacking and that the real effect size of power-posing is negligible. Thus, power-posing would be expected to show a left-skewed p-curve.

Simmons and Simonsohn (2017) conducted a p-curve analysis of the power-posing literature. They did not observe a left-skewed p-curve. Instead, the p-curve was flat, which justifies the conclusion that the studies contain no evidential value (i.e., we cannot reject the null-hypothesis that all studies tested a true null-hypothesis). The interpretation of this finding is misleading.

“In this Commentary, we rely on p-curve analysis to answer the following question: Does the literature reviewed by Carney et al. (2015) suggest the existence of an effect once one accounts for selective reporting? We conclude that it does not. The distribution of p values from those 33 studies is indistinguishable from what would be expected if (a) the average effect size were zero and (b) selective reporting (of studies or analyses) were solely responsible for the significant effects that were published”

The interpretation only focus on selective reporting (or testing of independent DVs) as a possible explanation for lack of evidential value. However, usually the authors emphasize p-hacking as the most likely explanation for significant results without evidential value. Ignoring p-hacking is deceptive because a flat p-curve can occur as a combination of p-hacking and real effect, as the authors showed themselves (Simonsohn et al., 2014).

Another problem is that significance testing is also one-sided. A right-skewed p-curve can be used to reject the null-hypotheses that all studies are false positives, but the absence of significant right skew cannot be used to infer the lack of evidential value. Thus, p-curve cannot be used to establish that there is no evidential value in a set of studies.

There are two explanations for the surprising lack of left-skewed p-curves in actual studies. First, p-hacking may be much less prevalent than is commonly assumed and the bigger problem is publication bias which does not produce a left-skewed distribution. Alternatively, false positive results are much rarer than has been assumed in the wake of the replication crisis. The main reason for replication failures could be that published studies report inflated effect sizes and that replication studies with unbiased effect size estimates are underpowered and produce false negative results.

How useful are Right-skewed p-curves?

In theory, left-skew is diagnostic of p-hacking, but in practice left-skew is rarely observed. This leaves right-skew as the only diagnostic information of p-curve plots. Right skew can be used to reject the null-hypothesis that all of the significant results tested a true null-hypothesis. The problem with this information is shared by all significance tests. It does not provide evidence about the effect size. In this case, it does not provide evidence about the percentage of significant results that are true positives (the false positive risk), nor does it quantify the strength of evidence.

This problem has been addressed by other methods that quantify how strong the evidence against the null-hypothesis is. Confusingly, the p-curve authors used the term p-curve for a method that estimates the strength of evidence in terms of the unconditional power of the set of studies (Simonsohn et al., 2014b). The problem with these power estimates is that they are biased when studies are heterogeneous (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). Simulation studies show that z-curve is a superior method to quantify the strength of evidence against the null-hypothesis. In addition, z-curve.2.0 provides additional information about the false positive risk; that is the maximum number of significant results that may be false positives.

In conclusion, p-curve plots no longer produce meaningful information. Left-skew can be detected in z-curves plots as well as in p-curve plots and is extremely rare. Right skew is diagnostic of evidential value, but does not quantify the strength of evidence. Finally, p-curve plots are not diagnostic when data contain evidential value and bias due to p-hacking or publication bias.

Leave a Reply