P-Curve Does Not Detect P-Hacking: What Does?

The early 2010s were a time of upheaval in psychological science. Many psychologists lost faith in the ability of significance testing to test psychological theories. Until then, the common concern was that many studies were doomed to fail because low statistical power made non-significant results likely. By the same logic, a significant result was assumed to be a sure sign of a real effect. This logical fallacy ignored that psychology journals nearly always report successful rejections of null hypotheses. This selection for significance in published results renders statistical significance insignificant (Sterling, 1959). In the worst-case scenario, every published significant result is the result of many attempts without a real effect (Rosenthal, 1979).

An influential article by Simmons, Nelson, and Simonsohn (2011) showed that researchers may not need to try very often to get a significant result. Using a number of statistical tricks, known as questionable research practices or p-hacking, it is possible to get a significant result in every other study even without a real effect. Ample evidence shows that publication bias and p-hacking contribute to the high success rate in psychological journals (Francis, 2012; Schimmack, 2012). The observed discovery rate (i.e., the percentage of significant results) is simply too high, given the true power of the studies to produce significant results.

For the diagnosis of excessive significance, it is irrelevant whether bias is produced by the omission of non-significant results or by p-hacking, that is, the use of statistical tricks to turn a non-significant result into a significant one. Both selection for significance and p-hacking undermine the purpose of meta-analyses to estimate the true population effect size. Several methods have been developed to correct for bias in meta-analyses. For the purpose of bias correction, however, it matters whether selection or p-hacking produced the excess of significant results (Carter et al., 2019). It is therefore important to distinguish between selection bias and p-hacking in meta-analyses that correct for bias. Several statistical tools try to correct for bias in observed data to evaluate a specific literature, including p-curve and a very similar method called p-uniform.

P-Curve

Simonsohn, Nelson, and Simmons (2014) developed a statistical method called p-curve. The basic idea of p-curve is that p-values have a uniform distribution when the null-hypothesis is true. This is also true for the distribution of the p-values below .05 that are typically used to reject the null-hypothesis. Thus, if only significant results are available, which is often the case in psychology journals, the distribution of p-values can be used to test the null-hypothesis that all significant results are false positive results (i.e., the null hypothesis is true in all studies). In this case, the distribution of the significant p-values is uniform. A variety of tests have been developed to test this null hypothesis. The specifics of these tests are not relevant here.
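
This property is easy to verify with a small simulation. The sketch below (two-group t-tests with no true effect; the sample sizes and number of studies are arbitrary choices for illustration) shows that the full p-value distribution is flat and that the distribution of the significant p-values is flat as well, with roughly four times as many p-values above .01 as below .01.

```r
# Simulate two-group studies with no true effect (d = 0)
set.seed(123)
n.studies <- 10000
n <- 50                                  # per-group sample size (arbitrary)
p <- replicate(n.studies, t.test(rnorm(n), rnorm(n))$p.value)

hist(p, breaks = 20)                     # flat: p-values are uniform when H0 is true
sig <- p[p < .05]
hist(sig, breaks = 10)                   # still flat within the significant range
mean(sig < .01) / mean(sig >= .01)       # about 1:4, as a uniform distribution implies
```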

The p-curve authors have pointed out that many p-hacking methods produce an abnormal distribution of p-values that is biased towards weak evidence (i.e., more p-values above .01 than below .01). They call this a left-skewed distribution. So far, this test is the only direct test of p-hacking in actual data. The key problem with this test is that it only works when the null hypothesis is true in all studies and the significant results were produced by p-hacking. If the null hypothesis is false and the studies produced significant results because there is a real effect, the method fails because true effects produce more p-values below .01, which masks the surplus of just-significant p-values produced by p-hacking. This is a problem because p-hacking still inflates effect size estimates in meta-analyses. Here I show alternative ways to diagnose p-hacking when the null-hypothesis is false, which can be used to correct for p-hacking in meta-analyses.
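
A minimal sketch of one p-hacking strategy, optional stopping, illustrates the tell-tale pile-up of just-significant p-values when the null hypothesis is true. This is only an illustration; it is not the specific set of p-hacking routines examined by the p-curve authors.

```r
# Optional stopping: peek after every 10 added observations per group and
# stop as soon as p < .05 (or when n.max is reached).
set.seed(456)
phack <- function(d = 0, n.start = 20, n.max = 100, step = 10) {
  x <- rnorm(n.start); y <- rnorm(n.start, mean = d)
  p <- t.test(y, x)$p.value
  while (p >= .05 && length(x) < n.max) {
    x <- c(x, rnorm(step)); y <- c(y, rnorm(step, mean = d))
    p <- t.test(y, x)$p.value
  }
  p
}
p <- replicate(5000, phack(d = 0))
sig <- p[p < .05]
mean(sig < .01) / mean(sig >= .01)   # well below the 1:4 ratio of a uniform distribution
hist(sig, breaks = 10)               # piles up near .05: a left-skewed p-curve
```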

Effect of P-Hacking on Meta-Analyses

Carter et al. (2019) conducted an extensive simulation study to examine the influence of p-hacking and publication bias on effect size estimates using a variety of meta-analytic methods. Selection models are one of the more promising approaches to correct for bias (McShane, Böckenholt, & Hansen, 2016). Selection models can assume homogeneity of the population effect size (i.e., all studies have the same population effect size) or heterogeneity (i.e., different studies have different population effect sizes). The heterogeneous model assumes that population effect sizes are normally distributed and estimates the standard deviation of this distribution (tau). More importantly, the selection model allows for bias. To do so, researchers specify ranges of p-values that have the same probability of being published. A simple model distinguishes between significant results in favor of a specific hypothesis (p < .025, one-tailed) and all other results (p > .025, one-tailed). Other specifications are possible and often more plausible. For example, it is easy to find studies with p-values just above .05 (two-sided) that are used to reject the null-hypothesis. These marginally significant results are more likely to be published than results that clearly do not support a hypothesis. This can be specified with a model with steps at .025 and .05 (one-tailed). It is also possible that researchers select for the direction of an effect, even if the effects are not significant. This can be tested with another step at p = .5 (one-tailed) that distinguishes between positive (p < .50) and negative (p > .50) results.
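
In R, these intervals can be specified with the weightr package. The sketch below assumes the weightfunct() interface and uses placeholder toy data; the cut points are one-tailed p-values and the last interval always ends at 1.

```r
library(weightr)

# Toy data: 200 unbiased studies with d = .3 and n = 50 per group (placeholder values)
k <- 200; n <- 50
d <- rnorm(k, mean = .3, sd = sqrt(2 / n))   # observed standardized mean differences
v <- rep(2 / n, k)                           # approximate sampling variances

weightfunct(effect = d, v = v, steps = c(.025, 1))           # significant vs. all other results
weightfunct(effect = d, v = v, steps = c(.025, .05, 1))      # adds an interval for marginally significant results (two-tailed p between .05 and .10)
weightfunct(effect = d, v = v, steps = c(.025, .05, .5, 1))  # adds a step at .5 to capture selection for the predicted direction
```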

The key problem of selection models is that they assume selection for significance rather than p-hacking. The following example shows how this assumption influences the results and that p-hacking leads to an underestimation of effect sizes. These problems are a bigger concern when power is low, so that more p-hacking is needed to get significant results, and when heterogeneity is relatively high. Heterogeneity cannot be too high, however, because large population effect sizes produce high power.

I used Carter et al.'s simulation with high p-hacking. I set the mean of the population effect sizes to zero with a wide standard deviation of tau = .4. I simulated 1,000 studies so that sampling error is small enough to reveal systematic biases in the methods.
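
The sketch below generates data with these properties. The exact multi-strategy p-hacking routine of Carter et al. (2019) is not reproduced here; the optional-stopping function from above serves as a stand-in, so the specific numbers will differ somewhat from the results reported below.

```r
# 1,000 studies with population effects drawn from N(0, .4); every study is
# p-hacked by optional stopping (a simplified stand-in for Carter et al.'s routine).
set.seed(789)
k <- 1000
theta <- rnorm(k, mean = 0, sd = .4)   # true study-level effect sizes (tau = .4)

phack.study <- function(delta, n.start = 20, n.max = 100, step = 10) {
  x <- rnorm(n.start); y <- rnorm(n.start, mean = delta)
  while (t.test(y, x)$p.value >= .05 && length(x) < n.max) {
    x <- c(x, rnorm(step)); y <- c(y, rnorm(step, mean = delta))
  }
  n <- length(x)
  d <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)   # observed Cohen's d
  c(d = d, v = 2 / n + d^2 / (4 * n))                      # d and its sampling variance
}

sim <- t(sapply(theta, phack.study))
d <- sim[, "d"]; v <- sim[, "v"]
```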

I used the weightr package in R to fit the selection model to the data. I first specified the standard model that assumes only selection for significance at p = .025, one-tailed. The key result is the estimate of the average effect size, d = -.21, 95%CI = -.24 to -.18. This shows that p-hacking leads to underestimation of the average effect size when bias is assumed to be caused by selection for significance. Heterogeneity was also underestimated, but only slightly, tau = .30, 95%CI = .28 to .32.

To model p-hacking, I added a step at p = .005 (one-tailed), which corresponds to a two-tailed p-value of .01. The rationale is that p-hacking produces more just-significant results than a model without bias would predict. To capture this prediction, the extra step isolates p-values between .05 and .01 (two-tailed). First of all, the model correctly notices that just-significant p-values are overrepresented, selection weight = 2.77 (nearly three times as frequent as expected relative to the reference interval of p-values below .01, which is treated as unbiased). This shows that a step at .005 (one-tailed) can be used to diagnose p-hacking.

However, adding this step did not produce unbiased estimates. The average was still underestimated, d = -.14, 95%CI = -.19 to -.09. In contrast, the estimate of heterogeneity improved and was no longer significantly different from the true value, tau = .38, 95%CI = .34 to .41.
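
Assuming the simulated vectors d and v from the sketch above (which uses a simplified p-hacking routine, so the estimates will not exactly match the values reported here), both models can be fit with weightfunct():

```r
library(weightr)

# Standard selection model: one step at p = .025 (one-tailed), i.e., selection for significance
fit.selection <- weightfunct(effect = d, v = v, steps = c(.025, 1))

# Extended model: extra step at p = .005 (one-tailed) to capture the pile-up of
# just-significant results (two-tailed p between .01 and .05) produced by p-hacking
fit.phacking  <- weightfunct(effect = d, v = v, steps = c(.005, .025, 1))

fit.selection   # compare the estimates of the mean, tau, and the selection weights
fit.phacking
```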

These results are encouraging and suggest that the performance of selection models can be improved by distinguishing between selection for significance and p-hacking, using the prevalence of p-values between .05 and .01 to diagnose p-hacking.

To verify that selection does not produce too many just-significant results, I simulated data without bias and then deleted 50% of the non-significant results. The model still slightly underestimated the true average, d = -.08, 95%CI = -.13 to -.03, but it estimated heterogeneity correctly, tau = .42, 95%CI = .39 to .46. It also correctly diagnosed that about half of the non-significant results were missing, selection weight = 60%, 95%CI = 41% to 78%.
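
A sketch of this selection-only simulation, assuming fixed per-group sample sizes of 50 (an arbitrary choice) and the same heterogeneous population effects as before:

```r
# Unbiased studies (no p-hacking), then 50% of the non-significant results are deleted.
set.seed(101)
k <- 1000; n <- 50
theta.sel <- rnorm(k, mean = 0, sd = .4)
d.sel <- rnorm(k, mean = theta.sel, sd = sqrt(2 / n))   # observed effects (approximate sampling error)
v.sel <- 2 / n + d.sel^2 / (4 * n)
p.sel <- 2 * pnorm(-abs(d.sel / sqrt(v.sel)))           # two-tailed p-values
keep  <- p.sel < .05 | runif(k) < .5                    # keep all significant results, half of the rest
d.sel <- d.sel[keep]; v.sel <- v.sel[keep]

# Fit as before, e.g. weightfunct(effect = d.sel, v = v.sel, steps = c(.005, .025, 1))
```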

Importantly, adding a step at p = .005 (one-tailed) did not change these results. The model correctly noted no special selection for just-significant results, selection weight = 1.11, 95%CI = .71 to 1.51. The average was still slightly underestimated, d = -.07, 95%CI = -.13 to -.02. Heterogeneity was estimated the same as in the model without the extra step, tau = .43, 95%CI = .39 to .47.

What if only significant results are available?

The previous simulation included positive and negative non-significant results. However, in reality, most results are positive and significant. I therefore examined the performance of the selection model under conditions that led to the development of p-curve.

I used the same simulation of p-hacking, but selected only statistically significant positive results. Before I present the results, it is important to discuss the performance criterion that should be used to evaluate the model. As noted by Simonsohn et al. (2018), when only positive and significant results are available and results are heterogeneous, we want to know the true average effect size of the studies that were published, not the hypothetical average of studies that may even have produced negative results. Moreover, selection models cannot estimate the average of a hypothetical set of studies with negative results if there are no negative results. I therefore computed the mean and standard deviation of the simulated population effect sizes in the selected set of studies as the criterion values. The values were mean d = .26 and sd = .28.
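
A sketch of this selection step and of the criterion values, assuming the vectors theta, d, and v from the p-hacking sketch above (the z-test below stands in for the original t-tests, so the selection is approximate):

```r
# Keep only positive, significant results and compute the criterion values
# from the true effects of the selected (published) studies.
z <- d / sqrt(v)
pub <- z > qnorm(.975)                   # positive and significant (two-tailed p < .05)
mean(theta[pub]); sd(theta[pub])         # criterion values for the published studies
d.pub <- d[pub]; v.pub <- v[pub]         # data passed to the selection models
```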

The default model that only assumes selection for significance produced an estimate of d = .18, 95%CI = .15 to .22, and an estimate of heterogeneity of tau = .15, 95%CI = .12 to .16. The model that specified p-hacking with a step at p = .005 (one-tailed) showed evidence of p-hacking, selection weight = 3.60, 95%CI = 2.36 to 4.84. The estimated average increased to d = .29, 95%CI = .24 to .35, and the estimate of heterogeneity was also higher, tau = .20, 95%CI = .16 to .24.

These results show that selection models do not require non-significant results and can diagnose and correct for p-hacking even if only positive significant results are available.

Conclusion

It is well-established that psychology journals publish too many statistically significant results. This bias has been attributed to publication bias or p-hacking. While publication bias and p-hacking both inflate effect size estimates when methods do not correct for bias, they have different effects on methods that aim to correct for bias. Traditional specifications of selection models assume selection for significance and do not take p-hacking into account. This leads to underestimation of the average effect size and of heterogeneity. Here I presented a solution to this problem. P-hacking tends to produce just-significant results. Thus, p-hacking can be diagnosed by specifying a separate interval for p-values between .05 and .01 (two-sided). Rather than selection (weights below 1), we expect overrepresentation (weights > 1) for this bin of p-values. I confirmed this prediction with a few simulation studies and showed that a model with a step at .005 (one-tailed) produces better estimates of the average and the heterogeneity of population effect sizes. This even works when only positive and significant results are available.

The ability of selection models to estimate averages and heterogeneity, while taking different biases into account, makes these models very attractive for effect size meta-analysis. The key drawback of other methods (p-curve, p-uniform, PET-PEESE) is that they do not provide information about the amount of heterogeneity. The main problem of traditional meta-analytic methods is that they overestimate the average and heterogeneity when bias is present. The inclusion of an extra step to model p-hacking may further improve the performance of selection models.

The estimated average effect size is d = .42, 95%CI = .33 to .51. This is close to the true value, and the true value is included in the 95% confidence interval. The model also has a higher estimate of heterogeneity, tau = .20, 95%CI = .00 to .30, but the 95%CI still includes zero and the model fails to reject the false null-hypothesis that there is no heterogeneity.

P-uniform is similar to p-curve (McShane et al., 2016). I used p-uniform because p-curve does not have an R package. I used the LN1MINP method because it performs better than the LNP method, which tends to overestimate effect sizes and did not fit these data as well in the diagnostic plot. The p-uniform estimate was d = .35, 95%CI = .26 to .44. This result shows underestimation of the true average, but the 95%CI includes the true parameter and the difference is not practically significant.
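
For completeness, a minimal sketch of the p-uniform call, assuming the interface of the puniform R package and the positive, significant results (d.pub, v.pub) from the sketch above:

```r
library(puniform)
# LN1MINP estimator applied to the positive, significant effect sizes
puniform(yi = d.pub, vi = v.pub, side = "right", method = "LN1MINP")
```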


