A Clarification of P-Curve Results: The Presence of Evidence Does Not Imply the Absence of Questionable Research Practices

If you are interested in p-curve analysis, you may also find the following post interesting.

Z-Curve: An even better p-curve


https://replicationindex.com/2018/05/10/an-even-better-p-curve/

This post is not a criticism of p-curve.  The p-curve authors have been very clear in their writing that p-curve is not designed to detect publication bias.  However, numerous articles make the surprising claim that p-curve was used to test for publication bias.  The purpose of this post is simply to correct this misunderstanding of p-curve.

Questionable Research Practices and Excessive Significance

Sterling (1959) pointed out that psychology journals have a surprisingly high success rate: over 90% of articles reported statistically significant results in support of authors’ predictions.  This success rate would be surprising even if most predictions in psychology were true.  The reason is that the results of a study are not only influenced by cause-effect relationships.  Another factor that influences the outcome of a study is sampling error.  Even if researchers are nearly always right in their predictions, some studies will fail to provide sufficient evidence for the predicted effect because sampling error can obscure the effect.  The probability that a study detects a true effect is called statistical power.  Just as bigger telescopes are needed to detect more distant stars with a weaker signal, bigger sample sizes are needed to detect small effects (Cohen, 1962; 1988).  Sterling et al. (1995) pointed out that the typical power of studies in psychology does not justify the high success rate in psychology journals.  In other words, the success rate was too good to be true.  This means that published articles are selected for significance.
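To illustrate how power depends on sample size and effect size, consider a rough power calculation in R (the effect sizes and sample sizes are hypothetical and only serve as an illustration, not as estimates for any particular literature):

power.t.test(n = 20, delta = .5, sd = 1, sig.level = .05)$power   # power of about .34 for a medium effect (d = .5) with n = 20 per group
power.t.test(delta = .2, sd = 1, sig.level = .05, power = .80)$n  # about 394 participants per group are needed to detect a small effect (d = .2) with 80% power

If typical power is closer to 34% than to 90%, an honestly reported literature cannot produce significant results over 90% of the time.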

The bias in favor of significant results is typically called publication bias (Rosenthal, 1979).  However, the term publication bias does not explain the discrepancy between estimates of statistical power and success rates in psychology journals.  John et al. (2012) listed a number of questionable research practices that can inflate the percentage of significant results in published articles.

One mechanism is simply not to report non-significant results.  Rosenthal (1979) suggested that non-significant results end up in the proverbial file drawer; that is, a whole data set remains unpublished.  Another possibility is that researchers use multiple exploratory analyses to find a significant result and do not disclose their fishing expedition.  These practices are now widely known as p-hacking.

Unlike John et al. (2012), the p-curve authors draw a sharp distinction between not disclosing an entire dataset (publication bias) and not disclosing all statistical analyses of a dataset (p-hacking).

QRP = Publication Bias + P-Hacking

We Don’t Need Tests of Publication Bias

The p-curve authors assume that publication bias is unavoidable.

“Journals tend to publish only statistically significant evidence, creating a scientific record that markedly overstates the size of effects. We provide a new tool that corrects for this bias without requiring access to nonsignificant results.”  (Simonsohn, Nelson, Simmons, 2014).

“By the Way, of Course There is Publication Bias. Virtually all published studies are significant (see, e.g., Fanelli, 2012; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995), and most studies are underpowered (see, e.g., Cohen, 1962). It follows that a considerable number of unpublished failed studies must exist. With this knowledge already in hand, testing for publication bias on paper after paper makes little sense” (Simonsohn, 2012, p. 597).

“Yes, p-curve ignores p>.05 because it acknowledges that we observe an unknowably small and non-random subset of p-values >.05.”  (personal email, January 18, 2015).

I hope these quotes make it crystal clear that p-curve is not designed to examine publication bias: the authors assume that selection for significance is unavoidable.  From this perspective, any statistical test that reveals no evidence of publication bias is a false negative; the set of studies was simply not large enough to detect the bias.

Another concern raised by Uri Simonsohn is that bias tests may reveal statistically significant bias that has no practical consequences.

Consider a literature with 100 studies, all with p < .05, but where the implied statistical power is “just” 97%. Three expected failed studies are missing. The test from the critiques would conclude there is statistically significant publication bias; its magnitude, however, is trivial. (Simonsohn, 2012, p. 598).

k.sig = 100; k.studies = 100; power = .97
pbinom(k.studies - k.sig, k.studies, 1 - power)  # = 0.048

This is a valid criticism that applies to all p-values.  A p-value only provides information about the contribution of random sampling error.  A p-value of .048 suggests that it is unlikely to observe only significant results even if all 100 studies have 97% power to produce a significant result.   However, with 97% observed power, the 100 studies provide credible evidence for an effect, and even the inflation of the average effect size is minimal.

A different conclusion would follow from a p-value less than .05 in a set of 7 studies that all show significant results.

k.sig = 7; k.studies = 7; power = .62
pbinom(k.studies - k.sig, k.studies, 1 - power)  # = 0.035

Rather than showing small bias with a large set of studies, this finding shows large bias with a small set of studies.  P-values do not distinguish between these two scenarios. Both outcomes are equally unlikely.  Thus, information about the probability of an event should always be interpreted in the context of the effect.  The effect size is simply the difference between the expected and observed rate of significant results.  In Simonsohn’s example, the effect size is small (1 – .97 = .03).  In the second example, the discrepancy is large (1 – .62 = .38).

The previous scenarios assume that only significant results are reported. However, in sciences that use preregistration to reduce deceptive publishing practices (e.g., medicine), non-significant results are more common.  When non-significant results are reported, bias tests can be used to assess the extent of bias.

For example, a literature may report 10 studies with only 4 significant results (a 40% success rate) and a median observed power of 30%.  In this case, the bias is small (.40 – .30 = .10), and a conventional meta-analysis would produce only slightly inflated estimates of the average effect size.  In contrast, p-curve would discard the 6 non-significant studies (60% of the evidence) because it assumes that the non-significant results are not trustworthy.  This is an unnecessary loss of information that could be avoided by testing for publication bias.
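A minimal sketch of such a bias test for this hypothetical literature, using the same binomial logic as the earlier examples and treating the median observed power of 30% as the expected success rate:

k.sig = 4; k.studies = 10; power = .30
pbinom(k.studies - k.sig, k.studies, 1 - power)  # = .35, no significant excess of significant results
k.sig / k.studies - power                        # = .10, the small discrepancy between observed and expected success rates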

In short, p-curve assumes that publication bias is unavoidable. Hence, tests of publication bias are unnecessary and non-significant results should always be discarded.

Why Do P-Curve Users Think P-Curve is a Publication Bias Test?

Example 1

I conducted a literature search for studies that used p-curve and was surprised by numerous claims that p-curve is a test of publication bias.

Simonsohn, Nelson, and Simmons (2014a, 2014b, 2016) and Simonsohn, Simmons, and Nelson (2015) introduced pcurve as a method for identifying publication bias (Steiger & Kühberger, 2018, p. 48).   

However, the authors do not explain how p-curve detects publication bias. Later on, they correctly point out that p-curve is a method that can correct for publication bias.

P-curve is a good method to correct for publication bias, but it has drawbacks. (Steiger & Kühberger, 2018, p. 48).   

Thus, the authors seem to confuse detection of publication bias with correction for publication bias.  P-curve corrects for publication bias, but it does not detect publication bias; it assumes that publication bias is present and a correction is necessary.

Example 2

An article in the medical journal JAMA Psychiatry also claimed that p-curve was used, alongside other methods, to assess publication bias.

Publication bias was assessed across all regions simultaneously by visual inspection of funnel plots of SEs against regional residuals and by using the excess significance test,  the P-curve method, and a multivariate analogue of the Egger regression test (Bruger & Howes, 2018, p. 1106).  

After reporting the results of several bias tests, the authors report the p-curve results.

P-curve analysis indicated evidential value for all measures (Bruger & Howes, 2018, p. 1106).

The authors seem to confuse presence of evidential value with absence of publication bias. As discussed above,  publication bias can be present even if studies have evidential value.

Example 3

To assess publication bias, we considered multiple indices. Specifically, we evaluated Duval and Tweedie’s Trim and Fill Test, Egger’s Regression Test, Begg and Mazumdar Rank Correlation Test, Classic Fail-Safe N, Orwin’s Fail-Safe N, funnel plot symmetry, P-Curve Tests for Right-Skewness, and Likelihood Ratio Test of Vevea and Hedges Weight-Function Model.

As in the previous example, the authors confuse evidence for evidential value (a significant right-skewed p-curve) with evidence for the absence of publication bias.

Example 4

The next example even claims that p-curve can be used to quantify the presence of bias.

Publication bias was investigated using funnel plots and the Egger regression asymmetry test. Both the trim and fill technique (Duval & Tweedie, 2000) and p-curve (Simonsohn, Nelson, & Simmons, 2014a, 2014b) technique were used to quantify the presence of bias (Korrel et al., 2017, p. 642).

The actual results section only reports that the p-curve is right skewed.

The p-curve for the remaining nine studies (p < .025) was significantly right skewed (binomial test: p = .002; continuous test full curve: Z = -9.94, p < .0001, and half curve Z = -9.01, p < .0001) (Korrel et al., 2017, p. 642)

These results do not assess or quantify publication bias.  One might consider the reported z-scores a quantitative measure of evidential value, because larger z-scores are less probable under the nil-hypothesis that all significant results are false positives. Nevertheless, strong evidential value (e.g., 100 studies with 97% power) does not imply that publication bias is absent, nor does it mean that publication bias is small.

A set of 1000 studies with 10% power is expected to produce 900 non-significant results and 100 significant results.  Removing the non-significant results produces large publication bias, but a p-curve analysis shows strong evidence against the nil-hypothesis that all studies are false positives.

set.seed(3)
# simulate z-scores for 1000 studies with 10% power; the mean, qnorm(.10, 1.96), is about 0.68, so that P(z > 1.96) = .10
Z = rnorm(1000, qnorm(.10, 1.96))
Z.sig = Z[Z > 1.96]                                   # keep only the ~100 significant results
Stouffer.Z = sum(Z.sig - 1.96) / sqrt(length(Z.sig))  # Stouffer test of the selection-corrected z-scores
Stouffer.Z  # = 4.89

The reason is that p-curve is a meta-analysis, and the results depend on the strength of evidence in individual studies and on the number of studies.  Strong evidence can be the result of many studies with weak evidence or of a few studies with strong evidence.  Thus, p-curve is a meta-analytic method that combines information from several small studies to draw inferences about a population parameter.  The main difference from older meta-analytic methods is that older methods assumed that publication bias is absent, whereas p-curve assumes that publication bias is present. Neither approach assesses whether publication bias is present, nor does it quantify the amount of publication bias.
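A simple illustration with made-up z-scores: in a Stouffer-style meta-analysis the combined z-score is the sum of the individual z-scores divided by the square root of the number of studies, so many weak studies and a few strong studies can yield the same combined evidence.

sum(rep(0.5, 100)) / sqrt(100)  # 100 studies with weak evidence (z = 0.5 each): combined Z = 5
sum(rep(2.5, 4)) / sqrt(4)      # 4 studies with strong evidence (z = 2.5 each): combined Z = 5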

Example 5

Sala and Gobet (2017) explicitly make the mistake of equating evidence of evidential value with evidence against publication bias.

Finally, a p-curve analysis was run with all the p values < .05 related to positive effect sizes (Simonsohn, Nelson, & Simmons, 2014). The results showed evidential values (i.e., no evidence of publication bias), Z(9) = -3.39, p = .003.  (p. 676).

As discussed in detail before, this is not a valid inference.

Example 6

Ironically, the interpretation of p-curve results as evidence that there is no publication bias contradicts the fundamental assumption of p-curve that we can safely assume that publication bias is always present.

The danger is that misuse of p-curve as a test of publication bias may give the false impression that psychological scientists are reporting their results honestly, while actual bias tests show that this is not the case.

It is therefore problematic if authors in high impact journals (not necessarily high quality journals) claim that they found evidence for the absence of publication bias based on a p-curve analysis.

To check whether this research field suffers from publication bias, we conducted p-curve analyses (Simonsohn, Nelson, & Simmons, 2014a, 2014b) on the most extended data set of the current meta-analysis (i.e., psychosocial correlates of the dark triad traits), using an on-line application (www.p-curve.com). As can be seen in Figure 2, for each of the dark triad traits, we found an extremely right-skewed p-curve, with statistical tests indicating that the studies included in our meta-analysis, indeed, contained evidential value (all ps < .001) and did not point in the direction of inadequate evidential value (all ps non-significant). Thus, it is unlikely that the dark triad literature is affected by publication bias (Muris, Merckelbach, Otgaar, & Meijer, 2017).

Once more, presence of evidential value does not imply absence of publication bias!

Evidence of P-Hacking  

Publication bias is not the only reason for the high success rates in psychology.  P-hacking will also produce more significant results than the actual power of studies warrants. In fact, the whole purpose of p-hacking is to turn non-significant results into significant ones.  Most bias tests do not distinguish between publication bias and p-hacking as causes of bias.  However, the p-curve authors make this distinction and claim that p-curve can be used to detect p-hacking.

Apparently, we should not assume that p-hacking is just as prevalent as publication bias; if we did, testing for p-hacking would be just as irrelevant as testing for publication bias.

The problem is that it is a lot harder to distinguish p-hacking from publication bias than the p-curve authors imply, and their p-curve test of p-hacking only works under very limited conditions.  Most of the time, the p-curve test of p-hacking will fail to provide evidence of p-hacking, and this result can be misinterpreted as evidence that results were obtained without p-hacking, which is a logical fallacy.

This mistake was made by Winternitz, Abbate, Huchard, Havlicek, & Gramszegi (2017).

Fourth and finally, as bias for publications with significant results can rely more on the P-value than on the effect size, we used the Pcurve method to test whether the distribution of significant P-values, the ‘P-curve’, indicates that our studies have evidential value and are free from ‘p-hacking’ (Simonsohn et al. 2014a, b).

The problem is that the p-curve test of p-hacking only works when evidential value is very low and only for some specific forms of p-hacking. For example, researchers can p-hack by testing many dependent variables. Selecting significant dependent variables is no different from running many studies with a single dependent variable and selecting entire studies with significant results; it is just more efficient.  In this case, the p-curve would not show the left skew that is considered diagnostic of p-hacking, as the simulation below illustrates.
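A minimal simulation of this scenario (assuming, for illustration, ten independent dependent variables, no true effects, and that one significant dependent variable is reported whenever at least one is available) shows that the reported p-values are uniformly distributed between 0 and .05; the p-curve is flat, not left-skewed:

set.seed(1)
p.reported = replicate(10000, {
  p = runif(10)                              # p-values of 10 dependent variables under the null hypothesis
  sig = p[p < .05]                           # significant results available for selective reporting
  if (length(sig) > 0) sig[1] else NA_real_  # report one significant dependent variable if there is one
})
mean(!is.na(p.reported))                                          # about 40% of the simulated experiments yield a reportable significant result
hist(p.reported[!is.na(p.reported)], breaks = seq(0, .05, .005))  # approximately flat p-curve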

Even a flat p-curve would merely show lack of evidential value, but it would be wrong to assume that p-hacking was not used.  To demonstrate this, I submitted the results from Bem’s (2011) infamous “feeling the future” article to a p-curve analysis (http://www.p-curve.com/).

[Figure: p-curve of Bem’s (2011) results (pcurve.bem.png)]

The p-curve analysis shows a flat p-curve.  This shows lack of evidential value under the assumption that questionable research practices were used to produce 9 out of 10 significant (p < .05, one-tailed) results.  However, there is no evidence that the results are p-hacked if we were to rely on a left-skewed p-curve as evidence for p-hacking.

One possibility would be that Bem did not p-hack his studies. However, this would imply that he ran 20 studies for each significant result. With sample sizes of 100 participants per study, this would imply that he tested 20,000 participants.  This seems unrealistic, and Bem states that he reported all studies that were conducted.  Moreover, analyses of the raw data showed peculiar patterns that suggest some form of p-hacking was used.  Thus, this example shows that p-curve is not very effective in revealing p-hacking.

It is also interesting that the latest version of the p-curve app (4.06) no longer tests for left-skewness of p-value distributions and no longer mentions p-hacking.  This change suggests that the authors realized the ineffectiveness of p-curve in detecting p-hacking (I did not ask the authors for comments, but they are welcome to comment here or elsewhere on this change in their app).

It is problematic if meta-analysts assume that p-curve can reveal p-hacking and infer from a flat or right-skewed p-curve that the data are not p-hacked.  This inference is not warranted because absence of evidence is not the same as evidence of absence.

Conclusion

P-curve is a family of statistical tests for meta-analyses of sets of studies.  One version is an effect size meta-analysis; others test the nil-hypothesis that the population effect size is zero.  The novel feature of p-curve is the assumption that questionable research practices undermine the validity of traditional meta-analyses, which assume no selection for significance. To correct for the assumed bias, observed test statistics are adjusted for selection (i.e., p-values between 0 and .05 are multiplied by 20 to produce p-values between 0 and 1 that can be analyzed like unbiased p-values).  Just like a regular meta-analysis, the main result of a p-curve analysis is a combined test statistic or effect size estimate that can be used to test the nil-hypothesis.  If the nil-hypothesis can be rejected, p-curve analysis suggests that some effect was observed.  Effect size p-curve also provides an effect size estimate for the set of studies that produced significant results.
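As a rough sketch of this logic (not the exact implementation of the p-curve app, and with made-up p-values for illustration), the rescaled p-values can be converted to z-scores and combined with Stouffer’s method:

p = c(.001, .002, .005, .010)                   # hypothetical significant p-values
pp = p / .05                                    # multiply by 20: p-values conditional on significance
Stouffer.Z = sum(qnorm(pp)) / sqrt(length(pp))
Stouffer.Z         # = -2.96; a negative value indicates right skew (evidential value)
pnorm(Stouffer.Z)  # = .0015; the nil-hypothesis that all significant results are false positives is rejected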

Just like regular meta-analyses, p-curve is not a bias test. It does not test whether publication bias exists, and it fails as a test of p-hacking under most circumstances. Unfortunately, users of p-curve seem to be confused about the purpose of p-curve or make the logical mistake of inferring from the presence of evidence that questionable research practices (publication bias, p-hacking) are absent. This is a fallacy.  To examine the presence of publication bias, researchers should use existing and validated bias tests.


