
Z-Curve: An even better p-curve

Abstract

P-Curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis (evidential value). Here I show that p-curve has serious limitations and provides misleading information about the strength of evidence against the null-hypothesis.

I show that all of the information provided by a p-curve analysis is also provided by a z-curve analysis. Moreover, z-curve provides additional information about the presence of selection bias and the risk of false positive results. I also show how alpha levels can be adjusted to separate significant results with weak evidence from those with strong evidence and to identify credible findings even when selection for significance is present.

As z-curve does everything that p-curve does and more, the rational choice is to use z-curve for the meta-analysis of p-values.

Introduction

In 2011, it dawned on psychologists that something was wrong with their science. Daryl Bem had just published an article with nine studies that showed an incredible finding. Participants’ responses were influenced by random events that had not yet occurred. Since then, the flaws in research practices have become clear and it has been shown that they are not limited to mental time travel (Schimmack, 2020). For decades, psychologists assumed that statistically significant results reveal true effects and reported only statistically significant results (Sterling, 1959). However, selective reporting of significant results undermines the purpose of significance testing to distinguish true and false hypotheses. If only significant results are reported, most published results could be false positive results like those reported by Bem (2011).

Selective reporting of significant results also undermines the credibility of meta-analyses (Rosenthal, 1979), which explains why meta-analyses also suggest humans possess psychic abilities (Bem & Honorton, 1994). This sad state of affairs stimulated renewed interest in methods that detect selection for significance (Schimmack, 2012) and methods that correct for publication bias in meta-analyses. Here I focus on a comparison of p-curve (Simonsohn et al., 2014a, Simonsohn et al., 2014b) and z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020).

P-Curve

P-curve is a name for a family of statistical tests that have been combined into the p-curve app, which researchers can use to conduct p-curve analyses, henceforth called p-curve. The latest version of p-curve is 4.06, which was last updated on November 30, 2017 (p-curve.com).

The first part of a p-curve analysis is a p-curve plot. A p-curve plot is a histogram of all significant p-values where p-values are placed into five bins, namely p-values ranging from 0 to .01, .01 to .02, .02 to .03, .03 to .04, and .04 to .05. If the set of studies contains mostly studies with true effects that have been tested with moderate to high power, the plot shows decreasing frequencies as p-values increase (more p-values between 0 and .01 than between .04 and .05). This pattern has been called a right-skewed distribution by the p-curve authors. If the distribution is flat or reversed (more p-values between .04 and .05 than between 0 and .01), most p-values may be false positive results.
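To make the binning concrete, here is a minimal R sketch (not the code of the p-curve app) that sorts a set of hypothetical significant p-values into the five bins of a p-curve plot:

# Hypothetical p-values used only for illustration
p_values <- c(.003, .008, .012, .024, .031, .041, .049, .002)

# Keep the significant results and assign them to the five p-curve bins
sig  <- p_values[p_values < .05]
bins <- cut(sig, breaks = c(0, .01, .02, .03, .04, .05),
            labels = c("0-.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05"))

# Percentage of significant p-values in each bin (the bars of a p-curve plot)
round(100 * table(bins) / length(sig))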

The main limitation of p-curve plots is that it is difficult to evaluate ambiguous cases. To aid in the interpretation of p-curve plots, p-curve also provides statistical tests of evidential value. One test is a significance test against the null-hypothesis that all significant p-values are false positive results. If this null-hypothesis can be rejected with the traditional alpha criterion of .05, it is possible to conclude that at least some of the significant results are not false positives.

The main problem with significance tests is that they do not provide information about effect sizes. A right-skewed p-curve with a significant test result may be due to weak evidence with many false positive results or to strong evidence with few false positives.

To address this concern, the p-curve app also provides an estimate of statistical power. This estimate assumes that the studies in the meta-analysis are homogeneous because power is a conditional probability under the assumption that an effect is present. Thus, power does not apply to a meta-analysis of studies that contain true positive and false positive results because power is not defined for false positive results.

To illustrate the interpretation of a p-curve analysis, I conducted a meta-analysis of all studies published by Leif D. Nelson, one of the co-authors of p-curve. I found 119 studies with codable data and coded the most focal hypothesis for each of these studies. I then submitted the data to the online p-curve app. Figure 1 shows the output.

Visual inspection of the p-curve plot shows a right-skewed distribution with 57% of the p-values between 0 and .01 and only 6% of p-values between .04 and .05. The statistical test against the null-hypothesis that all of the significant p-values are false positives is highly significant. Thus, at least some of the p-values are likely to be true positives. Finally, the power estimate is very high, 97%, with a tight confidence interval ranging from 96% to 98%. Somewhat redundant with this information, the p-curve app also provides a significance test of the hypothesis that power is less than 33%. This test is not significant, which is not surprising given the estimated power of 97%.

The next part of a p-curve output provides more details about the significance tests, but does not add more information.

The next part provides users with an interpretation of the results.

The interpretation informs readers that this set of p-values provides evidential value. Somewhat surprisingly, this automated interpretation does not mention the power estimate to quantify the strength of evidence. The focus on p-values is problematic because p-values are influenced by the number of tests. The p-value could be lower with 100 studies with 40% power than with 10 studies with 99% power. As significance tests are redundant with confidence intervals, it is sufficient to focus on the confidence interval of the power estimate. With a 90% confidence interval ranging from 96% to 98%, we would be justified in concluding that this set of p-values provides strong support for the hypotheses tested in Nelson’s articles.

Z-Curve

Like p-curve, a z-curve analysis also starts with a plot of the p-values. The main difference is that p-values are converted into z-scores using the inverse normal distribution: z = qnorm(1 - p/2). The second difference is that both significant and non-significant p-values are plotted. The third difference is that z-curve plots have a much finer resolution than p-curve plots. Whereas p-curve bins all z-scores from 2.58 to infinity into one bin (p < .01), z-curve uses the information about the distribution of z-scores all the way up to z = 6 (p = .000000002, or about 1/500,000,000).
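In R, this conversion is a one-liner. The following sketch uses made-up p-values to illustrate the transformation and the two reference points mentioned above (the z-curve model itself is fit with dedicated software, such as the zcurve R package, and is not shown here):

# Convert two-sided p-values to absolute z-scores (made-up example values)
p <- c(.05, .01, .005, .0001)
z <- qnorm(1 - p / 2)
round(z, 2)          # 1.96 2.58 2.81 3.89

# Reference points used in the z-curve plot
qnorm(1 - .05 / 2)   # significance threshold: z = 1.96
2 * (1 - pnorm(6))   # z = 6 corresponds to p of about 2e-9, i.e., ~1/500,000,000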

Visual inspection of the z-curve plot reveals something that the p-curve plot does not show, namely clear evidence for the presence of selection bias. Whereas p-curve suggests that “highly” significant results (0 to .01) are much more common than “just” significant results (.04 to .05), z-curve shows that just significant results (p between .05 and .005) are much more frequent than highly significant (p < .005) results. The difference is due to the implicit definition of “high” and “low” in the two plots. The high frequency of highly significant (p < .01) results in the p-curve plot is due to the wide range of values that are lumped together into this bin. Once it is clear that many p-values are clustered just below .05 (z > 1.96, the vertical red line), it is immediately notable that there are too few just non-significant (z < 1.96) values. This steep drop is not consistent with random sampling error. To summarize, z-curve plots provide more information than p-curve plots. Whereas z-curve plots make the presence of selection for significance visible, p-curve plots provide no means to evaluate selection bias. Even worse, right-skewed distributions are often falsely interpreted as evidence that there is no selection for significance. This example shows that notably right-skewed distributions can be found even when selection bias is present.

The second part of a z-curve analysis uses a finite mixture model to estimate two statistical parameters of the data. These parameters are called the estimated discovery rate and the estimated replication rate (Bartos & Schimmack, 2021). Another term for these parameters is mean power before selection and mean power after selection for significance (Brunner & Schimmack, 2020). The meaning of these terms is best understood with a simple example in which a researcher tests 100 false hypotheses and 100 true hypotheses with 100% power. These tests produce a mix of significant and non-significant p-values. The expected frequency of significant p-values is 100 for the 100 true hypotheses tested with 100% power and 5 for the 100 false hypotheses, which produce significant results at the alpha rate of 5%. Thus, we expect 105 significant results and 95 non-significant results. Although we know the percentages of true and false hypotheses in this example, this information is not available with real data. Thus, any estimate of average power changes the meaning of power: it now includes false hypotheses with a power equal to alpha. We call this unconditional power to distinguish it from the typical meaning of power, which is conditional on a true hypothesis.

It is now possible to compute mean unconditional power for two populations of studies. The first population consists of all studies that were conducted. In this example, these are all 200 studies (100 true, 100 false hypotheses). The average power for these 200 studies is easy to compute as (.05*100 + 1*100)/200 = 52.5%. The second population consists only of the studies with significant results. After selecting only significant studies, mean unconditional power is (.05*5 + 1*100)/105 = 95.5%. The reason why power is so much higher after selection for significance is that the significance filter keeps most false hypotheses out of the population of studies with a significant result (95 of the 100 false hypotheses, to be exact). Thus, power is mostly determined by the true hypotheses that were tested with perfect power. Of course, real data are not as clean as this simple example, but the same logic applies to all sets of studies with a diverse range of power values for individual studies (Brunner & Schimmack, 2020).
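The two averages can be verified with a few lines of R. The sketch below simply encodes the assumptions of the example (100 false hypotheses with unconditional power equal to alpha, 100 true hypotheses with 100% power); the weighting in the second step reflects that each study enters the set of significant results with a probability equal to its power:

alpha <- .05
power <- c(rep(alpha, 100), rep(1, 100))   # unconditional power of all 200 studies

mean(power)                        # before selection: (.05*100 + 1*100)/200 = 0.525
weighted.mean(power, w = power)    # after selection for significance: ~0.955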

Mean power before selection for significance determines the percentage of significant results for a given number of tests. With 50% mean power before selection, 100 tests are expected to produce 50 significant results (Brunner & Schimmack, 2020). It is common to refer to statistically significant results as discoveries (Soric, 1989). Importantly, discoveries can be true or false, just like a significant result can reflect a true effect or a type-I error. In our example, there were 105 discoveries. Normally we would not know that 100 of these discoveries are true discoveries. All we know is the percentage of significant results. I use the term estimated discovery rate (EDR) because “mean unconditional power before selection” is a mouthful. In short, the EDR is an estimate of the percentage of significant results in a series of statistical tests.

Mean power after selection for significance is relevant because the power of significant results determines the probability that a significant result can be successfully replicated in a direct replication study with the same sample size (Brunner & Schimmack, 2020). Using the EDR for this purpose would be misleading. In the present example, the EDR of 52.5% would dramatically underestimate the replicability of the significant results, which is actually 95.5%. Using the EDR would also punish researchers who conduct high-powered tests of true and false hypotheses. To assess replicability, it is necessary to compute power only for the studies that produced significant results. However, conditioning the data on significance inflates estimates even if the researcher reported all non-significant results, just as selection for significance inflates effect size estimates in traditional meta-analyses. Z-curve models this selection process and corrects for regression to the mean in the estimation of mean unconditional power after selection for significance. I call this statistic the estimated replication rate (ERR) because mean unconditional power after selection for significance determines the percentage of significant results that is expected in direct replications of studies with a significant result. In short, the ERR is the probability that a direct replication study with the same sample size produces a significant result.

I start the discussion of the z-curve results for Nelson’s data with the estimated replication rate because this estimate is conceptually similar to the power estimate in the p-curve analysis. Both estimates focus on the population of studies with significant results and correct for selection for significance. Thus, one would expect similar results. However, the p-curve estimate of 97%, 95%CI = 96% to 98%, is very different from the z-curve estimate of 52%, 95%CI = 40% to 68%. The confidence intervals do not overlap, showing that the difference between these estimates is itself statistically significant.

The explanation for this discrepancy is that p-curve estimates are inflated estimates of the ERR when power is heterogeneous (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). This is true even if effect sizes are homogeneous and studies vary only in sample sizes (Brunner, 2018). The p-curve authors have been aware of this problem since 2018 (Datacolada), but have not updated the p-curve app in response to this criticism. The present example shows that using the p-curve app can lead to extremely misleading conclusions. Whereas p-curve suggests that nearly every study by Nelson would produce a significant result again in a direct replication attempt, the correct z-curve estimate suggests that only every other result would replicate successfully. This difference is not only statistically significant, but also practically significant for the evaluation of Nelson’s work.

In sum, p-curve is not only redundant with z-curve; it also produces false information about the strength of evidence in a set of p-values.

Unlike p-curve, z-curve 2.0 also estimates the discovery rate based on the distribution of the significant p-values. The results are shown in Figure 2 as the grey curve in the range of non-significant results. As can be seen, z-curve predicts a large number of non-significant results, whereas the actual studies reported very few non-significant results. This discrepancy suggests selection for significance. To quantify the amount of selection bias, it is possible to compare the observed discovery rate (i.e., the actual percentage of significant results), 87%, to the estimated discovery rate, EDR = 27%. The 95% confidence interval around the EDR can be used for a significance test. As 87% is well outside the 95%CI of the EDR, 5% to 51%, the results provide strong evidence that the reported results were selected from a larger set of tests with non-significant results that were not reported. In this specific case, this inference is consistent with the authors’ admission that questionable research practices were used (Simmons, Nelson, & Simonsohn, 2011).

“Our best guess was that so many published findings were false because researchers were conducting many analyses on the same data set and just reporting those that were statistically significant, a behavior that we later labeled “p-hacking” (Simonsohn, Nelson, & Simmons, 2014). We knew many researchers—including ourselves—who readily admitted to dropping dependent variables, conditions, or participants to achieve significance.” (Simmons, Nelson, & Simonsohn, 2018, p. 255)

The p-curve authors also popularized the idea that selection for significance may have produced many false positive results (Simmons et al., 2011). However, p-curve does not provide an estimate of the false positive risk. In contrast, z-curve provides this information because the false discovery risk is a direct function of the discovery rate. Using the EDR with Soric’s formula shows that the false discovery risk for Nelson’s studies is 14%, but due to the small number of tests, the 95%CI around this estimate ranges from 5% to 100%. Thus, even though the ERR suggests that half of the studies can be replicated, it is possible that the other half of the studies contains a fairly large number of false positive results. Without the identification of moderator variables, it would be impossible to say whether a result is a true or a false discovery.
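The calculation uses Soric’s (1989) formula for the maximum false discovery rate implied by a given discovery rate, FDR = (1/DR - 1) * alpha / (1 - alpha). A minimal R version (the function name soric_fdr is just an illustrative label, not part of any package), applied to the EDR of 27% reported above:

# Soric's upper bound on the false discovery rate for a given discovery rate
soric_fdr <- function(edr, alpha = .05) {
  (1 / edr - 1) * alpha / (1 - alpha)
}

round(soric_fdr(.27), 2)   # 0.14, i.e., a false discovery risk of about 14%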

The ability to estimate the false positive risk makes it possible to identify a subset of studies with a low false positive risk by lowering alpha. Lowering alpha reduces the false positive risk for two reasons. First, a lower alpha directly produces a lower false positive rate. In the prior example with 100 true and 100 false hypotheses, an alpha of 5% produced 105 significant results that included 5 false positive results, so the false positive rate was 5/105 = 4.76%. Lowering alpha to 1% produces only 101 significant results, and the false positive rate is 1/101 ≈ 0.99%. Second, questionable research practices are much more likely to produce false positive results with alpha = .05 than with alpha = .01.
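The first point can be verified with a short R sketch that recomputes the false positive rate among discoveries for the 100-true/100-false example at both alpha levels. It keeps the example’s assumption that the true hypotheses are tested with 100% power, so the number of true discoveries does not change when alpha is lowered (the function name fp_rate is just an illustrative label):

# Expected share of false positives among discoveries
fp_rate <- function(alpha, n_true = 100, n_false = 100, pwr = 1) {
  false_pos <- alpha * n_false   # expected false positive results
  true_pos  <- pwr * n_true      # expected true positive results
  false_pos / (false_pos + true_pos)
}

round(fp_rate(.05), 3)   # 0.048: 5 of 105 discoveries are false positives
round(fp_rate(.01), 3)   # 0.010: 1 of 101 discoveries is a false positive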

In a z-curve analysis, alpha can be set to different values to examine the false positive risk. A reasonable criterion is to aim for a false discovery risk of 5%, which many psychologists falsely assume is the goal of setting alpha to 5%. For Nelson’s 109 publications, alpha can be lowered to .01 to achieve a false discovery risk of 5%.

With alpha = .01, there are still 60 out of 119 (50%) significant results. It is therefore not necessary to dismiss all of the published results just because some of them were obtained with questionable research practices.

For Nelson’s studies, a plausible moderator is timing. As Nelson and colleagues reported, he used QRPs before he himself drew attention to the problems with these practices. In response, he may have changed his research practices. To test this hypothesis, it is possible to fit separate z-curve models to articles published up to 2012 and after 2012 (due to publication lag, articles published in 2012 are likely to still contain QRPs).

Consistent with this hypothesis, the EDR for 2012 and earlier is only 11%, 95%CI = 5% to 31%, and the false discovery risk increases to 42%, 95%CI = 12% to 100%. Even with alpha = .01, the false discovery risk is still 11%, and with alpha = .005 it is still 10%. With alpha = .001, it is reduced to 2%, and 18 results remain significant. Thus, most of the published results from this period lack credible evidence against the null-hypothesis.

Results look very different after 2012. The EDR is 83% and does not differ significantly from the ODR, providing no evidence that selection for significance occurred. The high EDR implies a low false discovery risk even with the conventional alpha criterion of 5%. Thus, all 40 results with p < .05 provide credible evidence against the null-hypothesis.

To see how misleading p-curve can be, I also conducted a p-curve analysis of the studies published in the years up to 2012. The p-curve analysis merely shows that the studies have evidential value and provides a dramatically inflated estimate of power (84% vs. 35%). It does not show that p-values were selected for significance, and it does not provide the information needed to distinguish p-hacked studies from studies with evidential value.

Conclusion

P-Curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis (evidential value). Here I showed that p-curve has serious limitations and provides misleading information about the strength of evidence against the null-hypothesis.

I showed that all of the information provided by a p-curve analysis is also provided by a z-curve analysis. Moreover, z-curve provides additional information about the presence of selection bias and the risk of false positive results. I also showed how alpha levels can be adjusted to separate significant results with weak evidence from those with strong evidence and to identify credible findings even when selection for significance is present.

As z-curve does everything that p-curve does and more, the rational choice is to use z-curve for the meta-analysis of p-values.

When DataColada kissed Fiske’s ass to publish in Annual Review of Psychology

One of the worst articles about the decade of replication failures is the “Psychology’s Renaissance” article by the datacolada team (Leif Nelson, Joseph Simmons, & Uri Simonsohn).

This is not your typical Annual Review article that aims to give a review of developments in the field. It is an opinion piece filled with bold claims that lack empirical evidence.

The worst claim is that p-hacking is so powerful that pretty much every study can be made to work.

Experiments that work are sent to a journal, whereas experiments that fail are sent to the file drawer (Rosenthal 1979). We believe that this “file-drawer explanation” is incorrect. Most failed studies are not missing. They are published in our journals, masquerading as successes.

We can all see that not publishing failed studies is a bit problematic. Even Bem’s famous manual for p-hackers warned that it is unethical to hide contradictory evidence: “The integrity of the scientific enterprise requires the reporting of disconfirming results” (Bem). Thus, the idea that researchers are sitting on a pile of failed studies that they failed to disclose makes psychologists look bad, and we can’t have that in Fiske’s Annual Review of Psychology journal. So psychologists must have been doing something that is not dishonest and can be sold as normal science.

“P-hacking is the only honest and practical way to consistently get underpowered studies to be statistically significant. Researchers did not learn from experience to increase their sample sizes precisely because their underpowered studies were not failing.” (p. 515).

This is utter nonsense. First, researchers have file-drawers of studies that did not work. Just ask them and they may tell you that they do.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

Leading social psychologists Gilbert and Wilson provide an even more detailed account of their research practices, which produced many non-significant results that were not reported (a.k.a. a file drawer). Their account has been preserved thanks to Greg Francis.

First, it’s important to be clear about what “publication bias” means. It doesn’t mean that anyone did anything wrong, improper, misleading, unethical, inappropriate, or illegal. Rather it refers to the well known fact that scientists in every field publish studies whose results tell them something interesting about the world, and don’t publish studies whose results tell them nothing. Francis uses sophisticated statistical tools to discover what everyone already knew—and what he could easily have discovered simply by asking us. Yes, of course we ran some studies on “consuming experience” that failed to show interesting effects and are not reported in our JESP paper. Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know. Failed studies are often (though not always) inconclusive, which is why they are often (but not always) unpublishable. So yes, we had to mess around for a while to establish a paradigm that was sensitive and powerful enough to observe the effects that we had hypothesized. In one study we might have used foods that didn’t differ sufficiently in quality, in another we might have made the metronome tick too fast for people to chew along. Exactly how good a potato chip should be and exactly how fast a person can chew it are the kinds of mundane things that scientists have to figure out in preliminary testing, and they are the kinds of mundane things that scientists do not normally report in journals (but that they informally share with other scientists who work on similar phenomenon). Looking back at our old data files, it appears that in some cases we went hunting for potentially interesting mediators of our effect (i.e., variables that might make it larger or smaller) and although we replicated the effect, we didn’t succeed in making it larger or smaller. We don’t know why, which is why we don’t describe these blind alleys in our paper. All of this is the hum-drum ordinary stuff of day-to-day science.

Aside from this anecdotal evidence, the datacolada crew actually had access to empirical evidence in an article that they cite, but maybe never read. An important article in the 2010s reported a survey of research practices (John, Loewenstein, & Prelec, 2012). The survey asked about several questionable research practices, including not reporting entire studies that failed to support the main hypothesis.

Not reporting studies that “did not work” was the third most frequently used QRP. Unfortunately, this result contradicts datacolada’s claim that there are no studies in file-drawers, so they ignore this inconvenient empirical fact to tell their fairy tale of honest p-hackers who didn’t know better until 2011, when they published their famous “False Positive Psychology” article.

This is a cute story that isn’t supported by evidence, but that has never stopped psychologists from writing articles that advance their own careers. The beauty of review articles is that you don’t even have to p-hack data. You just pick and choose citations or make claims without evidence. As long as the editor (Fiske) likes what you have to say, it will be published. Welcome to psychology’s renaissance; same bullshit as always.