The File-Drawer Problem
A single study rarely provides sufficient evidence for a theoretically derived hypothesis. To make sense of inconsistent results across multiple studies, psychologists began to conduct meta-analyses. The key contribution of meta-analysis is that pooling evidence from multiple studies reduces sampling error and allows for more precise estimation of effect sizes.
The key problem of meta-analysis is that it rests on the assumption that the included studies are an unbiased sample of all studies that were conducted. Selective publishing of statistically significant results in favor of a prediction leads to inflated effect size estimates. This problem has been dubbed the file-drawer problem: whereas significant results are submitted for publication, non-significant results are put into a file drawer.
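To make the precision gain concrete, here is a minimal sketch of fixed-effect (inverse-variance) pooling; the effect sizes and sampling variances are invented for illustration and not taken from any real meta-analysis.

```python
import numpy as np

# Hypothetical study-level effect sizes (e.g., standardized mean differences)
# and their sampling variances; all values are invented for illustration.
effects = np.array([0.42, 0.31, 0.55, 0.18, 0.47])
variances = np.array([0.040, 0.055, 0.062, 0.035, 0.050])

# Fixed-effect (inverse-variance) pooling: weight each study by 1 / variance.
weights = 1.0 / variances
pooled_effect = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"Pooled effect: {pooled_effect:.3f}")
print(f"Pooled SE:     {pooled_se:.3f}")  # smaller than any single-study SE
print(f"Single-study SEs: {np.round(np.sqrt(variances), 3)}")
```

The pooled standard error shrinks as more studies are added, which is why meta-analytic estimates are more precise than any individual study.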
Rosenthal (1979) pointed out that a literature consisting entirely of statistically significant findings could, in principle, reflect no true effects. However, such a scenario was often considered unlikely under the assumption of honest testing with a fixed Type I error rate, because studies without real effects produce a significant result only about 1 out of 20 times when the error rate is controlled with alpha = .05. In addition, Rosenthal proposed a statistic, fail-safe N, to evaluate this risk, and meta-analyses often found that fail-safe N was large enough to infer a real effect.
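As a rough illustration of the logic behind fail-safe N, the commonly cited formula based on Stouffer's combined z asks how many unpublished studies averaging z = 0 would be needed to render the combined result non-significant. This sketch uses that textbook form, not Rosenthal's original worked example, and the z-values are invented.

```python
import numpy as np

# Hypothetical one-tailed z-values from k published significant studies (invented).
z_values = np.array([2.1, 2.5, 1.9, 2.8, 2.3, 2.0])
k = len(z_values)

# Stouffer's combined z for the published studies.
z_combined = z_values.sum() / np.sqrt(k)

# Commonly cited fail-safe N: number of additional studies averaging z = 0
# needed to pull the combined z below 1.645 (one-tailed alpha = .05).
fail_safe_n = z_values.sum() ** 2 / 1.645 ** 2 - k

print(f"Combined z for the published studies: {z_combined:.2f}")
print(f"Fail-safe N: {fail_safe_n:.1f}")
```

A large fail-safe N was typically read as evidence that the file drawer would have to be implausibly full to overturn the conclusion.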
The assessment of the published literature in psychology shifted dramatically in the early 2010s. Critically, Simmons, Nelson, and Simonsohn (2011) showed with simulations that a few statistical tricks could increase the probability of significant results without real effects from about 1 in 20 to nearly 1 in 2. Even consistent statistically significant results across several studies were no longer unlikely. Presenting nine significant results (Bem, 2011) would then not require 180 studies, but only about 18, or even fewer with more extreme use of questionable statistical practices that later became known as p-hacking. This highly cited article created a crisis of confidence in single studies, and by extension also in meta-analytic findings.
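A stripped-down simulation in the spirit of Simmons et al. (2011), though not their exact design, illustrates both points: two correlated dependent variables plus optional stopping already push the false positive rate well above 5%, and the expected number of attempts needed for nine significant results drops accordingly. The specific parameters (two DVs correlated at .5, n = 20 per group with one optional top-up) are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sample_group(n, r=0.5):
    """Two correlated dependent variables for one group (no true effect)."""
    shared = rng.normal(size=n)
    dv1 = np.sqrt(r) * shared + np.sqrt(1 - r) * rng.normal(size=n)
    dv2 = np.sqrt(r) * shared + np.sqrt(1 - r) * rng.normal(size=n)
    return dv1, dv2

def any_significant(g1, g2, alpha=0.05):
    """Test DV1, DV2, and their average; return True if any test is significant."""
    p_values = [
        stats.ttest_ind(g1[0], g2[0]).pvalue,
        stats.ttest_ind(g1[1], g2[1]).pvalue,
        stats.ttest_ind((g1[0] + g1[1]) / 2, (g2[0] + g2[1]) / 2).pvalue,
    ]
    return min(p_values) < alpha

def one_flexible_study(n=20):
    """One 'study' with two DVs and optional stopping, despite no true effect."""
    g1, g2 = sample_group(n), sample_group(n)
    if any_significant(g1, g2):
        return True
    # Optional stopping: add n more participants per group and test again.
    e1, e2 = sample_group(n), sample_group(n)
    g1 = (np.concatenate([g1[0], e1[0]]), np.concatenate([g1[1], e1[1]]))
    g2 = (np.concatenate([g2[0], e2[0]]), np.concatenate([g2[1], e2[1]]))
    return any_significant(g1, g2)

sims = 5000
fp_rate = np.mean([one_flexible_study() for _ in range(sims)])
print(f"False positive rate with flexible analysis: {fp_rate:.2f}")  # well above .05
print(f"Honest testing: about {9 / 0.05:.0f} attempts for 9 significant results")
print(f"Flexible analysis: about {9 / fp_rate:.0f} attempts for 9 significant results")
```

This subset of flexibilities does not reach the roughly 60% rate that Simmons et al. reported for the full combination of researcher degrees of freedom, but it already makes a run of significant null results far less surprising than the nominal alpha suggests.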
The False Positive Detection Problem
A few years later, Simonsohn, Nelson, and Simmons (2014) published a statistical method, p-curve, to probe the credibility of sets of statistically significant results, often drawn from meta-analyses. They referred to this method as a “key to the file drawer.” The analogy is potentially confusing. P-curve does not test whether publication bias exists, nor does it determine whether questionable statistical practices were used. It also does not estimate how many studies are in the proverbial file drawer. In fact, p-hacking implies that file drawers can be small, even if significant results are false positives. When p-hacking is present, the size of the file drawer is therefore no longer informative.
What p-curve does, and what it does better than fail-safe N, is to assess whether the observed set of statistically significant results is inconsistent with the hypothesis that all tested effects are exactly zero. Simonsohn et al. (2014) call this property evidential value. Formally, significance tests are applied to the distribution of significant p-values to evaluate the null hypothesis implied by this scenario. When this hypothesis can be rejected using conventional significance tests, the data are said to have evidential value. Later versions of p-curve also include stronger tests, but the underlying logic remains the same.
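The logic can be illustrated with a small simulation: under a true null hypothesis, the statistically significant p-values are uniformly distributed between 0 and .05 (a flat curve), whereas a real effect concentrates them near zero (a right skew). The binomial test on the share of significant p-values below .025 used here is only a simple diagnostic in the spirit of p-curve; the published procedure also relies on continuous, Stouffer-type tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def significant_pvalues(d, n=25, sims=20000):
    """Two-group t-tests with true standardized effect d; keep p-values < .05."""
    g1 = rng.normal(0.0, 1.0, size=(sims, n))
    g2 = rng.normal(d, 1.0, size=(sims, n))
    p = stats.ttest_ind(g1, g2, axis=1).pvalue
    return p[p < 0.05]

for d in (0.0, 0.5):
    p_sig = significant_pvalues(d)
    k_small = int((p_sig < 0.025).sum())
    # Under d = 0, significant p-values are uniform on (0, .05), so about 50%
    # fall below .025 (flat curve); under a real effect the share is much higher.
    result = stats.binomtest(k_small, len(p_sig), 0.5, alternative="greater")
    print(f"d = {d:.1f}: {k_small / len(p_sig):.2f} of significant p-values < .025, "
          f"binomial test p = {result.pvalue:.3g}")
```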
Equipped with a new diagnostic tool, psychologists had a different way to evaluate published studies. While p-curve still shares a limitation of significance testing—namely that it cannot provide affirmative evidence for the null hypothesis, such as the claim that all published significant results are false positives—it can nevertheless show that a set of studies fails to provide evidence against this extreme null hypothesis. Lack of evidence is still valuable information, especially when a set of statistically significant results appears to suggest strong support for a hypothesis, but this support is potentially driven by selective reporting or p-hacking rather than true effects.
P-Curve Coding
P-curve made it possible to evaluate the hypothesis that many, if not most (Ioannidis, 2005), published results are false positives. If this were the case, many p-curves of studies testing a specific hypothesis should be flat or left-skewed. In contrast, true hypotheses should produce right-skewed p-curves. Surprisingly, this simple approach to examining the false positive rate has not been applied systematically.
I conducted a review of p-curve articles to see what we have learned about false positives from a decade of p-curve analyses. The article introducing p-curve has been cited more than 1,000 times (Web of Science, December 30, 2025). I reviewed two samples of articles. First, I sampled the most highly cited articles. These articles are of interest because they may introduce many readers to p-curve, including readers who are not experts in meta-analysis. They are also more likely to report high-visibility results. The second sample consisted of the most recent articles. The rationale is that recent articles reflect current practice in how p-curve is used and how its results are interpreted.
P-curve results were coded first in terms of evidential value (right-skewed vs. not right-skewed). The second classification concerned the proper interpretation of right-skewed p-curves. Correct interpretations were limited to claims about evidential value. However, some articles misinterpreted p-curve as a bias test and falsely inferred a low risk of bias from a p-curve with evidential value.
The coding scheme had three categories. First, articles that did not report a p-curve analysis were coded as irrelevant. Second, articles that reported a p-curve analysis and correctly limited discussion to evidential value were coded as correct. Third, articles that reported a p-curve analysis but made invalid claims about bias, selection bias, or p-hacking were coded as incorrect. These articles interpreted results showing evidential value to conclude that publication bias or p-hacking was not a concern. This conclusion is invalid because data can show evidential value while biases nevertheless inflate effect size estimates.
Articles were found using Web of Science. Articles classified as editorial material were excluded. The list of coded articles and their coding is available in the Open Science Framework (OSF) project.
P-Curve Results
I coded 142 articles. A large number of them (k = 95; 67%) cited the p-curve article but did not report a p-curve analysis. An additional two articles stated that a p-curve analysis had been conducted but did not provide a clear description of the results. All 45 articles that reported a p-curve analysis found evidential value. Some of these articles showed flat p-curves for specific subsets of studies, but this pattern was attributed to theoretically meaningful moderators (e.g., Tucker-Drob & Bates, 2015). Importantly, none of the reviewed p-curve analyses suggested that all reported results were false positives.
To further probe this issue, I conducted an automated search of all 990 abstracts retrieved from Web of Science for references to p-curve results indicating no evidential value or flat p-curves. This search did not identify a single abstract reporting such a result.
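A search of this kind can be implemented as a simple keyword screen of exported abstracts. The sketch below shows one way to do it; the file name wos_abstracts.csv, the column names, and the exact phrases are illustrative assumptions rather than the procedure actually used.

```python
import csv
import re

# Hypothetical phrases signalling a p-curve result without evidential value.
PATTERNS = [
    r"no evidential value",
    r"lack(?:ed|s)? evidential value",
    r"flat p-?curve",
    r"left-?skewed p-?curve",
]
regex = re.compile("|".join(PATTERNS), flags=re.IGNORECASE)

hits = []
with open("wos_abstracts.csv", newline="", encoding="utf-8") as f:  # assumed export file
    for row in csv.DictReader(f):
        abstract = row.get("Abstract", "") or ""
        if regex.search(abstract):
            hits.append(row.get("Title", "untitled"))

print(f"Abstracts mentioning a flat or non-evidential p-curve: {len(hits)}")
for title in hits:
    print(" -", title)
```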
In terms of interpretations, the results are also notable. More articles misrepresented p-curve as a bias test (k = 28) than correctly presented p-curve as a test of evidential value. Because p-curves were almost always right-skewed, these misinterpretations frequently led authors to infer a low risk of bias, which is not a valid inference from a right-skewed p-curve.
In one case, p-curve was even used to discount evidence of bias obtained with other methods: “Funnel plot analysis showed evidence of publication bias, but p-curve analysis suggested that our results could not be caused by selective reporting” (Goubran et al., 2025).
Discussion
Two influential theoretical articles raised concerns that many published rejections of null hypotheses could be false positive results (Ioannidis, 2005; Simmons et al., 2011). P-curve provides an opportunity to evaluate this prediction empirically, but the evidence obtained in p-curve meta-analyses has not been systematically examined. I found that p-curve results always showed evidential value. This finding is in stark contrast to scenarios that suggest the majority of statistically significant results are false (Ioannidis, 2005).
At the same time, p-curve is often misunderstood and misinterpreted as a bias test. This interpretation may lead to a false sense of the credibility of published results. Just as replication failures do not justify the inference that the original result was a false positive (Maxwell et al., 2006), evidential value does not imply that results can be replicated.
There are several possible explanations for the failure to find evidence of false positive results in meta-analyses. One explanation is that false positives are more likely to arise in individual studies than in meta-analyses, which require multiple studies testing the same hypothesis. Sustaining a literature of false positives would therefore require repeated and consistent use of extremely questionable research practices. Few researchers may be motivated to use extreme p-hacking repeatedly to force significant results in the absence of a real effect. Bem (2011) may represent an unusual case in that he appeared to be highly motivated to convince skeptical scientists of the existence of extrasensory perception and to present evidence that met prevailing methodological standards in experimental social psychology. More commonly, researchers may advance claims based on selective or suggestive evidence without attempting to build a cumulative evidential record.
Another explanation is that the statistical null hypothesis is unlikely to be true (Cohen, 1994). What are the chances that an experimental manipulation has no effect whatsoever on behavior? Subliminal stimuli are often cited as candidates, but even in this literature concerns have been raised that effects may be driven by partial stimulus detection. In correlational research, it is even less likely that two variables have a true correlation of exactly zero. As a result, p-hacking may often inflate effect sizes rather than generate false positive results in the strict sense of rejecting a true null hypothesis.
The problem arises when rejection of the nil hypothesis is confused with credible evidence for a meaningful effect. For example, a p-curve analysis of ego depletion shows evidential value (Carter et al., 2019), but even the original authors were unable to replicate the effect (Vohs et al., 2019). This example illustrates that evidential value is a necessary but not sufficient condition for a credible science. Even if effect sizes are not exactly zero, they can be dramatically inflated. As p-curve is limited to the assessment of evidential value, other methods are required to (a) assess whether published results are biased by selection or p-hacking, (b) estimate population effect sizes while correcting for bias, and (c) estimate the false positive risk in heterogeneous meta-analyses, where a subset of statistically significant results may be false positives.
However, it is also possible that p-curve results are biased and provide spurious evidence of evidential value, that is, evidential value itself may constitute a meta-level false positive. In this case, p-curve would falsely reject the null hypothesis that all statistically significant results are false positives. One possible source of bias is that studies with stronger (but false) evidence may be more likely to be included in meta-analyses than studies with weaker (false) evidence. For example, some researchers may p-hack to more stringent thresholds (e.g., α = .01) or apply Bonferroni corrections, while standard meta-analytic coding practices may mask these selection processes. However, p-hacking of this kind would be expected to produce left-skewed or flat p-curves, such that explaining the near-absence of flat p-curves would require the additional assumption that extreme p-hacking is rare. At present, this possibility cannot be ruled out, but it appears unlikely to account for the overwhelming predominance of right-skewed p-curves.
A more plausible explanation is selective reporting of p-curve results. Because reporting p-curve analyses is optional, meta-analysts may be more likely to include p-curve results when they show evidential value and omit them when p-curves are flat or left-skewed. Evaluating this form of meta-analytic selection bias requires auditing meta-analyses that did not report p-curve results and applying the method retrospectively.
Conclusion
The most important finding is that concerns about many false positive results in psychology journals are not based on empirical evidence. False positives in single studies are not important because no single study can serve as an empirical foundation for a theory. There is no evidence that entire literatures are just a collection of false positive results. This does not mean that published results are credible. Publication bias, inflation of effect sizes, low replicability, method factors in correlational studies, and lack of construct validation remain serious obstacles that have sometimes been overshadowed by concerns about false positive results. These issues deserve more attention in the future.
This is a nice account from a purely frequentist viewpoint. But it is now nearly 40 years since Berger & Sellke (1987) published “Testing a Point Null Hypothesis: The Irreconcilability of P-Values and Evidence,” Journal of the American Statistical Association, 82, 112-122. This paper showed that, if you have observed p = 0.049, the minimum risk that this is a false positive is 29%, rather than 5%. Of course this calculation makes different assumptions from the usual frequentist result, but they are entirely plausible assumptions. The fact that there is such a big difference means, in my view, that science is harder than most people think. Even a simple likelihood argument shows that p-values are often wildly overoptimistic. Fisher so dominated the field for many decades that his approach has become nearly universal, but I’d argue that some of its popularity results from the ease with which it gives “statistical significance.”
See also
Benjamin, D. J., & Berger, J. O. (2019). Three recommendations for improving the use of p-values. The American Statistician, 73(sup1), 186-191. DOI: 10.1080/00031305.2018.1543135
Colquhoun, D. (2019). The false positive risk: A proposal concerning what to do about p-values. The American Statistician, 73(sup1), 192-201. DOI: 10.1080/00031305.2018.1529622
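For readers who want to reproduce the 29% figure, a closely related calibration, the minimum Bayes factor bound of −e·p·ln(p) (Sellke, Bayarri, & Berger, 2001), combined with prior odds of 1 on the null, yields roughly the same number. This is a sketch of that calculation, not the exact derivation in Berger & Sellke (1987).

```python
import math

def min_false_positive_risk(p, prior_h0=0.5):
    """Lower bound on P(H0 | data) using the -e * p * ln(p) bound on the
    Bayes factor in favour of H0 (valid for p < 1/e)."""
    bf_h0 = -math.e * p * math.log(p)      # minimum Bayes factor for H0
    prior_odds = prior_h0 / (1 - prior_h0)
    posterior_odds = bf_h0 * prior_odds
    return posterior_odds / (1 + posterior_odds)

print(f"{min_false_positive_risk(0.049):.2f}")  # about 0.29, matching the figure above
```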
Risk is different from rate. To make a type-I error (i.e., report a false positive result), you first have to test a true null hypothesis. Claims of high false positive rates often rely on the assumption that only 1 out of 10 or fewer of the tested hypotheses are true (i.e., that H0 is false in at most 1 out of 10 tests). This may be the problem with false positive paranoia. Because we are usually only testing whether there is a non-null effect in one direction or the other, most tested hypotheses are likely to be true.
The real problems are, of course, low power, trivial effect sizes, inflated estimates, and low replicability; true null hypotheses, like the one in Bem's ESP studies, are probably rare.
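A short calculation makes explicit how strongly the false positive risk among significant results depends on the assumed share of true nulls; the alpha, power, and prevalence values below are illustrative assumptions rather than estimates from the literature.

```python
def false_positive_risk(prop_true_nulls, alpha=0.05, power=0.80):
    """Share of significant results that are false positives, given the
    proportion of tested hypotheses for which the null is actually true."""
    false_pos = prop_true_nulls * alpha
    true_pos = (1 - prop_true_nulls) * power
    return false_pos / (false_pos + true_pos)

# If 90% of tested hypotheses are true nulls (the pessimistic scenario):
print(f"{false_positive_risk(0.90):.2f}")   # about 0.36
# If only 10% of tested hypotheses are true nulls (most effects are real):
print(f"{false_positive_risk(0.10):.2f}")   # about 0.01
```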