Replicability Index: A Blog by Dr. Ulrich Schimmack

Blogging about statistical power, replicability, and the credibility of statistical results in psychology journals since 2014. Home of z-curve, a method to examine the credibility of published statistical results.

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY

In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication with the same sample size and significance criterion (Schimmack, 2017).

See Reference List at the end for peer-reviewed publications.

Mission Statement

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.

To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0: Brunner & Schimmack, 2020; Version 2.0: Bartoš & Schimmack, 2022).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science. 

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I also started providing information about the replicability of individual researchers and provide guidelines on how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in the story of how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1-22. https://doi.org/10.15626/MP.2018.874

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. http://dx.doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Publication Bias: The Caliper Test

Replicability Index Encyclopedia: Caliper Test

The caliper test is a statistical method for detecting publication bias introduced by Gerber and Malhotra (2008a, 2008b). It tests whether the distribution of test statistics is continuous and approximately locally symmetric around a significance threshold, typically z = 1.96, corresponding to p = .05. The key assumption is that, in the absence of publication bias or p-hacking, the expected density of z-scores in a narrow band just above the threshold should be approximately equal to the expected density just below it. A significant excess of results just above the threshold suggests that researchers or publication processes have shifted results across the boundary, either through selective reporting or analytical flexibility.

Procedure

Published p-values are converted to z-scores (z = Φ⁻¹(1 − p/2)). A caliper of width w is placed symmetrically around the threshold, creating two bins: one from 1.96 to 1.96 + w (just significant) and one from 1.96 − w to 1.96 (just nonsignificant). Under the null hypothesis of no bias, the counts in the two bins should be equal. The test is conducted as a one-sided binomial test with expected probability 0.50. Gerber and Malhotra (2008a) recommended bandwidths of 5%, 10%, 15%, and 20% of the threshold value. A 10% caliper around z = 1.96, for example, compares counts in the intervals [1.764, 1.96) and [1.96, 2.156].
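
For concreteness, here is a minimal sketch of the procedure in Python (the function name, the defaults, and the use of scipy’s binomial test are my choices, not part of the original papers):

```python
import numpy as np
from scipy import stats

def caliper_test(p_values, threshold=1.96, caliper=0.10):
    """Caliper test sketch: convert two-sided p-values to z-scores,
    count results just below and just above the threshold, and run a
    one-sided binomial test against the null of equal expected counts."""
    z = stats.norm.ppf(1 - np.asarray(p_values) / 2)  # two-sided p to z
    w = caliper * threshold                           # caliper width
    over = int(np.sum((z >= threshold) & (z < threshold + w)))
    under = int(np.sum((z >= threshold - w) & (z < threshold)))
    test = stats.binomtest(over, over + under, p=0.5, alternative='greater')
    return over, under, test.pvalue
```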

Applications

Gerber and Malhotra applied the caliper test to leading political science journals (APSR, AJPS) and sociology journals (ASR, AJS) and found strong evidence of publication bias (Gerber & Malhotra, 2008a; Gerber & Malhotra, 2008b). The test was subsequently adopted in economics, most notably by Brodeur, Lé, Sangnier, and Zylberberg (2016) and Brodeur, Cook, and Heyes (2020), who documented significant bunching of test statistics just above conventional thresholds across top economics journals. Berning and Weiß (2016) applied the caliper test to German social science journals, again finding evidence of bias. The test has become a standard tool in the meta-science toolkit for discipline-wide assessments of publication practices.

Strengths

The caliper test has several practical advantages. The logic is intuitive and easy to communicate. It requires only test statistics or p-values, not standardized effect sizes, making it applicable to heterogeneous literatures where effect-size metrics vary across studies and designs. For discipline-wide analyses where studies address different research questions with different effects, the caliper test avoids the strong assumptions about comparability or homogeneity required by many other methods.

Limitations

The caliper test’s local-symmetry assumption is exact for normally distributed z-values only when the noncentrality parameter equals the critical value. For the conventional threshold z = 1.96, this corresponds to a study with approximately 50% power. If power is lower, the expected distribution slopes downward across the threshold, producing more just-nonsignificant than just-significant results. If power is higher, the distribution slopes upward across the threshold, producing more just-significant than just-nonsignificant results even in the absence of publication bias. Thus, deviations from caliper symmetry can reflect the power distribution of studies rather than selective publication or p-hacking.

This vulnerability becomes more consequential with wider caliper intervals. With negative slopes near the threshold, as in low-powered settings, the assumption of local flatness reduces the power of the caliper test to detect publication bias. With positive slopes near the threshold, as in high-powered settings, there are more observations in the interval above the criterion value than below it even without bias. Thus, the caliper test can falsely identify publication bias when the literature has high power or when the mixture distribution slopes upward around the significance threshold. It is therefore unclear whether positive caliper-test results in some applications reflect bias or the expected shape of the z-value distribution.

Schneck (2017) conducted a Monte Carlo simulation comparing the caliper test to Egger’s test, p-uniform, and the test for excess significance (TES). He found that the 5% caliper maintained acceptable false-positive rates but had low power with fewer than 1,000 studies. The 10% and 15% calipers showed inflated false-positive rates at large K, because wider calipers span a larger portion of the density curve where the local-uniformity assumption can break down. Schneck recommended the 5% caliper for discipline-wide analyses with large K. However, a small caliper does not solve the problem of true asymmetric distributions. With large K, even small departures from local symmetry can be estimated precisely, and the caliper test can become significant even if there is no publication bias.

Simulation studies using z-curve’s heterogeneous effect-size framework reveal the problem more starkly. In a simulation with high average power, fewer than 200 studies, and no bias, the caliper test detected bias 100% of the time. Thus, the test should not be interpreted as evidence of publication bias without inspecting the expected or observed shape of the z-value distribution.
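
The mechanism is easy to demonstrate. In the following sketch (parameter values are my own and are not meant to reproduce the exact numbers above), all z-values come from a high-powered distribution and every result is reported, yet the caliper test rejects far more often than the nominal 5% because the density slopes upward at the threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
false_positives = 0
for _ in range(1000):
    # 200 studies with ~80% power (noncentrality 2.8) and no selection
    z = rng.normal(loc=2.8, scale=1.0, size=200)
    w = 0.20 * 1.96  # 20% caliper
    over = int(np.sum((z >= 1.96) & (z < 1.96 + w)))
    under = int(np.sum((z >= 1.96 - w) & (z < 1.96)))
    p = stats.binomtest(over, over + under, p=0.5,
                        alternative='greater').pvalue
    false_positives += p < .05
print(false_positives / 1000)  # well above .05 despite no publication bias
```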

This is not merely a calibration problem that can be fixed by adjusting the significance level or caliper width. Narrower calipers can reduce curvature-induced artifacts, but they cannot remove the conceptual mismatch between what the test assumes, local symmetry, and the actual distribution of z-values when the density slopes across the threshold.

This limitation is not shared by all bias-detection methods. Methods that model the full distribution of z-scores, such as z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022), can estimate the expected shape of the z-value distribution under heterogeneous power and selection. The advantage of the caliper test is that it can have high power to detect threshold-related discontinuities in some conditions. Its disadvantage is that it can also provide false evidence of bias when the expected distribution is asymmetric. Therefore, the caliper test should be used together with a plot of the z-value distribution. A positive slope for significant values is a red flag because it violates the local-symmetry assumption of the caliper test.

Summary

The caliper test is a simple, widely used tool for detecting threshold-related publication bias in large literatures. It is most reliable when the expected distribution of test statistics is approximately locally symmetric around the significance threshold in the absence of bias. In literatures where the z-value distribution slopes across the threshold — whether because of high power, low power, or heterogeneous true effects — the test can mistake the expected shape of the distribution for evidence of selective publication or p-hacking. This problem is especially relevant in discipline-wide analyses in the social sciences, where studies often address different hypotheses, use different designs, and have heterogeneous statistical power. Researchers using the caliper test in such settings should interpret positive results with caution and consider model-based alternatives that account for the expected shape of the z-score distribution.

References

Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology, 6, MP.2020.2720.

Berning, C. C., & Weiß, B. (2016). Publication bias in the German social sciences: An application of the caliper test to three top-tier German social science journals. Quality & Quantity, 50, 901–917.

Brodeur, A., Cook, N., & Heyes, A. (2020). Methods matter: p-hacking and publication bias in causal analysis in economics. American Economic Review, 110(11), 3634–3660.

Brodeur, A., Lé, M., Sangnier, M., & Zylberberg, Y. (2016). Star Wars: The empirics strike back. American Economic Journal: Applied Economics, 8(1), 1–32.

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874.

Gerber, A. S., & Malhotra, N. (2008a). Do statistical reporting standards affect what is published? Publication bias in two leading political science journals. Quarterly Journal of Political Science, 3(3), 313–326.

Gerber, A. S., & Malhotra, N. (2008b). Publication bias in empirical sociological research: Do arbitrary significance levels distort published results? Sociological Methods & Research, 37(1), 3–30.

Gerber, A. S., Malhotra, N., Dowling, C. M., & Doherty, D. (2010). Publication bias in two political behavior literatures. American Politics Research, 38(4), 591–613.

Schneck, A. (2017). Examining publication bias — a simulation-based evaluation of statistical tests on publication bias. PeerJ, 5, e4115.

Power Failure: The Test of Excessive Significance for Heterogeneous Data is Excessively Conservative

Creative people come up with new ideas. This does not mean that all of their new ideas are good or worthwhile. Personally, I have had some ideas that seemed brilliant at the moment, but did not survive closer inspection, simulation studies, or critical comments by people like my collaborator Jerry Brunner, when we developed z-curve.

In my opinion, one of the best ideas of T. D. Stanley was to focus on the top 10% of publications when conducting a meta-analysis (Stanley, 2010). It is well known that many studies have low power and inflated effect-size estimates. Instead of trying to correct for this bias, a better strategy is to focus on studies with strong evidence.

The same is also true for articles. Given the pressure to publish, the incentive structure rewards publishing articles that make only a small contribution to the literature. Moreover, articles are often sold by exaggerating the importance and novelty of the work, and it is easy to claim novelty simply by ignoring relevant prior work.

An article by Stanley et al. (2021) illustrates my point. The article introduced a new method to detect publication bias. The main limitation of this article is that it did not examine the performance of this new test across a wide range of scenarios and it did not compare the new method to other tests that may have more power to detect publication bias. Here I show that the new test performs much worse than existing methods. Thus, this article adds nothing to the literature on publication bias, demonstrating that not all new ideas are good ideas.

Fixing the Test of Excessive Significance

There are many approaches to detecting publication bias. Older approaches focus on the association between effect-size estimates and sampling error. For example, small studies with large standard errors often produce larger effect-size estimates than large studies with small standard errors, a pattern that may indicate publication bias. However, this association can also be produced by other mechanisms, including genuine heterogeneity, design differences, or scale artifacts.

A newer class of tests takes a different approach. Instead of examining the relation between effect sizes and standard errors, these tests compare the observed success rate of a set of studies with the success rate that would be expected from their statistical power (Sterling et al., 1995; Ioannidis & Trikalinos, 2007; Bartoš & Schimmack, 2022). The logic is simple. In the absence of publication bias, the percentage of statistically significant results should match the average power of the studies. If the observed discovery rate is much higher than the expected discovery rate, this provides evidence that nonsignificant results are missing or that significant results were produced by questionable research practices.

The main challenge for power-based tests is that true power is unknown and has to be estimated. Different tests therefore differ mainly in how they estimate average power. This difference is crucial. If average power is overestimated, the test will be conservative and may fail to detect bias. If average power is underestimated, the test may falsely suggest bias. Thus, the performance of any test of excess significance depends on whether it can estimate average power accurately in realistic literatures with heterogeneous sample sizes, designs, measures, and effect sizes.

The Test of Excess Significance (TES) was first proposed for meta-analyses of relatively similar studies, where it may be reasonable to assume that studies estimate the same or similar population effects (Ioannidis & Trikalinos, 2007). In a common application of TES, a meta-analytic effect-size estimate is used, together with each study’s sample size or standard error, to compute the expected power of each study. The expected number of significant results is then compared with the observed number of significant results. This approach can work well when population effect sizes are relatively homogeneous, but it becomes problematic when studies differ substantially in their true effects (Johnson & Yuan, 2007; Ioannidis, 2013; Schneck, 2017). This limitation matters because many psychological meta-analyses show moderate to large between-study heterogeneity (van Erp et al., 2017).
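
A minimal sketch of this common TES implementation for two-group designs (my simplification; published implementations use exact noncentral distributions rather than the normal approximation):

```python
import numpy as np
from scipy import stats

def tes(d_meta, n_per_group, n_significant, alpha=0.05):
    """Test of Excess Significance sketch: expected power per study is
    computed from the meta-analytic effect size, and the observed number
    of significant results is compared with the expected number."""
    n = np.asarray(n_per_group, dtype=float)
    ncp = d_meta * np.sqrt(n / 2)             # expected z per study
    z_crit = stats.norm.ppf(1 - alpha / 2)
    power = 1 - stats.norm.cdf(z_crit - ncp)  # approximate power per study
    expected = power.sum()
    var = np.sum(power * (1 - power))         # sum of Bernoulli variances
    z = (n_significant - expected) / np.sqrt(var)
    return expected, 1 - stats.norm.cdf(z)    # one-sided p for excess
```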

This problem has been known since TES was published, but Stanley et al. (2021) were the first to propose a fix. The new test uses the same approach to estimate average power from a meta-analytic effect size, but adds an estimate of heterogeneity to the power calculation for individual studies. The new test is called the Proportion of Statistical Significance Test (PSST).

PSST allows true effects to vary across studies while still relying on a meta-analytic average effect size as the center of the model. This creates tension in the estimation of expected significance. The true effect in a particular study may be smaller or larger than the meta-analytic average. PSST treats this variation as additional uncertainty by adding the heterogeneity variance to the sampling-error variance. This makes the test more conservative or, in other words, less sensitive. The tradeoff is the same as lowering alpha in a significance test: the correction reduces Type I errors, because the test is less likely to mistake real heterogeneity for publication bias, but it increases Type II errors, because the test becomes less likely to detect publication bias when it is present.
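
A sketch of that idea (my reading of the PSST logic, not the authors’ code): the heterogeneity variance tau² is added to the sampling variance of each study’s z-statistic before power is computed.

```python
import numpy as np
from scipy import stats

def psst_power(d_meta, tau, n_per_group, alpha=0.05):
    """Expected power per study when true effects vary around the
    meta-analytic mean with standard deviation tau (two-group design)."""
    n = np.asarray(n_per_group, dtype=float)
    mu = d_meta * np.sqrt(n / 2)      # expected z at the mean effect
    sd = np.sqrt(1 + tau**2 * n / 2)  # sampling plus heterogeneity variance
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return (1 - stats.norm.cdf((z_crit - mu) / sd)
            + stats.norm.cdf((-z_crit - mu) / sd))
```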

Inspired by Sterling et al. (1995), Schimmack (2012) developed another approach to estimating average power that can be used with heterogeneous literatures. The Incredibility Index computes observed power separately for each study, based on the reported statistical result. Observed power can be computed directly from test statistics or indirectly by transforming p-values into power estimates. Because the calculation is study-specific, the method does not require all studies to share the same population effect size.

Schimmack introduced the Incredibility Index as a continuous measure of credibility rather than as a dichotomous significance test. However, the same probability model also yields a p-value: the probability of obtaining at least as many significant results as observed, given the estimated power of the studies. In the present simulations, I use this p-value with alpha = .05 to compare its Type I error rate and power with PSST and the caliper test. In this operational sense, the Incredibility Index is evaluated as the Incredibility Test.
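
A minimal sketch of this operationalization (using mean observed power in a single binomial test is my simplification):

```python
import numpy as np
from scipy import stats

def incredibility_test(z_values, alpha=0.05):
    """Incredibility Test sketch: observed power is computed from each
    reported z-value, and a binomial test evaluates whether the observed
    number of significant results exceeds what mean observed power allows."""
    z = np.abs(np.asarray(z_values, dtype=float))
    z_crit = stats.norm.ppf(1 - alpha / 2)
    obs_power = 1 - stats.norm.cdf(z_crit - z)  # observed power per study
    k_sig = int(np.sum(z > z_crit))
    p = stats.binom.sf(k_sig - 1, len(z), obs_power.mean())
    return obs_power.mean(), p
```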

The key drawback of the Incredibility Index is that selection for significance inflates observed power. However, this makes the method conservative because it overestimates average true power. When the observed success rate exceeds mean observed power even though observed power is already inflated by selection, the evidence for bias is especially strong.

PSST and the Incredibility Test are both conservative tests, but for different reasons. PSST is conservative because it treats heterogeneity as additional uncertainty, making the test less sensitive to excess significance. The Incredibility Test is conservative because it uses observed power, which is itself inflated by selection for significance. The substantive question is therefore not whether either test avoids false positives, but which test is more sensitive to publication bias when bias is actually present. The following simulation studies examine this question.

Simulation Studies

I conducted simulation studies to examine the performance of the caliper test, the Incredibility Test, and PSST. The results are part of a broader examination of bias-detection methods with heterogeneous data, but the results for PSST are clear and can be reported in a short blog post. The simulation has several parameters. Here the sample size is N = 100, the average effect size is d = 0.4, and heterogeneity in effect sizes is moderate, tau = 0.4. Studies are generated under a given level of selection bias until 100 significant results are obtained. The simulated sample sizes and effect sizes imply about 50% average power. Thus, without publication bias, there are roughly as many non-significant as significant results.
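
The data-generating step can be sketched as follows (a simplified version; I assume a two-group design with N = 100 in total, so the expected z-statistic is d·√(N/4)):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_studies(k_sig=100, n_total=100, d_mean=0.4, tau=0.4, sel=0.0):
    """Draw heterogeneous studies until k_sig significant results are
    collected; sel is the probability that a nonsignificant result is
    suppressed (0 = no publication bias, 1 = only significant results)."""
    kept, n_sig = [], 0
    while n_sig < k_sig:
        d = rng.normal(d_mean, tau)                  # study-specific effect
        z = rng.normal(d * np.sqrt(n_total / 4), 1)  # observed z-statistic
        if abs(z) > 1.96:
            n_sig += 1
            kept.append(z)
        elif rng.random() > sel:                     # survives selection
            kept.append(z)
    return np.array(kept)
```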

Figure 1 shows the results of one simulation run without publication bias. Here the exact numbers were 85 non-significant and 100 significant results. Figure 1 is a z-curve plot that also shows another way to examine publication bias: z-curve fits a model to the distribution of significant results and then extrapolates the distribution into the range of non-significant z-values. The Expected Discovery Rate (EDR) is just another measure of average power without selection that can be compared to the observed discovery rate (ODR). Here the EDR of 53% is close to the ODR of 54%, and the model prediction (dotted red line) fits the observed distribution (light purple bars) well. However, even with 100 significant results, the CI around the EDR is very wide, implying low power to detect bias when bias is present.

The caliper test can be slightly biased with large K (Schneck, 2017), but in this scenario all tests produce fewer than 5 out of 100 significant results: Incredibility Test (0), PSST (0), and caliper test (4). Thus, higher power to detect bias does not come at the expense of inflated Type I error rates.

Results

Table 1 shows the results for different levels of publication bias. The results are clear. PSST is less powerful than existing tests like the Incredibility Test and the caliper test. It is also noteworthy that all tests have low power to detect publication bias even if 50% of all non-significant results are missing. Thus, Type II errors and low power are a major concern, and using less powerful bias tests is likely to result in a high rate of false-negative results.

Below are some illustrations of the simulations with different amounts of publication bias.

100% Selection Bias

In some pathological literatures, researchers condition on significance and report only significant results (Sterling et al., 1995). No statistical test is needed to see publication bias in this scenario (Figure 2).

In the actual simulation, all three tests showed evidence of bias with alpha = .05. To examine the power of these tests to detect publication bias, I lowered the amount of publication bias, starting with 90% selection bias (Figure 3).

90% Selection Bias

Visual inspection of the z-curve plot still strongly suggests that bias is present. Z-curve estimates that the true power of all studies (Expected Discovery Rate) is only 66% and that many non-significant results are missing.

The power of the Incredibility Test (IT) in this scenario is 100%. The caliper test has 98% power. The PSST has 95% power. So, this scenario does not clearly distinguish between the three tests.

80% Selection Bias

Visual inspection of the z-curve plot still suggests publication bias. With 80% selection bias, power of the IT is 98%. The caliper test still has 88% power, but the PSST drops to 30%. Thus, visual inspection of a z-curve plot is often more powerful than the PSST.

Discussion

The results show that the heterogeneity correction in PSST can substantially reduce sensitivity to publication bias. The correction addresses a real problem with the original TES: when true effects vary across studies, TES can mistake genuine heterogeneity for publication bias. However, the present simulations show that the correction may overcompensate. By treating heterogeneity as additional uncertainty, PSST becomes more conservative, but this reduction in Type I error comes at the cost of a large increase in Type II error.

In the present simulations, PSST performed well only when selection bias was extreme. When selection bias was still very large but less extreme, PSST had much lower power than the Incredibility Test and the caliper test. This suggests that PSST may fail to detect publication bias in precisely the kinds of heterogeneous literatures where bias tests are most needed.

This is important because bias tests are often used rhetorically in meta-analyses. If a low-powered bias test fails to reject the null hypothesis, this result can be misinterpreted as evidence that publication bias is absent. When different bias tests produce different results, authors may also describe the evidence as inconsistent or inconclusive, rather than asking which tests have adequate power under realistic conditions. To avoid this problem, publication-bias tests need to be evaluated with standard operating characteristics, especially Type I error rates and power (Renkewitz & Keiner, 2019).

The Incredibility Test has received little attention in these comparisons, even though it has an important advantage for heterogeneous literatures: it estimates expected significance at the study level rather than assuming a common population effect size. The present simulations are not a complete evaluation of the Incredibility Test, the caliper test, or PSST. However, they show that PSST should not be treated as a generally superior solution to the heterogeneity problem in TES. Its conservatism can come at a substantial cost, making it much less sensitive to publication bias than alternative tests.

In conclusion, PSST fixes one problem of TES by creating another: it reduces false positives under heterogeneity, but it can also make real publication bias difficult to detect.

Meta-Discussion

The results provide clear evidence that the Incredibility Test and the caliper test with a 20% caliper are more powerful tests of publication bias than the PSST. The PSST was a new idea to detect publication bias in heterogeneous sets of studies, but not all new ideas are good ideas.

My AI collaborator puts it this way:

“A methods paper that introduces a strictly inferior tool doesn’t advance the field even if the underlying reasoning is sound. It actually does harm: practitioners who adopt PSST will fail to detect real bias and may cite the nonsignificant result as evidence of no bias. That’s worse than having no correction at all, because at least with uncorrected TES the inflated Type I rate occasionally catches real bias.”

My AI advisor also told me to delete the following meta-scientific discussion of publishing articles that do not advance science.

Stanley et al.’s article adds to the 90% of articles that are better forgotten. It also shows the problem of publishing new ideas too quickly. A careful examination of the existing literature may reveal that an older idea is better, making a new publication unnecessary. This is particularly valuable advice for Ioannidis, who has published over 1,300 articles. Maybe thinking a bit more and reading others’ work would lead to real advances that are worthy of publishing. Even with 61 citations so far, this article may not add to Ioannidis’s H-index and may lower his quality-adjusted H-index (Schimmack, 2026). However, until the reward structure of science changes, we will continue to see many articles published that are best ignored. A new role for meta-scientists is to tell busy scientists which 10% of articles are worth reading.

A Quality Adjusted H-Index: Less is More

Measures are necessary because unmeasured work is easily ignored. If universities want to reward merit, they need some way to identify it. The problem is that every measure can be gamed. A recent example comes from prediction markets: French authorities investigated possible tampering with a weather sensor at Charles de Gaulle Airport after unusual temperature spikes coincided with profitable bets on Paris temperatures on Polymarket. The case illustrates a general principle: once a number has consequences, people have incentives to influence the number rather than the underlying reality. (WSKG)

The same problem exists in academia. Professors’ work is difficult to evaluate. Research quality, originality, theoretical importance, mentorship, service, and long-term influence are not easily reduced to a single number. Yet universities still need to make decisions about hiring, promotion, salaries, awards, and prestige. This need has encouraged the use of quantitative performance indicators, such as the number of publications, total citations, and the H-index.

These measures capture something real, but each creates distortions. Publication counts reward volume, even when many papers have little influence. Citation counts reward visibility and cumulative attention, but can be inflated by a few highly cited papers, large collaborations, review articles, field size, and self-citation. The H-index improves on both by requiring a body of cited work, but it has its own blind spot: it ignores how many low-impact papers were produced alongside the influential ones.

It is well known that academic metrics are biased because researchers can influence both citation counts and publication counts. Self-citations are relatively easy to detect, and can be excluded if necessary. Citations from close peer networks are harder to evaluate. Mutual citation practices, honorary coauthorship, strategic review writing, conference visibility, social media promotion, and aggressive self-marketing can all increase citation counts without necessarily reflecting greater intellectual merit.

Publication counts are even easier to inflate. The simplest strategy is to divide research into many small papers, submit weak papers repeatedly until they are accepted somewhere, or publish in journals with low rejection rates. In some cases, this includes pay-to-publish outlets that rely more on publication fees than on rigorous peer review. These practices do not imply that all highly productive scholars are gaming the system, but they show why raw publication counts are poor indicators of quality.

In principle, the problem could be solved by independent evaluations of scientific quality. In practice, this is difficult. Quality is multidimensional: a paper may be technically rigorous but unimportant, original but wrong, influential but misleading, or methodologically imperfect but theoretically generative. Expert judgment is necessary, but it is also subjective, costly, and vulnerable to reputation, ideology, personal networks, and disciplinary fashions.

As a result, universities rely on imperfect quantitative proxies. These proxies are attractive because they are easy to count, but they are incomplete. They measure visibility and productivity more easily than they measure quality. The challenge is not to find a perfect metric, but to design metrics that are harder to game and that capture dimensions of merit ignored by existing indicators.

This is where the low-impact tail becomes relevant. A publication record with many highly cited papers and few low-cited papers conveys something different from a publication record with the same H-index but hundreds of additional papers that attracted little attention. The conventional H-index ignores this distinction. A quality-adjusted H-index makes it visible.

A long tail of low-impact publications has negative effects on science. It crowds out potentially better work by other researchers. It also consumes resources, especially when publications are supported by publicly funded grants or paid publication fees. It may even hurt the authors themselves. Time spent producing many low-impact articles is time not spent developing fewer, more substantial contributions. Rewarding efficiency may therefore benefit science by shifting incentives away from maximizing publication counts and toward producing work that has durable influence.

The proposed index is simple. It requires only two pieces of information: the total number of publications, N, and the H-index, H. The H-index rewards a sustained body of impactful work. It does not solve the problem that citations are only a proxy for quality, but that is not the purpose of the new index. The purpose is to adjust the H-index for publication efficiency.

Efficiency can be defined as the proportion of publications that belong to the H-core:

Efficiency = H / N

A researcher with an H-index of 100 and 400 publications is more efficient than a researcher with the same H-index and 1,000 publications. Both have the same citation core, but the second author needed many more publications to achieve it.

Combining impact and efficiency gives:

QH-index = Impact × Efficiency
QH-index = H × H/N
QH-index = H²/N

Examples

John P. A. Ioannidis is a prominent Stanford scientist and meta-scientist whose work has focused on improving scientific credibility and reducing false findings. He has an impressive H-index of 190 and an even more impressive total of 1,396 publications. Based on traditional metrics, this is an extraordinary record.

However, the record looks different when efficiency is taken into account. To achieve an H-index of 190, Ioannidis produced 1,396 publications. His efficiency is therefore:

190 / 1,396 = .136

Thus, 13.6% of his publications are in the H-core. His quality-adjusted H-index is:

190² / 1,396 = 25.9

Ed Diener was one of the most influential social and personality psychologists and helped establish the scientific study of subjective well-being. His H-index is 126, which is lower than Ioannidis’s H-index of 190. However, Diener produced 493 publications. His efficiency is therefore:

126 / 493 = .256

Thus, 25.6% of his publications are in the H-core, nearly twice Ioannidis’s efficiency. His quality-adjusted H-index is:

126² / 493 = 32.2

The conventional H-index ranks Ioannidis higher. The QH-index ranks Diener higher because Diener achieved a large citation core with a much smaller publication record.
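
The computation is simple enough to verify in a few lines:

```python
def qh_index(h, n_publications):
    """Quality-adjusted H-index: impact (H) times efficiency (H/N)."""
    return h**2 / n_publications

print(round(qh_index(190, 1396), 1))  # Ioannidis: 25.9
print(round(qh_index(126, 493), 1))   # Diener: 32.2
```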

Conclusion

No simple quantitative indicator is a perfect measure of merit. Still, universities and funding agencies need some measures to allocate limited resources. The H-index was designed to avoid rewarding researchers merely for producing many low-impact publications. It improved on simple publication counts by requiring a body of cited work. Yet the H-index still has a blind spot: once the H-core is established, additional low-impact publications carry no penalty.

The QH-index addresses this problem. It preserves the central virtue of the H-index by rewarding sustained impact, but it discounts this impact when it is accompanied by a large number of low-impact publications. Publishing more articles is beneficial only when it increases the citation core. Producing a large long tail of low-impact work lowers the score. This corrective may help reduce incentives to publish as much as possible without regard to the quality or influence of the work.

Closed Review == Censorship

Anonymous Closed Peer-Review is Censorship

Every self-interested entity in power wants to control public opinion. Billionaires buy newspapers, not to make more money, but to use their money to push their personal agenda. Totalitarian governments control access to free information to keep their citizens uninformed. The same human behavior is also visible in science, but it is often ignored.

British lords invented the “peer” (not you and me, but other lords) review system when they engaged in scientific debates as a hobby. Today, science is a billion-dollar industry and scientists are self-interested actors in this system. Closed peer-review is still used to sell the public the impression that scientists control themselves to ensure that published articles meet the highest standards of scientific research. In reality, the closed peer-review system is used to control information and repress criticism.

The ability to influence which information gets the stamp of peer-review approval is also the main motivation for taking on the thankless job of an editor. The only reward is the power to decide which of the many submissions get published. High rejection rates are used to claim rigorous quality control, but in reality they give editors power to influence the narrative.

The problem is amplified at journals that focus on a specific narrow topic. These journals were often created by scientists who were not able to publish their work in other journals because their work was not considered important by the editors of those journals. For example, Cognition and Emotion was created in 1991 because psychology shunned research on emotions and, even after the affective revolution in the 1980s, it was difficult to publish emotion research in mainstream psychology journals.

Creating a journal to publish important work is itself a positive response to censorship. Rickard Carlsson and I used this approach to make it easier to publish research on meta-psychological topics that were difficult to publish elsewhere. However, the danger is that oppressed groups become oppressors when they gain power, and closed peer-review gives editors at these new journals the power to control the narrative, except that it is now their narrative and their self-interests that decide what gets published. The only way to avoid this trap is to dismantle the power structure. That is what Rickard did with Meta-Psychology. First, articles are not rejected. They are improved until they meet basic scientific standards. Thus, there is no tool to suppress work because it is “not novel enough,” “only a small increment,” or “outside of the scope of this journal,” and no desk rejections with a note that the journal simply cannot publish all of the important work that is done. The real reason is often that the editor did not like a paper.

In short, closed peer-review is not what the general public thinks it is. Rather than ensuring that research meets basic scientific standards, it is used to reward people to follow the party line and punish people who want to publish critical work.

Open Science Reforms

In psychology, the academic discipline I know because I have worked in it for over 30 years, the problem of censorship became apparent during the replication crisis in the 2010s. Peer-review had failed to ensure that published results are scientifically valid. Lack of training and understanding of science itself was partly to blame, but the bigger reason was that peer-reviewers were happy to publish bad research because they were doing the same bad research and had an interest in publishing results that benefited their own work. Yes, I am talking about the implicit revolution (Greenwald’s words, not mine) that seemed to show that much of human behavior is caused by mindless responses to situational cues without people even noticing it. Call it implicit, automatic, or unconscious, experiment after experiment seemed to support these claims.

In reality, research on the unconscious worked very much like Freud’s model of unconscious processes. Undesirable results were repressed, and only results that supported researchers’ claims were published. This became apparent after Bem even showed time-reversed unconscious processes, which nobody was willing to believe. When other studies were replicated, they also failed to provide support for their claims and the implicit revolution imploded. Peer-review had failed as a quality-control mechanism. Rather, censorship had created a bubble of false findings. It doesn’t take a psychoanalyst to realize that this realization was painful and that many old researchers resorted to defense mechanisms to avoid the emotional consequences of realizing that their achievements were illusory.

Open science requires open sharing of all findings and arguments. It also requires that conclusions are consistent with the evidence and logically coherent. This open exchange cannot happen in a closed peer-review system where editors control the narrative. The new quality assurance is not “peer-reviewed,” but “open peer reviews,” and publication of all arguments on both sides. It is also important to get rid of journal rankings as a way to evaluate the quality of research. Journal rankings only ensure that editors of prestigious journals have even more power to control the narrative. I experienced this firsthand. When I submitted my first critique of the Implicit Association Test to the prestigious journal “Perspectives on Psychological Science,” the editor rejected it. When I tried again several years later, a new editor accepted it. Neither decision was based on the quality of the work or the argument; it was just a personal preference.

A Scientific Utopia

Most editors also do not read the articles they handle or provide their own comments. The bias is often introduced by picking reviewers who will like or dislike a paper (I know, I was Ed Diener’s henchman, his words, not mine). So, they really do not add anything of value. Even current AI (large language models) is better able to evaluate the scientific merits of a paper, and we could replace human editors with AI: a faster, more cost-effective, and less biased way to make decisions about publications that are essential for young scientists’ careers.

Losing Sight of the Sign: ANOVA and Significance Testing


When Fisher developed the F-test at Rothamsted Experimental Station in the 1920s, he was solving a real problem. Agricultural field trials had multiple treatment conditions — different fertilizers, different watering regimes, different crop varieties — and the question was whether any of these treatments affected yield. The F-test answered exactly that question: is there more variation between treatments than within them? Any departure from the null was practically interesting because farmers don’t care about direction — they care about which treatment produces the most wheat.

The F-test does this by squaring the differences. The test statistic is always positive. Direction disappears. This is a feature, not a bug, when you have five fertilizers and want to know whether they differ. The omnibus test screens for something worth following up. Fisher then followed up with his Least Significant Difference — ordinary pairwise comparisons gated by the significant F. The omnibus test was a screening step, not the conclusion.

Psychology imported this machinery wholesale, starting in the 1940s. Gigerenzer has told the story of how Fisher’s methods were “cleansed of their agricultural odor” by textbook writers who created the null ritual — mechanical significance testing at p < .05 without specifying alternatives or computing power. But there is a more specific problem that has received less attention: the F-test, by squaring away the sign, trained psychologists to think about hypotheses in unsigned terms. “Is there an effect?” replaced “What is the effect and in which direction?”

This matters less than you might think for multi-group designs, where the omnibus F is doing real work. Testing whether five conditions differ before examining pairwise comparisons is not a ritual — it is a principled gating procedure for multiplicity control. MANOVA before univariate ANOVAs follows the same logic. These are legitimate steps in a testing hierarchy.

But psychology was not mostly running five-group designs. The workhorse experiment had two conditions: treatment versus control. With two groups, F(1, n−2) = t²(n−2). The tests are mathematically identical, but they look different. The t-test has a sign. You can see whether the treatment group scored higher or lower. The F-test strips that away. Psychologists reported F(1, 58) = 4.12, p < .05 when they could have reported t(58) = 2.03, p < .05, and every reader would have immediately seen the direction.
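
The identity is easy to verify numerically (a quick sketch with simulated data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treatment = rng.normal(0.5, 1.0, size=30)
control = rng.normal(0.0, 1.0, size=30)

t, p_t = stats.ttest_ind(treatment, control)
f, p_f = stats.f_oneway(treatment, control)
print(t**2, f)   # identical up to floating-point error
print(p_t, p_f)  # the two-sided p-values match as well
```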

Things got silly. Mark Rubin (2022) argued that it is illegitimate to infer the direction of an effect from a two-sided test — that a significant F or two-sided t only licenses the claim that the means differ, not which is larger. Formally, he is correct that the F-test does not output a direction. But the means are part of the same analysis, and pretending you cannot look at them is a confusion of the test statistic with the inference. Observing that the treatment mean is 12.4 and the control mean is 8.7, and reporting F(1, 58) = 4.12, p < .05, does tell you the treatment increased the outcome. The test confirms the difference is unlikely under the null; the means tell you which way it goes.

This is where the critique of nil-hypothesis testing enters, and where it went partly wrong. Meehl (1967) and Cohen (1994) argued that rejecting the nil hypothesis is scientifically uninformative. They were right that “the means differ somewhere” is a weak conclusion — but this criticism applied most forcefully to the multi-group omnibus F, where Fisher intended it as a screening step. For two-group comparisons, the F-test was always testing a directional effect with the sign obscured.

Here is the point that seems to have been missed. With two groups, a significant F(1, df) at α = .05 is identical to a significant two-sided t(df) at α = .05. And a two-sided test at α = .05 is equivalent to two one-sided tests at α = .025 each. When you reject H₀: μ₁ = μ₂ with a two-sided test and observe that the treatment mean is higher, you have also rejected the directional null H₀: μ₁ ≤ μ₂ at p/2. If you reject μ₁ = μ₂ and observe μ₁ > μ₂, you have rejected μ₁ ≤ μ₂. The sign was always being tested — the F-test just made it invisible.
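
Again, this is easy to check numerically (sketch):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.6, 1.0, size=40)  # treatment
b = rng.normal(0.0, 1.0, size=40)  # control

_, p_two = stats.ttest_ind(a, b)
_, p_one = stats.ttest_ind(a, b, alternative='greater')
print(p_two / 2, p_one)  # equal when the observed difference is positive
```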

This means that much of the criticism of NHST — including Gelman and Carlin’s concern about Type S (sign) errors — rests on a misunderstanding. The worry is that a significant result might have the wrong sign, especially in underpowered studies. But a two-sided test that rejects the nil hypothesis does test the sign. The alternative hypothesis is not just “the means differ” — it is partitioned into “the treatment mean is higher” and “the treatment mean is lower,” and the data tell you which one. The sign error problem is real in the sense that underpowered studies can produce unreliable estimates, but it is not a gap in the logic of the test itself. The F-test merely hid this from view.

The real lesson is that statistical tools shape how scientists think. The F-test did not just analyze data — it structured how psychologists formulated hypotheses. By squaring away the sign, it turned every research question into “is there an effect?” rather than “how big, in which direction?” Two generations of methodologists then criticized significance testing for answering a trivial question, when the real problem was that the wrong test statistic was being used for the wrong design. Using t-tests preserves the sign. A positive and significant t-value means the treatment group has a higher mean than the control group, and that this difference is unlikely to have occurred without a real effect. A negative t-value has the opposite implications. Even when the F-test obscures the direction, the observed means still show it, and the significant test licenses the directional inference.

All of these problems dissolve with confidence intervals. A CI that excludes zero and is entirely positive tells you the sign, the magnitude, and the uncertainty in a single object. The F-test, the t-test, and the one-sided versus two-sided debate all become unnecessary.


Heterogeneity in the Replicability of Psychological and Social Sciences

Concerns about research credibility have stimulated the growth of meta-science, a field that examines the reproducibility, robustness, and replicability of scientific findings (Ioannidis, 2005; Munafò et al., 2017). This literature has documented publication bias, low statistical power, inflated effect size estimates, and disappointing replication rates in some areas of research (Button et al., 2013; Ioannidis, 2005; Open Science Collaboration, 2015; Tyner et al., 2026). Initial studies focused on psychology and neuroscience, but a recent article suggests that the problems are more general: Tyner et al. (2026) reported that only about 50% of originally significant claims were successfully replicated.

A replication rate of 50% invites different interpretations. An optimistic interpretation is that most original studies detected effects in the correct direction, but that the average probability of obtaining another significant result in a new sample was only about 50%. In this scenario, selective publication of significant results inflates observed effect sizes, so replication studies often fail even when the original studies were not false positives. Many of the failures are therefore false negatives. A pessimistic interpretation is that many original results were false positives, whereas the remaining studies examined true effects with high power. In that case, the same 50% replication rate could arise from a mixture of null effects and highly powered true effects. Thus, the average replication rate alone is consistent with very different underlying realities.

To move beyond average replication rates, it is necessary to avoid reducing results to a dichotomy of significant versus non-significant. A cutoff at z = 1.96 is useful for decision making, but it discards quantitative information about the strength of evidence. A result with z = 6 provides much stronger evidence for a positive effect than a result with z = 2, just as z = -6 provides much stronger evidence for a negative effect than z = -2. This point is straightforward, but broad evaluations of replication outcomes have largely ignored differences in original evidential strength.

I used z-curve to examine heterogeneity in the strength of evidence across the original significant findings included in the two large replication projects (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022). Z-curve uses the distribution of significant z-values and corrects for the inflation in observed test statistics introduced by selection for significance. It provides two key estimates. The first is the Expected Replication Rate (ERR), which is the average probability that a significant result would be significant again in an exact replication with a new sample of the same size. The second is the Expected Discovery Rate (EDR), which is the estimated proportion of all studies, including unpublished non-significant ones, that would be expected to yield a significant result.
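
A toy example with made-up power values shows why the ERR exceeds the EDR when power is heterogeneous: significant results are sampled in proportion to power, so high-powered studies are overrepresented among them.

```python
import numpy as np

# Toy mixture: half of all studies have 20% power, half have 80% power.
power = np.array([0.2, 0.8])
weights = np.array([0.5, 0.5])

edr = np.sum(weights * power)           # mean power of all studies: 0.50
err = np.sum(weights * power**2) / edr  # power-weighted mean power: 0.68
print(edr, err)
```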

The EDR can be used to evaluate publication bias and to derive an upper bound on the false discovery rate using Sorić’s (1989) formula. Performance of z-curve has been examined in extensive simulation studies, which show that its 95% confidence intervals perform well when at least 100 significant results are available (Bartoš & Schimmack, 2022). Because z-curve is designed to accommodate heterogeneity in evidential strength, it is especially suitable for a diverse set of studies such as those included in the replication projects. Previous applications have shown substantial variation in ERR and EDR across research areas (Schimmack, 2020; Schimmack & Bartoš, 2023; Soto & Schimmack, 2024; Credé & Sotola, 2024; Sotola, 2022, 2024).

One limitation of previous applications is that they sometimes relied on automatically extracted p-values or focused on specific literatures. The replication projects provide gold-standard test statistics from a representative sample of social science research, avoiding both concerns. This makes it possible to examine heterogeneity in replicability across a broad range of research areas.

All original studies in the two replication projects were eligible for inclusion. For articles with multiple claims, the focal claim was identified from the abstract using a large language model (see OSF for details and cross-validation). When exact p-values were not reported in the project materials, the original articles were consulted to recover the necessary information. Articles without exact p-values were excluded. Original studies that claimed an effect without meeting the conventional significance threshold of p < .05 were also excluded. A small number of studies were further excluded because the replication reports did not provide sufficient information to evaluate the replication outcome. This screening process yielded k = 222 significant results (k1 = 88, k2 = 134), including k = 130 from psychology and k = 92 from other social sciences. The replication rate in this subset was similar to that in the full set of studies: 43% overall (project 1: 33%, project 2: 49%; psychology: 37%; other social sciences: 51%; see OSF for details). Figure 1 shows the z-curve analysis of these 222 original significant results.

The most striking result is that the expected replication rate (ERR) is substantially higher than the observed replication rate in the replication studies (68% versus 42%). Even the lower bound of the 95% confidence interval for the ERR, 59%, exceeds the observed replication rate. This discrepancy is especially noteworthy because the replication studies often used larger sample sizes than the original studies, which should have increased, not decreased, the probability of obtaining a significant result. Thus, the lower effect sizes observed in the replication studies cannot be attributed to regression to the mean alone. An additional factor appears to be that population effect sizes in the replication studies were systematically smaller than in the original studies.

Z-curve also limits the range of scenarios that are compatible with the data. The estimated EDR of 48% implies that no more than 6% of the significant results can be false positives (Sorić, 1989). Even the lower limit of the EDR confidence interval, 17%, limits the false positive rate to no more than 26%. With roughly 50% replication failures, this suggests that no more than half of the replication failures are false positives. This finding shows the importance of distinguishing clearly between replication rates and false positive rates (Maxwell et al., 2015).
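
Sorić’s bound can be computed directly from the discovery rate; a two-line sketch reproduces the numbers above:

```python
def soric_fdr_max(edr, alpha=0.05):
    """Sorić's (1989) upper bound on the false discovery rate
    implied by a given (expected) discovery rate."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_fdr_max(0.48), 3))  # 0.057 -> at most ~6%
print(round(soric_fdr_max(0.17), 3))  # 0.257 -> at most ~26%
```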

The false positive risk also varies as a function of the significance criterion. Marginally significant results are more likely to be false positives than results with high z-values (Benjamin et al., 2018). Z-curve makes it possible to address Benjamini and Hechtlinger’s (2014) call to control, rather than merely estimate, the science-wise false discovery rate. A stricter alpha criterion reduces the discovery rate, but it reduces the false discovery rate more. Benjamin et al. (2018) suggested reducing the false positive risk by lowering the significance criterion to alpha = .005. A z-curve analysis with this criterion estimated the FDR at 2% and the upper limit of the 95% CI was 6%. This finding provides empirical support for Benjamin et al.’s (2018) suggestion. It also addresses Lakens et al.’s (2018) concern that alpha levels should be justified. Here the strength of evidence provides the justification. In other literatures, alpha = .01 is sufficient to keep the FDR below 5% (Schimmack & Bartoš, 2023; Soto & Schimmack, 2024), but sometimes even alpha = .001 is insufficient to control false positives (Chen et al., 2025; Schimmack, 2025).

Heterogeneity in strength of evidence also makes it possible to predict replication outcomes as a function of z-values. Figure 1 shows power for z-value intervals below the x-axis. Expected replication rates increase from 54% for just-significant results to over 90% for z-values greater than 5. Another 36 results have z-values greater than 6 and are practically guaranteed to replicate in exact replication studies. Figure 2 shows the expected and observed replication rates for z-value ranges.

Studies with modest evidence (z = 2 to 3.5) replicate at significantly lower rates than expected based on z-curve. As expected, replication rates increase with stronger evidence. Given the small number of observations per bin, it is not possible to test whether z-curve predictions remain too optimistic at moderate z-values. The most surprising finding is that observed replication rates for studies with strong evidence (z > 6) fall below the expected rate.

In exploratory analyses, I examined possible reasons for these surprising replication failures. I used two large language models (ChatGPT and Claude) to score the replication reports of studies with strong original evidence (z > 6). Studies were coded on five dimensions (match of populations, materials, design, time period, and implementation), each scored from 0 to 2, producing total scores from 0 to 10. Inter-rater agreement for the total scores was high, ICC(A,1) = .85, 95% CI [.73, .92]. I averaged the two scores and used a total of 7 or higher as the criterion for a close match. Of the 24 close replications, 21 were successful (88%). Of the 12 studies that were not close replications, only 6 were successful (50%).

I further examined the three close replications that failed. While Farris et al. (2008) closely matched the original in many respects, the original participants were from the US and the replication was conducted in the UK. Subsequent studies have replicated the finding with US samples (Farris et al., 2009/2010; Treat et al., 2017), ruling out a simple false positive explanation. The replication failure of Hurst and Kavanagh (2017) likely reflects a sampling problem in the original study: participants from the general population and users of community mental health services were pooled in a single analysis, which can inflate effect sizes (Preacher et al., 2005). McDevitt examined whether plumbing businesses with names starting with numbers or the letter A benefited from being listed first in the yellow pages. A replication in 2020 could not reproduce this effect because Google searches have replaced the yellow pages.

While these exploratory results are based on a small sample, they support the broader claim that original results with strong evidence (z > 6) are likely to replicate in close replications and that failures may stem from meaningful differences in study design.

Conclusion

Z-curve analysis of two major replication projects reveals that replicability in the social sciences is not a single number. The expected replication rate based on the strength of original evidence (68%) substantially exceeds the observed replication rate (42%), indicating that effect size shrinkage beyond statistical regression to the mean contributes to replication failures. The maximum false discovery rate is low (6%), confirming that most replication failures reflect reduced effect sizes rather than false positives. Adjusting the significance criterion to alpha = .005 reduces the estimated maximum false discovery rate to 2%.

The most practically useful finding is that original results with strong evidence (z > 6) are highly replicable when the replication closely matches the original study design (88% success rate). Replication failures among these strong results were attributable to identifiable differences between the original and replication studies — different populations, changed market conditions, or heterogeneous samples. This suggests that the strength of statistical evidence, combined with methodological similarity, is a reliable predictor of replication success.

These findings argue against treating all significant results as equally credible and against interpreting average replication rates as informative about any particular study. Replicability is predictable from information already available in the original publication.

The P-Curve/Z-Curve Exchange: A Methodological Dispute in Real Time

Background

In the interest of open science, this blog post summarizes a private email exchange between Uri Simonsohn — principal developer of p-curve — and me — principal developer of z-curve. The correspondence itself is not reproduced here at Simonsohn’s request. I used AI throughout the communication, and this account of the exchange was written by Claude, who was asked to write it from a neutral third-party perspective. This does not rule out the possibility of bias, but Uri is welcome to use the comment section to share his own perspective — an option that is not available on his own blog, DataColada.

Key Points

  • I shared simulations showing that p-curve overestimates average power under realistic heterogeneity while z-curve does not. Simonsohn did not challenge these results.
  • Simonsohn’s own public position since 2018 is that p-curve is biased when some studies have power above 90%. Uncertainty about effect sizes guarantees that real data will include such studies, making bias the norm rather than the exception.
  • Simonsohn argued that average power is not a meaningful quantity under heterogeneity. If so, the p-curve app should stop displaying it. If average power is meaningful, z-curve estimates it better.
  • P-curve confidence intervals do not have 95% coverage. Z-curve.3.0 has 95% coverage even with homogeneous data.
  • Z-curve provides information that p-curve cannot: estimates of average power for all studies (EDR), quantification of publication bias, and bounds on the false discovery risk.
  • Simonsohn did not address any of these points. His public position remains unchanged from 2018.


Selection Models: P-Curve and Z-Curve

P-curve and z-curve are both methods that use the distribution of significant p-values to estimate the average statistical power of a set of studies. They share the same goal but differ in a critical respect: p-curve fits a single power parameter to the data, assuming all studies have the same power, while z-curve fits a mixture model that allows power to vary across studies. When power is truly homogeneous, p-curve’s simpler model is more efficient. When power is heterogeneous — as it typically is in meta-analyses of conceptual replications — p-curve produces inflated estimates with confidence intervals that are too narrow (Brunner & Schimmack, 2020). The question at the center of this exchange was whether, and under what conditions, this difference matters in practice.
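To make the difference concrete, here is a minimal sketch of a single-power selection model fit to heterogeneous data. This is not the actual p-curve algorithm (which works with pp-values) or the z-curve code; the parameters are illustrative assumptions. It shows the key phenomenon: fitting one power parameter to significant results that actually come from a mixture overestimates the average power of the selected studies.

```python
# Minimal sketch: a single-power selection model (p-curve's key assumption)
# applied to heterogeneous data. All parameters are illustrative.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
crit = norm.ppf(0.975)                     # two-sided alpha = .05

mu = rng.normal(1.0, 1.5, 20_000)          # heterogeneous true noncentralities
z = rng.normal(mu, 1.0)
sig = z > crit                             # selection for significance (one tail)

# True average power of the studies that made it into the selected set
true_avg_power = norm.sf(crit - mu[sig]).mean()

# Homogeneous fit: one noncentrality for the truncated distribution of z | z > crit
def nll(m):
    return -(norm.logpdf(z[sig], m, 1) - norm.logsf(crit, m, 1)).sum()

m_hat = minimize_scalar(nll, bounds=(0, 6), method="bounded").x
homog_power = norm.sf(crit - m_hat)

print(f"true average power of selected studies: {true_avg_power:.2f}")
print(f"single-power estimate:                  {homog_power:.2f}")  # higher
```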

The Opening: Procedural Deflection

The exchange began when Schimmack presented evidence that p-curve overestimated average power in the Reproducibility Project data. Simonsohn’s initial response did not address the overestimation. Instead, he objected that the data had not been presented in the format of a p-curve disclosure table — a procedural requirement he had developed for auditing p-curve analyses. Schimmack pointed out that the Reproducibility Project had a uniquely transparent selection process, with key findings identified collaboratively and often with input from original authors, making the disclosure table requirement a matter of form rather than substance. Simonsohn did not contest this point but instead pivoted to personal history, characterizing the dispute as a grudge, and closed the conversation with “I will switch gears and return to my current interests.”

The Simulations: A Deck Stacked for P-Curve

Several weeks later, Simonsohn re-engaged by sharing simulation code originally developed for a 2018 blog post (DataColada 67). He reported that z-curve performed worse than p-curve in most scenarios, with the exception of one scenario Schimmack had provided. His conclusion was that “z-curve is generally slightly worse, except when there are extreme power values that bias p-curve but not z-curve.”

Examination of the simulation parameters revealed two problems. First, the effect size distributions used standard deviations of 0.05 to 0.15 in Cohen’s d units, producing near-homogeneous power across studies. Typical meta-analyses in psychology show heterogeneity of 0.3 to 0.4 or higher (van Erp et al., 2017). Under near-homogeneity, p-curve’s assumption is met by design, making the comparison uninformative about realistic conditions. Second, the simulations used only 20 to 25 studies — too few for z-curve’s mixture model to leverage its structural advantage over p-curve’s simpler model.
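For intuition about why those standard deviations matter, a small sketch (assuming an illustrative mean of d = 0.4 and n = 25 per group) shows how little power varies when SD(d) is .05 and how much it varies at realistic levels:

```python
# Sketch: how SD of true effect sizes (Cohen's d) translates into power
# heterogeneity. Two-group design, n = 25 per group, two-sided alpha = .05
# (normal approximation, correct-sign tail only). Mean d = 0.4 is assumed.
import numpy as np
from scipy.stats import norm

def power(d, n):
    se = np.sqrt(2 / n)               # approximate standard error of d
    return norm.sf(1.96 - d / se)

rng = np.random.default_rng(0)
n = 25
for sd in (0.05, 0.15, 0.35):
    d = rng.normal(0.4, sd, 100_000)
    p = power(d, n)
    print(f"SD(d) = {sd:.2f}: 10th-90th percentile of power "
          f"= {np.percentile(p, 10):.2f} to {np.percentile(p, 90):.2f}")
```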

Rather than confronting these limitations directly, Schimmack conceded that p-curve could outperform z-curve under some conditions and asked Simonsohn to identify the key moderator determining when each method performed better. Simonsohn did not answer this question directly, responding “I have no time right now.”

Discovering the Estimand Distinction

When the exchange resumed, Simonsohn’s responses revealed that he was encountering the distinction between the Expected Replication Rate (ERR) and the Expected Discovery Rate (EDR) for the first time. He wrote: “ah, it seems you do have a different estimand.” This distinction had been published in Brunner and Schimmack (2020) six years earlier and was printed as standard output by the z-curve R package that Simonsohn had been using in his simulations.

Simonsohn further questioned whether p-curve’s estimand was even well-defined under heterogeneity. Schimmack pointed out that this was precisely the problem: p-curve had a clearly defined estimator (fit a single power parameter) but an ill-defined estimand, while z-curve had clearly defined estimands (ERR and EDR) estimated by a more complex model. Under homogeneity the distinction is invisible because ERR equals EDR. Under heterogeneity it is central.

Schimmack also raised concerns about whether Simonsohn’s simulation architecture — which used an inverse CDF method to generate only significant results rather than simulating natural selection for significance — could adequately distinguish between the quantities the two methods were designed to estimate. The full implications of this concern were clarified only later in the exchange, but the immediate practical question remained: when evaluated against the correct benchmark using realistic parameters, which method performed better?
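The architectural point can be illustrated with a toy example (illustrative parameters). Both approaches produce the same truncated distribution of significant z-values, but only natural selection generates the non-significant studies from which a discovery rate could be observed:

```python
# Two ways to simulate "significant" results.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, k, crit = 1.5, 100_000, 1.96

# (a) Natural selection: run all studies, keep the significant ones.
z = rng.normal(mu, 1, k)
z_nat = z[z > crit]
print("discovery rate:", len(z_nat) / k)      # observable in design (a) only

# (b) Inverse-CDF shortcut: only significant values are ever generated.
u = rng.uniform(norm.cdf(crit, mu), 1, len(z_nat))
z_inv = norm.ppf(u, mu)                       # same truncated distribution,
                                              # but no discovery-rate information
print("means:", z_nat.mean().round(2), z_inv.mean().round(2))
```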

The Decisive Simulation

Schimmack provided modified code using Simonsohn’s own simulation framework with more realistic parameters: 50 studies, mean effect size d = 0.3, standard deviation of d = 0.25, and mean sample size of 40. These values fall well within the range observed in actual psychology meta-analyses.

The results were clear. True average power was 43%. P-curve estimated 50%, overestimating by 7 percentage points. Z-curve estimated 41%, underestimating by only 2 percentage points. The difference in accuracy was statistically significant. Z-curve's 95% confidence intervals achieved 96% coverage. Simonsohn's code did not compute confidence intervals, so p-curve's coverage could not be examined within his framework; Schimmack's own simulations showed better coverage for z-curve than for p-curve.

The Retreat to Philosophy

Faced with these results, Simonsohn shifted from methodological engagement to philosophical objection. He argued that p-curve’s bias under heterogeneity had been known since 2018, that he had acknowledged it in print, and that the bias was “not super consequential” because it occurred only with “extreme power values.” He maintained that averaging power across heterogeneous studies was inherently meaningless, that “most meta-analyses are a waste of everyone’s time,” and that the choice between p-curve and z-curve was “second order” compared to problems of study selection.

Schimmack asked Simonsohn to clarify what he meant by studies with power below 90% — whether he referred to true power (a simulation parameter under the researcher’s control) or observed power (a noisy post-hoc estimate). Simonsohn dismissed this as unimportant: “That’s one of the least important things I wrote.”

The Logical Corner

Schimmack identified a logical inconsistency in Simonsohn’s position. If average power was not a meaningful quantity under heterogeneity, then the natural conclusion would be to remove the power estimate from the p-curve app, which continued to display it to users. Most researchers relied on p-curve’s test of evidential value rather than its power estimate. Removing the estimate would be consistent with Simonsohn’s stated views, would eliminate the known bias, and would not change how most researchers used the tool. Researchers who wanted power estimates could use z-curve, which was designed for that purpose.

Simonsohn did not respond to this suggestion.

Final Conclusion

After the exchange documented above, Simonsohn provided a final response reiterating his original positions: that z-curve performs worse in most scenarios, that p-curve’s bias is caused by “extreme values” rather than heterogeneity, and that average power should not be computed at all when studies are heterogeneous. He did not address the simulation results showing p-curve’s significant overestimation under realistic heterogeneity, nor the absence of confidence intervals in p-curve’s output, nor the suggestion to remove the power estimate from the p-curve app. He requested that only his public writings be cited. His public position remains unchanged from 2018.

The exchange revealed a pattern characteristic of methodological disputes in which a method’s developer has strong ownership over it. Each time the argument narrowed to a point where p-curve’s limitations were empirically exposed, the grounds of discussion shifted — from procedural objections, to personal framing, to redefinition of the relevant quantity, to philosophical dismissal of the enterprise itself. The substantive question — which method gives researchers better estimates under realistic conditions — was answered by the simulations but never acknowledged.

Postscript

I was invited to write a tutorial about the differences between p-curve and z-curve in the journal Communication Methods and Measures (2021-2026). My graduate student and I wrote a draft (Schimmack & Soto, 2026). The manuscript shows the simulation results for different levels of heterogeneity (Table 1). Uri Simonsohn was invited to write a commentary and declined to do so.

Table 1

Mean Estimated Replication Rate (ERR), Root Mean Square Error (RMSE), and 95% CI Coverage by Heterogeneity (Tau) and Method

Tau    Criterion    True   Density 2.0   EM2.0   EM3.0   EM3-Norm   P-curve
0.05   Mean Est.      43            44      38      40         40        44
       RMSE                         12      11      10         10        12
       Coverage                     93      82      94         92        92
0.15   Mean Est.      50            49      43      46         45        50
       RMSE                         11      11       9         11        11
       Coverage                     97      93      96         95        92
0.25   Mean Est.      59            57      55      58         57        63
       RMSE                         10      10       9         10        12
       Coverage                     98      91      95         96        79
0.35   Mean Est.      65            64      63      66         67        75
       RMSE                         10      10       9          9        13
       Coverage                     97      94      98         94        67
0.45   Mean Est.      71            68      67      71         72        82
       RMSE                         10       9       8          6        13
       Coverage                     98      96      98         98        58
0.55   Mean Est.      73            71      71      75         76        88
       RMSE                          7       7       8          5        15
       Coverage                     99      95     100         99        35

Note. True = population ERR; Density 2.0 = density-based estimator; EM2.0/EM3.0 = expectation-maximization z-curve variants; EM3-Norm = EM3.0 with normal mixture; P-curve = p-curve power estimator. Coverage = percentage of simulations in which the 95% confidence interval contains the true value; values close to 95 indicate proper calibration, values below 95 indicate that confidence intervals are too narrow.

Erratum: More concerns about the z-curve method

Scientific progress has been slow because humans are not disinterested processors of information. Once they have concluded that some belief is true, their information processing is biased towards verifying that truth rather than looking for disconfirming evidence.

Willful ignorance is the selective processing of confirmatory information and the avoidance of sources that may expose the believer to contradictory information. Sometimes, however, challenging information is unavoidable. Scientists who want to publish their work are constantly exposed to negative comments. Researchers confronted with criticism can respond with strategies that serve different purposes. A constructive response examines the validity of the criticism, responds to valid concerns, adjusts claims accordingly, and may still make a useful contribution. A defensive response to valid criticism relies on pseudo-scientific arguments that avoid the key concern and leads to an unproductive exchange that cannot reach a resolution because the goal is to maintain a false belief.

While critics initiate a discussion about potential errors, the roles are not fixed. Once the criticism is made, the person criticized responds to it and may find errors in the critic's arguments. Now the roles are reversed, and the critic may respond to this criticism in defensive ways, accusing the person being criticized of being defensive. Such an exchange quickly deteriorates into a childish shouting match of "I am right. You are wrong." A more mature response is to allow for errors on both sides and carefully examine the arguments. This is the aim of my response to Erik van Zwet's second blog post about z-curve, "More concerns about z-curve."

The Substance

In this second post, Erik reports one new simulation scenario. In that scenario, he points to two problems. The main criticism is that the confidence interval for the Expected Discovery Rate (EDR) does not achieve its nominal 95% coverage. The second concern is that the confidence interval for the null-component weight can collapse to zero width, which he interprets as a sign of instability or misspecification in the internal mixture fit.

The second point is the less important one. Z-curve is a finite-mixture model that approximates the distribution of test statistics using weights on several discrete components. It is well understood that these component weights are not themselves substantively meaningful parameters when the true data-generating process is continuous. Different mixtures can yield nearly identical estimates of the quantities z-curve is designed to recover. For that reason, poor coverage of confidence intervals for individual component weights is not, by itself, a serious problem. In particular, the weight of the zero component is not used in z-curve the way a null-component weight is used in models that directly estimate false positive rates. These intervals appear in the output, but they are not the primary inferential target.

What matters is coverage for the main estimands: the Expected Replication Rate (ERR) and the Expected Discovery Rate (EDR). Erik does not mention that the ERR interval appears to perform adequately in this scenario. Thus, the central substantive criticism is narrower: in this particular simulation setting, the EDR confidence interval appears to undercover.

The Response

The specific scenario assumed that all studies had the same power, which implies not only the same sample size, but also the same population effect size. Brunner and Schimmack (2020) already noted that z-curve can have problems in this situation when the true noncentrality parameter falls between two default components. That is exactly Erik’s scenario: mean power is 32%, corresponding to z = 1.5, midway between the default components at z = 1 and z = 2.

Brunner and Schimmack (2020) did not emphasize this problem because most real datasets show substantial heterogeneity in sample sizes and effect sizes (van Erp et al., 2017). Even direct replications of the same paradigm across labs vary in effect size (Klein et al., 2017). Thus, Erik's critique is based on a known difficult case for z-curve.2.0, but not one that resembles most real applications.

To address this valid concern, z-curve 3.0 was revised to first test for very low heterogeneity. When the data appear unusually homogeneous, the model estimates where a single component would best fit the distribution and then shifts the default grid so that one component is centered near that value. In Erik’s scenario, this places a component near z = 1.5 instead of forcing the fit to choose between z = 1 and z = 2.
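The idea can be sketched as follows. This is not the actual z-curve 3.0 source; the function name and default grid are illustrative assumptions.

```python
# Sketch of the grid-shifting idea for near-homogeneous data.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def shifted_grid(z_sig, crit=1.96, default_grid=np.arange(0.0, 7.0)):
    # best-fitting single noncentrality for z | z > crit
    nll = lambda m: -(norm.logpdf(z_sig, m, 1) - norm.logsf(crit, m, 1)).sum()
    m_hat = minimize_scalar(nll, bounds=(0, 6), method="bounded").x
    # shift the defaults so the nearest component lands exactly on m_hat
    nearest = default_grid[np.argmin(np.abs(default_grid - m_hat))]
    return default_grid + (m_hat - nearest)

# In Erik's scenario (true noncentrality 1.5), the shifted grid contains a
# component near 1.5 instead of forcing a choice between 1 and 2.
```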

The new results therefore address Erik's specific concern: whether z-curve provides adequate coverage for homogeneous data when the true noncentrality parameter falls between two default components.

I validated z-curve 3.0 with the standard simulation code that was used to validate z-curve.2.0 in the Uli simulation design. Across 192 scenarios, these simulations produced coverage over 95% in most scenarios with as few as 50 significant results. To simulate a noncentrality parameter of z = 1.5, I used a standardized mean difference of d = .30 and a sample size of N = 100, because .30 / (2 / sqrt(100)) = 1.5. Figure 1 shows the results for 50,000 significant results. Based on the model fitted to the significant results, z-curve predicts the distribution of the non-significant results well; the EDR and ERR estimates are accurate, and the confidence intervals are tight.
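The noncentrality arithmetic in compact form (assuming a two-group design with N participants in total, so that the standard error of d is approximately 2 / sqrt(N)):

```python
# Noncentrality implied by a standardized mean difference d and total N:
# z = d / (2 / sqrt(N)).
from math import sqrt

d, N = 0.30, 100
print(d / (2 / sqrt(N)))   # 0.30 / 0.20 = 1.5
```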

Coverage for the ERR and EDR estimates was tested with k = 50, 500, 5,000, and 50,000. All simulations showed coverage over 95% (Results). In short, z-curve.3.0 now also performs well with homogeneous data and can do so quickly with the density method.

In sum, Erik noted that the default method of z-curve.2.0 fails to produce adequate confidence intervals for the EDR estimate in one simulation with homogeneous data and a noncentrality parameter between two default components. I responded to this valid criticism by improving z-curve. Z-curve.3.0 now handles homogeneity and heterogeneity in power well and provides credible confidence intervals.

In the comment section, Erik writes: "Indeed, as I wrote: 'Note that I'm violating the assumption of the z-curve method, but in a way that would be difficult to detect from limited data. That's the point: You can fix this by changing the default "mu grid", but you wouldn't know that.'"

As I showed here, this statement is an error. It is easy to diagnose the problem by estimating the heterogeneity of the data and then adjusting the grid according to a preliminary model that is more consistent with the data. The ability of z-curve.3.0 to work in this scenario shows that the problem is fixable. Thus, Erik's criticism is invalidated by the evidence. Any new evaluations of the z-curve method need to examine the performance of z-curve.3.0.

Gelman’s Type-S Error: A Misunderstanding of Hypothesis Testing

Andrew Gelman is well known for strong opinions about psychological science, including its methods and research culture (Fiske, 2017). For the most part, he writes as if psychologists were still following a statistical ritual that cannot produce meaningful results. This criticism is not new. It was already made by influential psychologists and methodologists, including Cohen (1990, 1994) and Gigerenzer (2004). The problem with Gelman's critique is that it is outdated and largely ignores the discussion of null-hypothesis significance testing that took place in psychology during the 1990s. As evidence for this claim, one can simply inspect the reference list of Gelman and Carlin (2014). This article, published in Perspectives on Psychological Science, does not cite Cohen (1990, 1994), Gigerenzer (2004), or Tukey's directional reformulation of significance testing (Tukey, 1991; Jones & Tukey, 2000). Although an outsider perspective can be useful for challenging untested assumptions, a commentary that ignores key insights produced by eminent statisticians and methodologists within psychology is unlikely to do so.

The Null-Hypothesis Significance Testing Strawman

As Gigerenzer (2004) pointed out, statistics is often taught as a ritual to be followed rather than as a principled approach to drawing conclusions from data. Rituals are not necessarily bad, but in science it is usually better to understand the rationale and assumptions underlying routine practices.

Null-hypothesis significance testing (NHST) has been described and criticized for decades (Tukey, 1991; Cohen, 1994). Most students of psychology will recognize the following brief description of it. First, researchers collect data that relate one variable to another. Ideally, this is an experiment in which one variable is experimentally manipulated (the independent variable) and the other is observed (the dependent variable). In experiments, a relationship between the independent and dependent variable may justify causal claims, but NHST itself is indifferent to causality. It can be applied to both experimental and correlational data. The main information produced by statistical analyses is the p-value. P-values below a conventional threshold are called statistically significant; those above the threshold are treated as not significant (ns). Significant results are easier to publish. As a result, data analysis often becomes a series of statistical tests searching for statistically significant results (Bem, 2010).

This approach to data analysis has been criticized for several reasons. First, statistical significance by itself does not provide information about effect size. For this reason, psychologists have increasingly reported effect-size estimates in addition to tests of statistical significance, in large part due to Cohen’s (1990) emphasis on effect sizes. Second, NHST has been criticized for its focus on statistically significant findings. Psychology journals have long reported rates of over 90% statistically significant results (Sterling, 1959; Sterling et al., 1995). Publication bias in favor of significant results then leads to inflated effect-size estimates (Rosenthal, 1979).

Most importantly, NHST has been criticized because it appears to reject a null hypothesis that is known to be false before any data are collected. Cohen (1994) called this the nil hypothesis. The nil hypothesis assumes that the population effect size is exactly zero. Statistical significance is then taken to imply that this hypothesis is unlikely to be true and can be rejected. The problem is that rejecting one specific possible effect size tells us very little about the data. It would be equally uninformative to test the hypothesis that the effect size equals any other single value, such as Cohen’s d = .20. So what if the effect size can be said not to be 0 or .20? It could still be 0.01 or 1.99. In short, hypothesis testing with a single point as the null hypothesis is meaningless. Yet that is exactly what psychological articles seem to be reporting when they state p < .05.

What Psychological Scientists Are Implicitly Doing

In reality, however, psychological scientists are doing something different. It may look as if they are testing the nil hypothesis, but in practice they are often testing two directional hypotheses at the same time (Kaiser, 1960; Lakens et al., 2025; Tukey, 1991; Jones & Tukey, 2000). When the nil hypothesis is rejected, researchers do not merely conclude that there is a difference. They also inspect the sign of the effect size estimate and infer that the experimental manipulation increased or decreased behavior.

Some authors have argued that drawing directional conclusions from a two-sided test is conceptually problematic (e.g., Rubin, 2020). However, Jones and Tukey (2000) explain the rationale for doing so. The easiest way to see this is to reinterpret the standard nil-hypothesis test as two directional tests with two complementary null hypotheses. One null hypothesis states that the effect size is zero or negative. The other states that the effect size is zero or positive. Rejecting the first leads to the inference that the effect is probably positive. Rejecting the second leads to the inference that the effect is probably negative. Viewed this way, zero is simply the boundary between two rejection regions.

Because NHST can be understood in this way as involving two directional possibilities, alpha must be allocated across both tails to maintain the long-run error rate. No psychology student would be surprised to see a t distribution with 2.5% of the area in each tail. Each tail represents the error rate for one directional rejection, and together they produce the familiar two-sided alpha level of 5%.
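A minimal sketch of this directional reading of a two-sided test:

```python
# Two-sided test read as two directional tests, each at alpha/2.
from scipy.stats import norm

alpha = 0.05
z = 2.3                               # observed test statistic (example value)
crit = norm.ppf(1 - alpha / 2)        # 1.96; 2.5% of alpha in each tail
if z > crit:
    print("infer a positive effect")
elif z < -crit:
    print("infer a negative effect")
else:
    print("sign of the effect remains undetermined")
```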

Most psychology students are not taught that they are implicitly conducting directional tests when they interpret significant p values, but their actual practice shows that this is what they are doing. They routinely draw directional inferences from NHST, and this is a legitimate use of the procedure. It also makes NHST more meaningful than the strawman version in which researchers merely reject an exact value of zero that is often known in advance to be false.

Using NHST to infer the direction of population effects is meaningful because researchers often do not know that direction before data are collected. Empirical data can therefore provide genuinely new information. This is not a full defense of NHST, because effect size and practical importance can still be ignored, but it does show that psychologists have not spent decades and millions of dollars merely to establish that effect sizes are not exactly zero.

Gelman’s Type-S Error

Gelman and Tuerlinckx (2000) criticized NHST because “the significance of comparisons … is calibrated using the Type 1 error rate, relying on the assumption that the true difference is zero, which makes no sense in many applications.” To replace this framework, they proposed focusing on Type S error, where S stands for sign. A Type S error occurs when a researcher makes a confident directional claim even though the true effect has the opposite sign.

The label Type S error is potentially confusing because it suggests a replacement for the Type I error framework rather than a refinement of it. A Type I error is the unconditional long-run probability of falsely rejecting a null hypothesis across all tests that are conducted. For example, suppose a researcher conducts 100 tests with a significance criterion (alpha) of 5%. This criterion ensures that, in the long run, no more than 5% of all tests will be false positives. Testing at least some true effects reduces the probability of a false positive further; in the extreme, if all studies test true effects, no significant result can be a false positive (Sorić, 1989). Thus, alpha bounds the relative frequency of false positives between 0 and alpha.

This unconditional probability must be distinguished from the conditional probability of error among the subset of studies that produced statistically significant results. In the previous example, if only 5 of the 100 results were significant, it is possible that all 5 rejections were errors, so the conditional probability of a false positive given a significant result can be as high as 5 / 5 = 100% (Sorić, 1989). The proportion of false rejections among statistically significant results is called the false discovery rate (FDR), and the estimation and control of FDRs has become a large literature in statistics (Benjamini & Hochberg, 1995).

Applying Jones and Tukey’s interpretation of NHST to false discovery rates, a false discovery occurs not only when the true effect size is zero but also when it is in the opposite direction of the significant result. Gelman’s Type S error rate, also called the false sign rate (Stephens, 2017), assumes that effect sizes are never zero and counts only false rejections with the opposite sign. False sign rates are necessarily smaller than false discovery rates because wrong-sign rejections are only a subset of all false rejections. Exact-zero effects can produce significant results in either direction, whereas nonzero effects make correct-sign rejections more likely and wrong-sign rejections less likely.
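A small simulation sketch (illustrative parameters) makes the false sign rate concrete when true effects are continuous and therefore never exactly zero:

```python
# False sign rate among significant results when true effects are continuous.
# Effects and test statistics are in z-score units; parameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
mu = rng.normal(0.5, 1.0, 1_000_000)   # true effects; exact zeros have measure 0
z = rng.normal(mu, 1.0)                # observed test statistics
sig = np.abs(z) > 1.96                 # two-sided alpha = .05
wrong_sign = np.sign(z[sig]) != np.sign(mu[sig])
print(f"false sign rate among significant results: {wrong_sign.mean():.3f}")
```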

The key source of confusion is that Gelman's criticism of NHST and FDR estimation rests on a misunderstanding of NHST (Gelman, 2021). He maintains that FDR estimates apply only to the unlikely scenario that an effect is exactly zero and that they ignore sign errors. However, as Jones and Tukey (2000) pointed out, psychological researchers routinely use NHST as a directional sign test. Once NHST is understood in this way, Type S errors are no longer a fundamentally new kind of inferential problem; they are already included in conditional and unconditional error rates. Moreover, NHST provides researchers with concrete statistical tools to estimate and control error rates, whereas Gelman's Type S error cannot be estimated and was introduced as a rhetorical tool without practical use (Gelman, 2025; Lakens et al., 2025). In contrast, estimation of false discovery rates and false sign rates is an active area of research in statistics that builds on the foundations of NHST (Benjamini & Hochberg, 1995; Stephens, 2017) and has been largely ignored in psychology.

Statistical Power

So far, the distinction between Type I and (unconditional) Type S errors is mostly harmless. It may even help clarify that NHST is really used as a test of the sign of the population effect size rather than as a literal test of the nil hypothesis (Jones & Tukey, 2000). However, the wheels come off when Gelman and Carlin (2014) extend this critique from Type I error to Type II error and statistical power.

The distinction between Type I and Type II errors was introduced by Neyman and Pearson. A Type II error is the probability of failing to reject a false null hypothesis. Neyman and Pearson were cautious and avoided framing results as inferences about a true effect or as acceptance of a true hypothesis. In practice, however, failure to reject a false hypothesis means that either the population effect is positive and the study failed to produce a statistically significant result with a positive sign, or the population effect is negative and the study failed to produce a statistically significant result with a negative sign.

Statistical power is simply the complementary probability of obtaining a statistically significant result with the correct sign. Unlike the discussion of Type I errors, there is no important distinction here between a point null and an opposite-sign error. Power calculations are inherently directional. Researchers assume either a positive or a negative effect and then choose a design and sample size that reduce sampling error while controlling the Type I error rate. For example, a comparison of two groups with n = 50 per group, a population effect size of half a standard deviation (Cohen’s d = .50), and alpha = .05 has about a 70% probability of producing a statistically significant result with the correct sign.
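The example can be verified with the normal approximation (the exact noncentral-t value is slightly lower; both are about 70%):

```python
# Power arithmetic for the example: two groups of n = 50, true d = .50,
# two-sided alpha = .05, normal approximation.
from math import sqrt
from scipy.stats import norm

n, d, alpha = 50, 0.50, 0.05
se = sqrt(2 / n)                                  # SE of d: 0.20
ncp = d / se                                      # 2.5
power = norm.sf(norm.ppf(1 - alpha / 2) - ncp)    # P(z > 1.96 - 2.5)
print(f"{power:.2f}")                             # ~0.70
```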

By definition, then, power already concerns rejections with the correct sign. At this point, there is no meaningful difference between standard NHST and Gelman's Type S framework (Stephens, 2017). The only minor difference arises in hypothetical scenarios with extremely low power, where two-sided (non-directional) power calculations can count significant results with sign errors. To use NHST as a sign test in Jones and Tukey's framework of two simultaneous one-sided tests, power should be estimated for one-sided directional tests with alpha/2. In practice, however, this distinction is irrelevant because Gelman and Carlin (2014) already showed that even modest power of 50% renders sign errors practically impossible.

Thus, the main concern about Gelman and Carlin’s (2014) article is the false implication that power calculations ignore sign errors and that researchers must move “beyond power” to control them. Grounding NHST in Jones and Tukey’s (2000) framework of two simultaneous directional tests shows that power calculations are not flawed. High power prevents both false negatives and sign errors. Gelman’s critique rests on a false premise: the assumption that NHST is nil-hypothesis testing. Under that assumption, power appears disconnected from sign errors. But once NHST is understood as directional inference, the criticism is invalid. Power analysis is not only useful but essential for controlling sign errors and the false sign rate.

Implications

Gelman positions the Type S error as a new concept that requires moving “beyond power” because “power analysis is flawed” (p. 641). On closer inspection, power analysis is necessary and sufficient to control Type S error rates. Studies with high power ensure that most significant results have the correct sign, and high power also ensures a high discovery rate, which limits the proportion of false discoveries (Sorić, 1989). Power delivers everything needed to make significant results credible. It is paradoxical to criticize psychology for relying on small samples while also criticizing the tool that tells researchers how to avoid them. Cohen’s lasting contribution was precisely this: demonstrating that many studies lack power to detect plausible but small effect sizes and providing the tools to do better (Cohen, 1962).

Gelman and Carlin’s (2014) framing of power as flawed may have added to misunderstandings about the role of power in ensuring credible results. NHST and power analysis are not flawed. They are statistical tools for drawing conclusions about the direction of population effect sizes (Maxwell, Kelley, & Rausch, 2008). It would be desirable to conduct all studies with enough precision to provide informative effect size estimates, but limited resources often make this impossible. Meta-analysis of smaller studies can yield precise estimates, provided results are reported without selection bias. Reporting outcomes regardless of statistical significance is the most effective way to address selection bias, which remains the biggest threat to the credibility of NHST in practice (Sterling, 1959).

The real problem of NHST is not solved by a focus on Type S errors. The real problem is that non-significant results are inconclusive because failure to provide evidence for a positive or negative effect does not allow inferring the absence of an effect (Altman & Bland, 1995). The solution is to distinguish three hypotheses (Rice & Krakauer, 2023): (a) the effect is positive and larger than a smallest effect size of interest, (b) the effect is negative and larger in magnitude than a smallest effect size of interest, and (c) the effect falls within a region of practical equivalence around zero. Evidence for absence is established if the confidence interval falls entirely within the middle region. Replacing the point nil hypothesis with a range of practically equivalent values is an important addition to statistics for psychologists (Lakens, 2017; Lakens, Scheel, & Isager, 2018). It helps distinguish between statistical and practical significance, and it can turn non-significant results into significant evidence for the absence of a meaningful effect. However, providing evidence for absence often requires large samples because precise confidence intervals are needed to fit within a narrow region around zero. Power analysis remains essential for planning studies with this goal.
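A minimal sketch of this three-way decision rule (the equivalence bounds and inputs are illustrative assumptions, and the estimate is treated as approximately normal with known standard error):

```python
# Three-way inference with a region of practical equivalence (ROPE).
from scipy.stats import norm

def three_way(est, se, sesoi=0.1, alpha=0.05):
    half = norm.ppf(1 - alpha / 2) * se
    lo, hi = est - half, est + half          # confidence interval
    if lo > sesoi:
        return "meaningfully positive"
    if hi < -sesoi:
        return "meaningfully negative"
    if -sesoi < lo and hi < sesoi:
        return "practically equivalent to zero"
    return "inconclusive"

print(three_way(0.03, 0.03))   # CI (-0.03, 0.09) lies inside the ROPE
```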

Conclusion

Continued controversy about NHST shows that better education about its underlying logic is needed. Jones and Tukey (2000) provided a clear explanation that deserves to be foundational for the teaching of NHST. Understanding NHST as two simultaneous directional tests avoids the confusion created by decades of criticism directed at a strawman version of the procedure. NHST has persisted for nearly a century despite harsh criticism because it provides a minimal but useful inference: determining the likely sign of a population effect size. Students need to learn about the real limitations of NHST and how they can be addressed. Changing statistical methods does not solve the problem that researchers need to publish and that precise effect size estimates are often out of reach. Even power to infer the sign of an effect is often low. Honest reporting of a single well-powered study is more important than reporting multiple underpowered studies that are p-hacked or selected for significance (Schimmack, 2012). With good data, different statistical approaches lead to the same conclusion. Open science reforms that improve the quality of data are more important than new statistical methods. The main reason NHST continues to attract criticism is that criticism is easy, but finding a better solution is harder. Real progress requires a real analysis of the problem. NHST has many problems, but ignoring sign errors is not one of them.

References

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.

Fiske, S. T. (2017). Going in many right directions, all at once. Perspectives on Psychological Science, 12, 652–655. https://doi.org/10.1177/1745691617706506

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.

Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5(4), 411–414.

Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.