Tag Archives: Pek

An Average Power Primer

An Average Power Primer: Clarifying Misconceptions about Average Power and Replicability

Cohen (1988) introduced power analysis for the planning of studies to reduce false negative (type-II error) rates in psychological science. After the replication crisis, the importance of a priori power analyses has gained increasing attention. However, the estimation of the actual power of studies remains neglected. This article clarifies important differences between power analyses that use hypothetical effect sizes to plan studies and power analyses of studies that have already been completed. Knowing the actual power of completed studies is important because it can be used to assess publication bias. Sets of studies that have high success rates but low power do not provide credible evidence for a hypothesis.

Concerns About Z-Curve: Evidence From New Simulations With Few Studies

Scientific progress depends on criticism, especially when it is used to identify limitations of statistical methods and to improve them. z-curve is no exception. Over the past year, several critiques have raised questions about the robustness of z-curve estimates, particularly with respect to the expected discovery rate (EDR). These critiques deserve careful examination, but they also require accurate characterization of what z-curve assumes, what it estimates, and under which conditions its estimates are informative.

Two recent lines of criticism are worth distinguishing. First, Pek et al. (2025) show that z-curve estimates can be biased when the publication process deviates from the assumed selection model. The default z-curve model assumes that selection operates primarily on statistical significance at the conventional α = .05 threshold (z = 1.96). Pek et al. demonstrate that if researchers also suppress statistically significant results with small effect sizes—for example, not publishing a result with p = .04 because the standardized mean difference is only d = .40—then z-curve estimates can become optimistic. This result is correct: z-curve cannot diagnose selective reporting based on effect size rather than statistical significance.

There is limited direct evidence that routine selection on effect-size magnitude (beyond statistical significance) is widespread; the QRPs most commonly reported in self-surveys are largely significance-focused (John et al., 2012). In any case, imperfect correction is not a reason to ignore selection bias entirely, because uncorrected meta-analyses can markedly overestimate population effects and replicability (Carter et al., 2019).

Moreover, the selection mechanism examined by Pek et al. has a clear directional implication: when statistically significant results are additionally filtered by effect size, z-curve’s estimates of EDR and ERR can be biased upward. This matters for interpretation. If z-curve already yields low EDR or ERR estimates, then the type of misspecification studied by Pek et al. would, if present in practice, imply that the underlying parameters could be even lower. For example, an estimated EDR of 20% under the default selection model could correspond to a substantially lower true discovery rate if significant-but-small effects are systematically suppressed. Whether such effect-size–based suppression is common enough to materially affect typical applications remains an empirical question.
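To make the 20% example concrete, the following minimal sketch converts an EDR into two quantities that follow from it arithmetically: the expected number of nonsignificant results per significant result, and Sorić's upper bound on the false discovery rate. The function names are hypothetical and chosen for illustration; the second EDR value (10%) is an arbitrary lower value, not an estimate from any dataset.

```python
def missing_per_significant(edr: float) -> float:
    """Expected nonsignificant results per significant result, given the EDR."""
    return (1.0 - edr) / edr


def soric_max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the false discovery rate for a discovery rate."""
    return (1.0 / edr - 1.0) * alpha / (1.0 - alpha)


# 0.20 is the example from the text; 0.10 is an arbitrary lower value.
for edr in (0.20, 0.10):
    print(
        f"EDR={edr:.2f}: "
        f"{missing_per_significant(edr):.1f} nonsignificant per significant, "
        f"max FDR={soric_max_fdr(edr):.3f}"
    )
```

If the true discovery rate is lower than the estimate, both quantities move in the pessimistic direction, which is why upward bias in the EDR makes low estimates more worrying rather than less.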

A second critique has been advanced by Erik van Zwet, a biostatistician who has applied models of z-value distributions developed in genomics to meta-analyses of medical trials. These models were designed for settings in which the full set of test statistics is observed and therefore do not incorporate selection bias. When applied to literatures where selection bias is present, such models can yield biased estimates. In contrast, z-curve is explicitly designed to assess the presence of selection bias and to correct for it when it is present. When no bias is present, z-curve can also be fitted to the full distribution of z-values, including non-significant results.

van Zwet has published a few blog posts arguing that z-curve performs poorly when estimating the expected discovery rate (EDR). Importantly, his simulations do not show problems for the expected replication rate (ERR). Thus, z-curve’s ability to estimate the average true power of published significant results is not in question. The disputed issue concerns inference about the broader population of studies, including unpublished nonsignificant results.

Some aspects of this critique require clarification. van Zwet has suggested that z-curve was evaluated only in a small number of simulations. This is incorrect. Prior work includes two large simulation studies—one conducted by František Bartoš and one conducted by me—that examined EDR confidence-interval coverage across a wide range of conditions. Based on these results, the width of the nominal 95% confidence intervals was conservatively expanded by ±5 percentage points to achieve near-nominal coverage across a wide range of realistic scenarios (see details below). Thus, EDR interval estimation was already empirically validated across many conditions with 100 or more significant results.
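The logic of widening an interval and then checking coverage can be sketched as follows. This is only an illustration of the bookkeeping, not the code used in the simulation studies. The interval values are made-up toy numbers, and clipping the widened interval to the range 0 to 1 is an assumption of this sketch.

```python
def widen_interval(lower: float, upper: float, pad: float = 0.05):
    """Expand an interval by `pad` on each side, clipped to [0, 1] (assumption)."""
    return max(0.0, lower - pad), min(1.0, upper + pad)


def coverage(intervals, true_value: float) -> float:
    """Share of intervals that contain the true parameter value."""
    hits = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    return hits / len(intervals)


# Toy intervals for illustration only; these are not simulation results.
raw = [(0.10, 0.28), (0.22, 0.45), (0.31, 0.60), (0.05, 0.27)]
true_edr = 0.30
adjusted = [widen_interval(lo, hi) for lo, hi in raw]

print(coverage(raw, true_edr), coverage(adjusted, true_edr))
```

In a real simulation the same two functions would be applied to thousands of bootstrap intervals per condition, with the true EDR known from the data-generating process.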

However, these simulations did not examine performance of z-curve with small sets of significant results. Because z-curve can technically be fit with as few as 10 significant results, it is reasonable to ask whether EDR confidence-interval coverage remains adequate when the number of significant studies is substantially smaller than 100. To address this question directly, I conducted a new simulation study focusing on the case of 50 significant results.

In addition, I introduced two diagnostics designed to assess when EDR estimation is likely to be weakly identified. Estimation of the EDR relies disproportionately on significant results from low-powered studies or false positives, because these observations provide information about the number of missing nonsignificant results. When nearly all significant results come from highly powered studies, the observed z-value distribution contains little information about what is missing. The first diagnostic therefore counts how many significant z-values fall in the interval from 1.96 to 2.96. Very small counts in this range signal that EDR estimates are driven by limited information. The second diagnostic examines the slope of the z-value density in this interval. A decreasing slope indicates information consistent with a mixture that includes low-powered studies, whereas an increasing slope reflects dominance of high-powered studies and weak identification of the EDR.
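A minimal sketch of the two diagnostics is shown below. The text above specifies the interval (1.96 to 2.96) but not how the slope is computed, so the binned least-squares slope used here is an assumption, as are the function names and the choice of four bins. The example z-values are made up.

```python
def count_just_significant(z_values, low=1.96, high=2.96):
    """Diagnostic 1: number of z-values in the just-significant range."""
    return sum(1 for z in z_values if low <= z <= high)


def slope_in_range(z_values, low=1.96, high=2.96, bins=4):
    """Diagnostic 2 (assumed form): least-squares slope of bin counts
    regressed on bin midpoints within the just-significant range."""
    width = (high - low) / bins
    counts = [0] * bins
    for z in z_values:
        if low <= z <= high:
            # min() keeps z == high (and rounding edge cases) in the last bin
            counts[min(bins - 1, int((z - low) / width))] += 1
    mids = [low + (i + 0.5) * width for i in range(bins)]
    mean_x = sum(mids) / bins
    mean_y = sum(counts) / bins
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mids, counts))
    sxx = sum((x - mean_x) ** 2 for x in mids)
    return sxy / sxx


# Made-up z-values for illustration only.
z = [2.0, 2.05, 2.1, 2.3, 2.4, 2.6, 2.9, 3.5, 4.2]
print(count_just_significant(z), slope_in_range(z))
```

A negative slope corresponds to the "decreasing" case described above; a positive slope would flag a dataset in which the EDR is likely to be weakly identified.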

Reproducible Results of Simulation Study with 50 Significant Results

The simulation used a fully crossed factorial design over four parameters (effect size, heterogeneity, false discovery proportion, and sample size), yielding 192 conditions. Population-level standardized mean differences were set to 0, .2, .4, or .6. Heterogeneity was modeled using normally distributed effect sizes with standard deviations (τ) of 0, .2, .4, or .6. In addition, a separate population of true null studies was included, with the proportion of false discoveries among significant results set to 0, .2, .4, or .6. Sample sizes varied across conditions, starting at n = 50 (25 observations per group). For each condition, simulations were run with exactly 50 significant results.
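To give a sense of what these design values mean for power, the sketch below computes the power of a two-sided test at the smallest sample size (n = 50) for each of the four effect sizes. It relies on assumptions that are not part of the simulation description: a normal approximation to the t-test, the sampling error se = 2/sqrt(N) used as a predictor later in this post, and no heterogeneity or null mixture.

```python
from statistics import NormalDist


def power_two_sided_z(d: float, n_total: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-group comparison via a two-sided z-test."""
    unit = NormalDist()
    se = 2.0 / n_total ** 0.5          # approximate standard error of d
    ncp = d / se                       # expected z-value (noncentrality)
    crit = unit.inv_cdf(1 - alpha / 2) # 1.96 for alpha = .05
    return (1 - unit.cdf(crit - ncp)) + unit.cdf(-crit - ncp)


for d in (0.0, 0.2, 0.4, 0.6):
    print(f"d={d:.1f}, N=50: power={power_two_sided_z(d, 50):.3f}")
```

With d = 0 the function returns alpha, which is the rate at which true null studies contribute significant results to the mixture.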

The simulation code is available here. The results are available here.

Across all scenarios, coverage is 96%. The percentage is higher than the nominal 95% because the conservative adjustment leads to higher coverage in less challenging scenarios.

The slope diagnostic works as expected. When the slope is decreasing, coverage is 97%, but when the slope is increasing it drops to 83%. Increasing slopes are more likely to lead to an overestimation than underestimation of the EDR (75%). Increasing slopes occurred in only 5% of all simulations because these scenarios assume that the majority of studies have over 50% power, which requires large samples and moderate to large effect sizes.

The number of z-values in the range between 1.96 and 2.96 also matters. At least 12 values in this range are needed to have 95% coverage. However, the slope criterion is more diagnostic than the number of z-values in this range.

A logistic regression with CI coverage (yes = 1, no = 0) as the outcome and slope direction, d, SD, sampling error (se = 2/sqrt(N)), and FDR proportion as predictors showed strong effects of slope direction and FDR, as well as a slope direction × FDR interaction. Based on these results, I limited the subsequent analysis to scenarios with decreasing or flat slopes.

The effect of FDR remained significant (b = 3.55, SE = 1.47), as did the main effect of effect size (b = −2.33, SE = 1.01) and the effect size × SD interaction (b = 6.93, SE = 2.99), indicating systematic variation in coverage across conditions.

These effects are explained by how the design parameters shape the distribution of observed z-values in the critical range used to estimate the EDR (1.96–2.96). Higher FDR values imply a larger proportion of true null effects, which produces a steeper declining slope in the truncated z-distribution and increases information about the mass of missing non-significant results. In contrast, larger effect sizes generate a greater share of high-powered studies with z-values well above the truncation point, which reduces the relative influence of marginally significant results and makes the EDR less identifiable from the observed distribution.
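The link between power and the shape of the just-significant range can be checked with a short calculation. The sketch below assumes a single homogeneous component in which z is normally distributed around a noncentrality parameter with unit variance, and it compares the density of absolute z-values at the two ends of the critical range. The function names are hypothetical, and the density is left unnormalized because truncation at 1.96 rescales it without changing the sign of the difference.

```python
from statistics import NormalDist


def abs_z_density(z: float, ncp: float) -> float:
    """Density of |z| at a point z >= 0 when z ~ Normal(ncp, 1)."""
    unit = NormalDist()
    return unit.pdf(z - ncp) + unit.pdf(z + ncp)


def density_change(ncp: float, low: float = 1.96, high: float = 2.96) -> float:
    """Density at the upper end of the range minus density at the lower end.
    Negative values mean the density decreases across the range."""
    return abs_z_density(high, ncp) - abs_z_density(low, ncp)


# ncp = 0 is a true null; ncp = 2 is roughly 50% power; ncp = 4 is high power.
for ncp in (0.0, 2.0, 4.0):
    direction = "decreasing" if density_change(ncp) < 0 else "increasing"
    print(f"ncp={ncp:.0f}: {direction}")
```

Under these assumptions, true nulls and moderately powered studies produce a declining density just above the threshold, whereas a component centered well above 2.96 produces a rising one.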

The significant effect size × SD interaction reflects the moderating role of heterogeneity. When heterogeneity is present, even large average effect sizes produce a mixture of moderate- and high-power studies, increasing the density of z-values near the significance threshold and partially restoring information about missing results. As a consequence, the adverse effect of large average effect sizes on coverage is attenuated when heterogeneity is non-zero.

Overall, the most challenging scenarios for EDR estimation are characterized by low heterogeneity and shallow slopes in the just-significant range. In these settings, the observed z-distribution contains limited information about the unobserved, non-significant portion of the distribution, so EDR is weakly identified from the selected data alone.

Inspection of the 192 design cells indicates that the largest coverage shortfalls are concentrated in homogeneous conditions, especially when SD = 0 and FDR = 0. This limitation of the default discrete mixture approximation under near-homogeneity has been documented previously (Brunner & Schimmack, 2020). In practice, it can be addressed by fitting a homogeneity-appropriate specification, such as a single-component model with a free mean and normally distributed heterogeneity (with SD allowed to approach 0), as implemented in z-curve 3.0.

Restricting attention to scenarios with heterogeneous data (SD ≥ .2), 89% of conditions achieved at least 95% coverage, and only 2 conditions (1.4%) fell below 90% coverage. Thus, even with adjusted confidence intervals, nominal coverage is not guaranteed in all edge cases. The remaining coverage problems arise for two reasons: (a) the selected z-distribution can be nearly uninformative about the amount of missing, non-significant evidence when the just-significant slope is shallow, and (b) the default heterogeneous specification can be misspecified when applied to approximately homogeneous data. A conservative way to handle the lower coverage in some cases is to report the intervals as 85% confidence intervals, even though average coverage is over 95%.

Conclusion

Pek et al. and van Zwet have raised broad concerns about z-curve’s estimates of the expected discovery rate (EDR), which is used to assess publication bias and to quantify the extent of missing nonsignificant results. Their arguments rely heavily on a small set of stylized scenarios. These scenarios do not show that z-curve generally produces untrustworthy results. In contrast, prior large-scale simulation studies, together with the present extension to datasets with only 50 significant results, indicate that z-curve’s EDR confidence intervals achieve near-nominal coverage across a wide range of conditions, including many that are plausible for applied research.

Importantly, the new simulations also validate simple diagnostics that indicate when EDR estimation is likely to be less reliable. In particular, the shape of the significant z-value distribution in the critical range from 1.96 to 2.96—especially whether the density decreases or increases just above the significance threshold—helps identify weak-information regimes in which EDR may be overestimated and confidence-interval coverage may be reduced. Users can therefore compare the near-threshold shape of their observed z-value distribution to the patterns observed in simulation to assess whether EDR should be interpreted cautiously in a given application.

Overall, these results support the conclusion that z-curve provides credible estimates of the expected replication rate (ERR) and, under diagnostically identifiable conditions, the expected discovery rate (EDR), and that these quantities remain useful for evaluating the credibility of literatures in which selection on statistical significance is present.

How to Respond to Criticisms of Z-Curve in Peer-Review

The pattern is now familiar. I received another anonymous review by Reviewer 2 of a z-curve article that repeated Pek’s concerns about the performance of z-curve. To deal with biased reviewers, journals allow authors to name reviewers they consider potentially biased. I suggest doing so for Pek. I also suggest sharing a manuscript with me to ensure proper interpretation of results and to make it “reviewer-safe.”

To justify the claim that Pek is biased, researchers can use this rebuttal of Pek’s unscientific claims about z-curve.

Reviewer 2

Reviewer Report:

The manuscript “A review and z-curve analysis of research on the palliative association of system justification” (Manuscript ID 1598066) extends the work of Sotola and Credé (2022), who used Z-curve analysis to evaluate the evidential value of findings related to system justification theory (SJT). The present paper similarly reports estimates of publication bias, questionable research practices (QRPs), and replication rates in the SJT literature using Z-curve. Evaluating how scientific evidence accumulates in the published literature is unquestionably important.

However, there is growing concern about the performance of meta-analytic forensic tools such as p-curve (Simonsohn, Nelson, & Simmons, 2014; see Morey & Davis-Stober, 2025 for a critique) and Z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022; see Pek et al., in press for a critique). Independent simulation studies increasingly suggest that these methods may perform poorly under realistic conditions, potentially yielding misleading results.

Justification for a theory or method typically requires subjecting it to a severe test (Mayo, 2019) – that is, assuming the opposite of what one seeks to establish (e.g., a null hypothesis of no effect) and demonstrating that this assumption leads to contradiction. In contrast, the simulation work used to support Z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022) relies on affirming belief through confirmation, a well-documented cognitive bias.

Findings from Pek et al. (in press) show that when selection bias is presented in published p-values — the very scenario Z-curve was intended to be applied — estimates of the expected discovery rate (EDR), expected replication rate (ERR), and Sorić’s False Discovery Risk (FDR) are themselves biased.

The magnitude and direction of this bias depend on multiple factors (e.g., number of p-values, selection mechanism of p-values) and cannot be corrected or detected from empirical data alone. The manuscript’s main contribution rests on the assumption that Z-curve yields reasonable estimates of the “reliability of published studies,” operationalized as a high ERR, and that the difference between the observed discovery rate (ODR) and EDR quantifies the extent of QRPs and publication bias.

The paper reports an ERR of .76, 95% CI [.53, .91] and concludes that research on the palliative hypothesis may be more reliable than findings in many other areas of psychology. There are several issues with this claim. First, the assertion that Sotola (2023) validated ERR estimates from the Z-curve reflects confirmation bias – I have not read Röseler (2023) and cannot comment on the argument made in it. The argument rests solely on the descriptive similarly between the ERR produced by Z-curve and the replication rate reported by the Open Science Collaboration (2015). However, no formal test of equivalence was conducted, and no consideration was given to estimate imprecision, potential bias in the estimates, or the conditions under which such agreement might occur by chance.

At minimum, if Z-curve estimates are treated as predicted values, some form of cross-validation or prediction interval should be used to quantify prediction uncertainty. More broadly, because ERR estimates produced by Z-curve are themselves likely biased (as shown in Pek et al., in press), and because the magnitude and direction of this bias are unknown, comparisons about ERR values across literatures do not provide a strong evidential basis for claims about the relative reliability of research areas.

Furthermore, the width of the 95% CI spans roughly half of the bounded parameter space of [0, 1], indicating substantial imprecision. Any claims based on these estimates should thus be contextualized with appropriate caution.

Another key result concerns the comparison of EDR = .52, 95% CO [.14, .92], and ODR = .81, 95% CI = [.69, .90]. The manuscript states that “When these two estimates are highly discrepant, this is consistent with the presence of questionable research practices (QRPS) and publication bias in this area of research (Brunner & Schimmack, 2020).

But in this case, the 95% CIs for the EDR and ODR in this work overlapped quite a bit, meaning that they may not be significantly different…” (p. 22). There are several issues with such a claim. First, Z curve results cannot directly support claims about the presence of QRPs.

The EDR reflects the proportion of significant p values expected under no selection bias, but it does not identify the source of selection bias (e.g., QRPs, fraud, editorial decisions). Using Z curve requires accepting its assumed missing data mechanism—a strong assumption that cannot be empirically validated.

Second, a descriptive comparison between two estimates cannot be interpreted as a formal test of difference (e.g., eyeballing two estimates of means as different does not tell us whether this difference is not driven by sampling variability). Means can be significantly different even if their confidence intervals overlap (Cumming & Finch, 2005).

A formal test of the difference is required. Third, EDR estimates can be biased. Even under ideal conditions, convergence to the population values requires extremely large numbers of studies (e.g., > 3000, see Figure 1 of Pek et al., in press).

The current study only has 64 tests. Thus, even if a formal test of the difference of ODR – EDR was conducted, little confidence could be placed on the result if the EDR estimate is biased and does not reflect the true population value.

Although I am critical of the outputs of Z curve analysis due to its poor statistical performance under realistic conditions, the manuscript has several strengths. These include adherence to good meta analytic practices such as providing a PRISMA flow chart, clearly stating inclusion and exclusion criteria, and verifying the calculation of p values. These aspects could be further strengthened by reporting test–retest reliability (given that a single author coded all studies) and by explicitly defining the population of selected p values. Because there appears to be heterogeneity in the results, a random effects meta analysis may be appropriate, and study level variables (e.g., type of hypothesis or analysis) could be used to explain between study variability. Additionally, the independence of p values has not been clearly addressed; p values may be correlated within articles or across studies. Minor points: The “reliability” of studies should be explicitly defined. The work by Manapat et al. (2022) should be cited in relation to Nagy et al. (2025). The findings of Simmons et al. (2011) applies only to single studies.

However, most research is published in multi-study sets, and follow-up simulations by Wegener at al. (2024) indicate that the Type I error rate is well-controlled when methodological constraints (e.g., same test, same design, same measures) are applied consistently across multiple studies – thus, the concerns of Simmons et al. (2011) pertain to a very small number of published results.

I could not find the reference to Schimmack and Brunner (2023) cited on p. 17.


Rebuttal to Core Claims in Recent Critiques of z-Curve

1. Claim: z-curve “performs poorly under realistic conditions”

Rebuttal

The claim that z-curve “performs poorly under realistic conditions” is not supported by the full body of available evidence. While recent critiques demonstrate that z-curve estimates—particularly EDR—can be biased under specific data-generating and selection mechanisms, these findings do not justify a general conclusion of poor performance.

Z-curve has been evaluated in extensive simulation studies that examined a wide range of empirically plausible scenarios, including heterogeneous power distributions, mixtures of low- and high-powered studies, varying false-positive rates, different degrees of selection for significance, and multiple shapes of observed z-value distributions (e.g., unimodal, right-skewed, and multimodal distributions). These simulations explicitly included sample sizes as low as k ≈ 100, which is typical for applied meta-research in psychology.

Across these conditions, z-curve demonstrated reasonable statistical properties conditional on its assumptions, including interpretable ERR and EDR estimates and confidence intervals with acceptable coverage in most realistic regimes. Importantly, these studies also identified conditions under which estimation becomes less informative—such as when the observed z-value distribution provides little information about missing nonsignificant results—thereby documenting diagnosable scope limits rather than undifferentiated poor performance.

Recent critiques rely primarily on selective adversarial scenarios and extrapolate from these to broad claims about “realistic conditions,” while not engaging with the earlier simulation literature that systematically evaluated z-curve across a much broader parameter space. A balanced scientific assessment therefore supports a more limited conclusion: z-curve has identifiable limitations and scope conditions, but existing simulation evidence does not support the claim that it generally performs poorly under realistic conditions.


2. Claim: Bias in EDR or ERR renders these estimates uninterpretable or misleading

Rebuttal

The critique conflates the possibility of bias with a lack of inferential value. All methods used to evaluate published literatures under selection—including effect-size meta-analysis, selection models, and Bayesian hierarchical approaches—are biased under some violations of their assumptions. The existence of bias therefore does not imply that an estimator is uninformative.

Z-curve explicitly reports uncertainty through bootstrap confidence intervals, which quantify sampling variability and model uncertainty given the observed data. No evidence is presented that z-curve confidence intervals systematically fail to achieve nominal coverage under conditions relevant to applied analyses. The appropriate conclusion is that z-curve estimates must be interpreted conditionally and cautiously, not that they lack statistical meaning.
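For readers unfamiliar with the bootstrap, the following sketch shows the percentile bootstrap for a simple statistic. It does not refit a mixture model and therefore illustrates only the resampling idea, not z-curve's procedure. The data are hypothetical: 52 significant results out of 64 tests, a count chosen only because it gives a proportion of about .81.

```python
import random


def percentile_bootstrap(data, statistic, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for `statistic` computed on `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    lower = estimates[int(tail * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - tail) * n_boot))]
    return lower, upper


def proportion(xs):
    return sum(xs) / len(xs)


# Hypothetical data: 1 = significant, 0 = not significant.
significant = [1] * 52 + [0] * 12
lo, hi = percentile_bootstrap(significant, proportion)
print(f"ODR = {proportion(significant):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The interval width shrinks as the number of tests grows, which is the sense in which a small set of studies affects precision.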


3. Claim: Reliable EDR estimation requires “extremely large” numbers of studies (e.g., >3000)

Rebuttal

This claim overgeneralizes results from specific, highly constrained simulation scenarios. The cited sample sizes correspond to conditions in which the observed data provide little identifying information, not to a general requirement for statistical validity.

In applied statistics, consistency in the limit does not imply that estimates at smaller sample sizes are meaningless; it implies that uncertainty must be acknowledged. In the present application, this uncertainty is explicitly reflected in wide confidence intervals. Small sample sizes therefore affect precision, not validity, and do not justify dismissing the estimates outright.


4. Claim: Differences between ODR and EDR cannot support inferences about selection or questionable research practices

Rebuttal

It is correct that differences between ODR and EDR do not identify the source of selection (e.g., QRPs, editorial decisions, or other mechanisms). However, the critique goes further by implying that such differences lack diagnostic value altogether.

Under the z-curve framework, ODR–EDR discrepancies are interpreted as evidence of selection, not of specific researcher behaviors. This inference is explicitly conditional and does not rely on attributing intent or mechanism. Rejecting this interpretation would require demonstrating that ODR–EDR differences are uninformative even under monotonic selection on statistical significance, which has not been shown.
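One way to see what an ODR–EDR discrepancy implies is to assume the simplest threshold selection model, in which every significant result is published and each nonsignificant result is published with a fixed probability. The sketch below uses the point estimates quoted in the review (EDR = .52, ODR = .81) and ignores their wide confidence intervals. The model and the function names are assumptions made for illustration.

```python
def observed_discovery_rate(edr: float, publish_nonsig: float) -> float:
    """ODR when all significant results and a share `publish_nonsig`
    of nonsignificant results are published."""
    return edr / (edr + (1 - edr) * publish_nonsig)


def implied_publish_rate(edr: float, odr: float) -> float:
    """Publication probability for nonsignificant results implied by EDR and ODR."""
    return edr * (1 - odr) / (odr * (1 - edr))


q = implied_publish_rate(0.52, 0.81)
print(f"implied publication rate of nonsignificant results: {q:.2f}")
print(f"check: ODR = {observed_discovery_rate(0.52, q):.2f}")
```

Under this model, the calculation says how strong the selection is but nothing about who did the selecting, which is the distinction drawn above.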


5. Claim: ERR comparisons across literatures lack evidential basis because bias direction is unknown

Rebuttal

The critique asserts that because ERR estimates may be biased with unknown direction, comparisons across literatures lack evidential value. This conclusion does not follow.

Bias does not eliminate comparative information unless it is shown to be large, variable, and systematically distorting rankings across plausible conditions. No evidence is provided that ERR estimates reverse ordering across literatures or are less informative than alternative metrics. While comparative claims should be interpreted cautiously, caution does not imply the absence of evidential content.


6. Claim: z-curve validation relies on “affirming belief through confirmation”

Rebuttal

This characterization misrepresents the role of simulation studies in statistical methodology. Simulation-based evaluation of estimators under known data-generating processes is the standard approach for assessing bias, variance, and coverage across frequentist and Bayesian methods alike.

Characterizing simulation-based validation as epistemically deficient would apply equally to conventional meta-analysis, selection models, and hierarchical Bayesian approaches. No alternative validation framework is proposed that would avoid reliance on model-based simulation.


7. Implicit claim: Effect-size meta-analysis provides a firmer basis for credibility assessment

Rebuttal

Effect-size meta-analysis addresses a different inferential target. It presupposes that studies estimate commensurable effects of a common hypothesis. In heterogeneous literatures, pooled effect sizes represent averages over substantively distinct estimands and may lack clear interpretation.

Moreover, effect-size meta-analysis does not estimate discovery rates, replication probabilities, or false-positive risk, nor does it model selection unless explicitly extended. No evidence is provided that effect-size meta-analysis offers superior performance for evaluating evidential credibility under selective reporting.


Summary

The critiques correctly identify that z-curve is a model-based method with assumptions and scope conditions. However, they systematically extend these points beyond what the evidence supports by:

  • extrapolating from selective adversarial simulations,
  • conflating potential bias with lack of inferential value,
  • overgeneralizing small-sample limitations,
  • and applying asymmetrical standards relative to conventional methods.

A scientifically justified conclusion is that z-curve provides conditionally informative estimates with quantifiable uncertainty, not that it lacks statistical validity or evidential relevance.


A Response to Pek et al.’s Commentary on Z-Curve: Clarifying the Assumptions of Selection Models

This is the final version of our response to Pek et al.’s criticism of z-curve in Cognition and Emotion that is now accepted for publication. I share it here as the actual response is hidden behind a paywall.

Abstract

Pek et al. (2026) comment on Soto and Schimmack (2025) and raise concerns about the use of z-curve to evaluate the credibility of emotion research. Their central criticism is based on simulations showing that z-curve can overestimate the expected discovery rate when selection operates not only at the level of statistical significance but also within the set of significant results as a function of effect size. This point is correct: if researchers selectively publish larger significant effects while suppressing smaller significant ones, selection models that assume threshold-based filtering can be biased. However, this limitation is not unique to z-curve and applies equally to other selection models used in meta-analysis. More importantly, there is currently little empirical evidence for effect-size bias, while there is ample evidence of selection based on significance. Under these more realistic conditions, z-curve provides informative estimates of (a) selection bias, (b) the expected replication rate, and (c) the false positive risk. Our results also demonstrate substantial inflation of effect size estimates in traditional meta-analyses that ignore selection processes. For these reasons, we reject the recommendation to rely solely on standard meta-analytic approaches and advocate for the use of selection models to obtain more realistic estimates.



Guest Post by Jerry Brunner: Response to an Anonymous Reviewer

Introduction

Jerry Brunner is a recent emeritus from the Department of Statistics at the University of Toronto Mississauga. Jerry first started in psychology, but was frustrated by the unscientific practices he observed in graduate school. He went on to become a professor in statistics. Thus, he is not only an expert in statistics; he also understands the methodological problems in psychology.

Sometime in the wake of the replication crisis around 2014/15, I went to his office to talk to him about power and bias detection. Working with Jerry was educational and motivational. Without him, z-curve would not exist. We spent years trying different methods and thinking about the underlying statistical assumptions. Simulations often shattered our intuitions. The Brunner and Schimmack (2020) article summarizes all of this work.

A few years later, the method is being used to examine the credibility of published articles across different research areas. However, not everybody is happy about a tool that can reveal publication bias, the use of questionable research practices, and a high risk of false positive results. An anonymous reviewer dismissed z-curve results based on a long list of criticisms (Post: Dear Anonymous Reviewer). It was funny to see how ChatGPT responds to these criticisms (Comment). However, the quality of ChatGPT responses is difficult to evaluate. Therefore, I am pleased to share Jerry’s response to the reviewer’s comments here. Let’s just say that the reviewer was wise to make their comments anonymously. Posting the review and the response in public also shows why we need open reviews like the ones published in Meta-Psychology by the reviewers of our z-curve article. Hidden and biased reviews are just one more reason why progress in psychology is so slow.

Jerry Brunner’s Response

This is Jerry Brunner, the “Professor of Statistics” mentioned in the post. I am also co-author of Brunner and Schimmack (2020). Since the review Uli posted is mostly an attack on our joint paper (Brunner and Schimmack, 2020), I thought I’d respond.

First of all, z-curve is sort of a moving target. The method described by Brunner and Schimmack is strictly a way of estimating population mean power based on a random sample of tests that have been selected for statistical significance. I’ll call it z-curve 1.0. The algorithm has evolved over time, and the current z-curve R package (available at https://cran.r-project.org/web/packages/zcurve/index.html) implements a variety of diagnostics based on a sample of p-values. The reviewer’s comments apply to z-curve 1.0, and so do my responses. This is good from my perspective, because I was in on the development of z-curve 1.0, and I believe I understand it pretty well. When I refer to z-curve in the material that follows, I mean z-curve 1.0. I do believe z-curve 1.0 has some limitations, but they do not overlap with the ones suggested by the reviewer.

Here are some quotes from the review, followed by my answers.

(1) “… z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis …”

This is not accurate. Post-hoc power analysis is indeed fatally flawed; z-curve is something quite different. For later reference, in the “observed” power method, sample effect size is used to estimate population effect size for a single study. Estimated effect size is combined with observed sample size to produce an estimated non-centrality parameter for the non-central distribution of the test statistic, and estimated power is calculated from that, as an area under the curve of the non-central distribution. So, the observed power method produces an estimated power for an individual study. These estimates have been found to be too noisy for practical use.
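To make the contrast concrete, here is a minimal sketch of the observed power calculation described above. It assumes a two-group comparison treated as a two-sided z-test with known variance, so the non-centrality parameter is d times the square root of n/2; the function names `z_test_power` and `observed_power` are hypothetical and not taken from any package.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def z_test_power(noncentrality: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test whose statistic is N(noncentrality, 1)."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper = 1 - STD_NORMAL.cdf(crit - noncentrality)  # reject in the upper tail
    lower = STD_NORMAL.cdf(-crit - noncentrality)     # reject in the lower tail
    return upper + lower


def observed_power(sample_d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Plug the SAMPLE effect size into the power formula (assumed z approximation)."""
    estimated_ncp = sample_d * (n_per_group / 2) ** 0.5
    return z_test_power(estimated_ncp, alpha)


# The same design gives very different "observed power" values
# depending on which sample effect size happened to be observed.
for d_hat in (0.2, 0.5, 0.8):
    print(d_hat, round(observed_power(d_hat, n_per_group=20), 3))
```

The loop illustrates why the estimate is noisy for a single study: it moves with the sample effect size, which is itself subject to sampling error.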

The confusion of z-curve with observed power comes up frequently in the reviewer’s comments. To be clear, z-curve does not estimate effect sizes, nor does it produce power estimates for individual studies.

(2) “It should be noted that power is not a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should always be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.”

The reviewer appears to believe that once the results of a study are in, the study no longer has a power. To clear up this misconception, I will describe the model on which z-curve is based.

There is a population of studies, each with its own subject population. One designated significance test will be carried out on the data for each study. Given the subject population, the procedure and design of the study (including sample size), significance level and the statistical test employed, there is a probability of rejecting the null hypothesis. This probability has the usual frequentist interpretation; it’s the long-term relative frequency of rejection based on (hypothetical) repeated sampling from the particular subject population. I will use the term “power” for the probability of rejecting the null hypothesis, whether or not the null hypothesis is exactly true.

Note that the power of the test — again, a member of a population of tests — is a function of the design and procedure of the study, and also of the true state of affairs in the subject population (say, as captured by effect size).

So, every study in the population of studies has a power. It’s the same before any data are collected, and after the data are collected. If the study were replicated exactly with a fresh sample from the same population, the probability of observing significant results would be exactly the power of the study — the true power.

This takes care of the reviewer’s objection, but let me continue describing our model, because the details will be useful later.

For each study in the population of studies, a random sample is drawn from the subject population, and the null hypothesis is tested. The results are either significant, or not. If the results are not significant, they are rejected for publication, or more likely never submitted. They go into the mythical “file drawer,” and are no longer available. The studies that do obtain significant results form a sub-population of the original population of studies. Naturally, each of these studies has a true power value. What z-curve is trying to estimate is the population mean power of the studies with significant results.

So, we draw a random sample from the population of studies with significant results, and use the reported results to estimate population mean power — not of the original population of studies, but only of the subset that obtained significant results. To us, this roughly corresponds to the mean power in a population of published results in a particular field or sub-field.

Note that there are two sources of randomness in the model just described. One arises from the random sampling of studies, and the other from random sampling of subjects within studies. In an appendix containing the theorems, Brunner and Schimmack liken designing a study (and choosing a test) to the manufacture of a biased coin with probability of heads equal to the power. All the coins are tossed, corresponding to running the subjects, collecting the data and carrying out the tests. Then the coins showing tails are discarded. We seek to estimate the mean P(Head) for all the remaining coins.
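The coin model can be sketched as a small simulation. This is only an illustration under assumed conditions: the population of true power values is drawn from a uniform distribution on [0.05, 1], which is an arbitrary choice and not a claim about any real literature, and the function name `simulate_selection` is hypothetical.

```python
import random


def simulate_selection(powers, seed=1):
    """Toss every coin once, keep the heads, then toss the kept coins again."""
    rng = random.Random(seed)
    # Original study: significant with probability equal to its true power.
    published = [p for p in powers if rng.random() < p]
    mean_power_all = sum(powers) / len(powers)
    mean_power_published = sum(published) / len(published)
    # Exact replication of each published study: same true power, fresh sample.
    replication_rate = sum(rng.random() < p for p in published) / len(published)
    return mean_power_all, mean_power_published, replication_rate


population_rng = random.Random(0)
powers = [population_rng.uniform(0.05, 1.0) for _ in range(200_000)]  # assumed population
before, after, replicated = simulate_selection(powers)
print(f"mean power, all studies:       {before:.3f}")
print(f"mean power, published studies: {after:.3f}")
print(f"replication success rate:      {replicated:.3f}")
```

In this sketch the published studies have higher mean power than the full population, and the replication success rate tracks the mean power of the published studies, which is the quantity z-curve 1.0 targets.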

(3) “In Brunner and Schimmack (2020), there is a problem with ‘Theorem 1 states that success rate and mean power are equivalent …’ Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size.”

There is no problem with Theorem 1. The theorem says that in the coin tossing experiment just described, suppose you (1) randomly select a coin from the population, and (2) toss it — so there are two stages of randomness. Then the probability of observing a head is exactly equal to the mean P(Heads) for the entire set of coins. This is pretty cool if you think about it. The theorem makes no use of the concept of effect size. In fact, it’s not directly about estimation at all; it’s actually a well-known result in pure probability, slightly specialized for this setting. The reviewer says “Focusing on observed power …” But why would he or she focus on observed power? We are talking about true power here.
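For readers who want to see why this holds, here is a short sketch for a finite set of coins. The notation is an assumption made for illustration (k coins, coin i has probability of heads \pi_i, each coin equally likely to be drawn) and is not copied from the paper; the step itself is just the law of total probability.

```latex
% Assumed notation: k coins, coin i has P(Head) = \pi_i,
% and each coin is drawn with probability 1/k.
\begin{align*}
\Pr(\text{Head})
  &= \sum_{i=1}^{k} \Pr(\text{coin } i)\, \Pr(\text{Head} \mid \text{coin } i) \\
  &= \sum_{i=1}^{k} \frac{1}{k}\, \pi_i
   \;=\; \bar{\pi}.
\end{align*}
```

Nothing in the argument depends on how the set of coins was assembled, so it applies equally to the subset of coins that showed heads on a first toss.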

(4) “Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution).”

They said it couldn’t be done. Actually, deriving the distribution of the p-value under the alternative hypothesis is a reasonable homework problem for a masters student in statistics. I could give some hints …
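As an illustration of the kind of derivation meant here, consider the simplest case. The setup is an assumption chosen for convenience: a one-sided z-test whose statistic Z is normal with mean \delta and variance 1, with p-value U = 1 - \Phi(Z). Other tests need their own non-central distributions, but the logic is the same.

```latex
% Assumption: one-sided z-test, Z ~ N(\delta, 1), p-value U = 1 - \Phi(Z).
\begin{align*}
\Pr(U \le x)
  &= \Pr\bigl(1 - \Phi(Z) \le x\bigr)
   = \Pr\bigl(Z \ge \Phi^{-1}(1 - x)\bigr) \\
  &= 1 - \Phi\bigl(\Phi^{-1}(1 - x) - \delta\bigr),
     \qquad 0 < x < 1, \\[4pt]
g(x) &= \frac{d}{dx}\Pr(U \le x)
      = \frac{\varphi(z_x - \delta)}{\varphi(z_x)}
      = \exp\!\Bigl(\delta z_x - \tfrac{1}{2}\delta^{2}\Bigr),
      \qquad z_x = \Phi^{-1}(1 - x).
\end{align*}
```

Setting \delta = 0 gives g(x) = 1, the uniform distribution. For \delta > 0 the density is strictly decreasing in x, because z_x decreases as x increases.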

(5) “Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).”

Indeed, z-curve uses only p-values, from which useful estimates of effect size cannot be recovered. As previously stated, z-curve does not estimate power for individual studies. However, the reviewer is aware that p-values have a probability distribution. Intuitively, shouldn’t the distribution of p-values and the distribution of power values be connected in some way? For example, if all the null hypotheses in a population of tests were true so that all power values were equal to 0.05, then the distribution of p-values would be uniform on the interval from zero to one. When the null hypothesis of a test is false, the distribution of the p-value is right skewed and strictly decreasing (except in pathological artificial cases), with more of the probability piling up near zero. If average power were very high, one might expect a distribution with a lot of very small p-values. The point of this is just that the distribution of p-values surely contains some information about the distribution of power values. What z-curve does is to massage a sample of significant p-values to produce an estimate, not of the entire distribution of power after selection, but just of its population mean. It’s not an unreasonable enterprise, in spite of what the reviewer thinks. Also, it works well for large samples of studies. This is confirmed in the simulation studies reported by Brunner and Schimmack.

(6) “The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic.”

Okay, no problem. As described above, every study in the population of studies has its own test statistic, its own true (not estimated) effect size, its own sample size — and therefore its own true power. The relative frequency histogram of these numbers is the true population distribution of power.

(7) “There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figure 1 & 2 in Brunner and Schimmack, 2000, respectively).”

Right. These were examples, illustrating the distribution of power before versus after selection for significance — as given in Theorem 2. Theorem 2 applies to any distribution of true power values.
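The effect of selection on a distribution of true power values can be written down with Bayes' rule. The following is a sketch in assumed notation (f for the density of true power before selection, f_S after selection), not a quotation of Theorem 2; the paper should be consulted for the exact statement.

```latex
% Assumed notation: P is true power with density f before selection;
% a study is selected (significant) with probability equal to its power.
\begin{align*}
f_S(p) &= \frac{p\, f(p)}{\int_0^1 u\, f(u)\, du}, \\[4pt]
E_S[P] &= \frac{E[P^{2}]}{E[P]}
        = E[P] + \frac{\operatorname{Var}(P)}{E[P]} \;\ge\; E[P].
\end{align*}
% Uniform example: f(p) = 1 on (0, 1) gives f_S(p) = 2p,
% so mean power rises from 1/2 before selection to 2/3 after selection.
```

The inequality shows that, in this model, mean power after selection is never lower than mean power before selection, and the two are equal only when all studies have the same power.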

(8) “If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores … it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution …”

The reviewer is objecting to Step 1 of constructing a z-curve estimate, given on page 6 of Brunner and Schimmack (2020). We start with a sample of significant p-values, arising from a variety of statistical tests, various F-tests, chi-squared tests, whatever — all with different sample sizes. Then we pretend that all the tests were actually two-sided z-tests with the results in the predicted direction, equivalent to one-sided z-tests with significance level 0.025. Then we transform the p-values to obtain the z statistics that would have generated them, had they actually been z-tests. Then we do some other stuff to the z statistics.
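The conversion in Step 1 can be sketched in a few lines. This is an illustration of the transformation as described above, not the code of the zcurve package; the function name `p_to_z` and the example p-values are made up.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_to_z(p_two_sided: float) -> float:
    """Absolute z-score that a two-sided z-test would need to produce this p-value."""
    # Use the lower tail and flip the sign; this stays accurate for very small p.
    return -STD_NORMAL.inv_cdf(p_two_sided / 2)


# Hypothetical p-values from a mix of t-, F- and chi-squared tests.
p_values = [0.049, 0.01, 0.001, 0.20]
significant = [p for p in p_values if p < 0.05]   # z-curve 1.0 uses only these
z_scores = [p_to_z(p) for p in significant]
print([round(z, 2) for z in z_scores])
```

Every significant p-value maps to a z-score above the critical value of about 1.96, regardless of which test originally produced it.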

But as the reviewer notes, most of the tests probably are not z-tests. The distributions of their p-values, which depend on the non-central distributions of their test statistics, are different from one another, and also different from the distribution for genuine z-tests. Our paper describes it as an approximation, but why should it be a good approximation? I honestly don’t know, and I have given it a lot of thought. I certainly would not have come up with this idea myself, and when Uli proposed it, I did not think it would work. We both came up with a lot of estimation methods that did not work when we tested them out. But when we tested this one, it was successful. Call it a brilliant leap of intuition on Uli’s part. That’s how I think of it.

Uli’s comment.
It helps to know your history. Well before psychologists focused on effect sizes for meta-analysis, Fisher already had a method to meta-analyze p-values. P-curve is just a meta-analysis of p-values with a selection model. However, p-values have ugly distributions, and Stouffer proposed the transformation of p-values into z-scores to conduct meta-analyses. This method was used by Rosenthal to compute the fail-safe-N, one of the earliest methods to evaluate the credibility of published results (Fail-Safe-N). Ironically, even the p-curve app started using this transformation (p-curve changes). Thus, p-curve is really a version of z-curve. The problem with p-curve is that it has only one parameter and cannot model heterogeneity in true power. This is the key advantage of z-curve 1.0 over p-curve (Brunner & Schimmack, 2020). P-curve is even biased when all studies have the same population effect size, but different sample sizes, which leads to heterogeneity in power (Brunner, 2018).

Such things are fairly common in statistics. An idea is proposed, and it seems to work. There’s a “proof,” or at least an argument for the method, but the proof does not hold up. Later on, somebody figures out how to fill in the missing technical details. A good example is Cox’s proportional hazards regression model in survival analysis. It worked great in a large number of simulation studies, and was widely used in practice. Cox’s mathematical justification was weak. The justification starts out being intuitively reasonable but not quite rigorous, and then deteriorates. I have taught this material, and it’s not a pleasant experience. People used the method anyway. Then decades after it was proposed by Cox, somebody else (Aalen and others) proved everything using a very different and advanced set of mathematical tools. The clean justification was too advanced for my students.

Another example (from mathematics) is Fermat’s last theorem, which took over 300 years to prove. I’m not saying that z-curve is in the same league as Fermat’s last theorem, just that statistical methods can be successful and essentially correct before anyone has been able to provide a rigorous justification.

Still, this is one place where the reviewer is not completely mixed up.

Another Uli comment
Undergraduate students are often taught different test statistics and distributions as if they are totally different. However, most tests in psychology are practically z-tests. Just look at a t-distribution with N = 40 (df = 38) and try to see the difference from a standard normal distribution. The difference is tiny and becomes invisible when you increase sample sizes above 40! And F-tests? F-values with 1 numerator degree of freedom are just squared t-values, so the square root of these is practically a z-test. But what about chi-square? Well, with 1 df, chi-square is just a squared z-score, so we can use the square root and have a z-score. But what if we don’t have two groups, but compute correlations or regressions? Well, the statistical significance test uses the t-distribution and sample sizes are often well above 40. So, t and z are practically identical. It is therefore not surprising to me that empirical results with different test statistics can be approximated with the standard normal distribution. We could make teaching statistics so much easier, instead of confusing students with F-distributions. The only exceptions are complex designs with 3 x 4 x 5 ANOVAs, but they don’t really test anything and are just used to p-hack. Rant over. Back to Jerry.
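The comparison suggested in this comment can be checked numerically. The sketch below uses only the Python standard library; the helper names `t_pdf`, `max_pdf_gap` and `chi2_1df_p` are hypothetical. It compares densities only, not critical values, and it uses the fact that a chi-square with 1 df is a squared z-score.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def max_pdf_gap(df: float, lo: float = -5.0, hi: float = 5.0, steps: int = 1001) -> float:
    """Largest absolute difference between the t and standard normal densities on a grid."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(t_pdf(x, df) - STD_NORMAL.pdf(x)) for x in grid)


def chi2_1df_p(x: float) -> float:
    """Upper-tail p-value of a chi-square with 1 df, via the square-root z-score."""
    return 2 * (1 - STD_NORMAL.cdf(math.sqrt(x)))


for df in (5, 38, 200):
    print(df, round(max_pdf_gap(df), 4))
print(round(chi2_1df_p(1.96 ** 2), 4))
```

The gap between the densities shrinks as the degrees of freedom grow, and the chi-square p-value at 1.96 squared matches the two-sided z-test p-value at 1.96.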

(9) “It is unclear how Theorem 2 is related to the z-curve procedure.”

Theorem 2 is about how selection for significance affects the probability distribution of true power values. Z-curve estimates are based only on studies that have achieved significant results; the others are hidden, by a process that can be called publication bias. There is a fundamental distinction between the original population of power values and the sub-population belonging to studies that produce significant results. The theorems in the appendix are intended to clarify that distinction. The reviewer believes that once significance has been observed, the studies in question no longer even have true power values. So, clarification would seem to be necessary.

(10) “In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant?”

If there were no selection for significance, this is what a reasonable person would do. But the point of the paper, and what makes the estimation problem challenging, is that all we can observe are statistics from studies with p < 0.05. Publication bias is real, and z-curve is designed to allow for it.

(11) “To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error …”

Wrong. Power is a feature of the design of a study, the significance test, and the subject population. All of these features still exist after data have been collected and the test is carried out.

Uli and Jerry comment:
Whenever a psychologist uses the word “ontological,” be very skeptical. Most psychologists who use the word understand philosophy as well as this reviewer understands statistics.

(12) “The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice.”

Observed power (a bad estimate of true power) is not the probability of significance upon exact replication. True power is the probability of significance upon exact replication. It’s based on true effect size, not estimated effect size. We were talking about true power, and we mistakenly thought that was obvious.

(13) “The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science.) In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.”

This criticism would have been somewhat justified if the simulations had used p-values from a bunch of z-tests. However, they did not. The simulations reported in the paper are all F-tests with one numerator degree of freedom, and denominator degrees of freedom depending on the sample size. This covers all the tests of individual regression coefficients in multiple regression, as well as comparisons of two means using two-sample (and even matched) t-tests. Brunner and Schimmack say (p. 8):

“Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.”

So I was going to refer the reader (and the anonymous reviewer, who is probably not reading this post anyway) to the supplementary materials. Fortunately I checked first, and found that the supplementary materials include a bunch of OSF stuff like the letter submitting the article for publication, and the reviewers’ comments and so on — but not the full set of simulations. Oops.

All the code and the full set of simulation results are posted at

https://www.utstat.utoronto.ca/brunner/zcurve2018

You can download all the material in a single file at

https://www.utstat.utoronto.ca/brunner/zcurve2018.zip

After expanding, just open index.html in a browser.

Actually we did a lot more simulation studies than this, but you have to draw the line somewhere. The point is that z-curve performs well for large numbers of studies with chi-squared test statistics as well as F statistics — all with varying degrees of freedom.

(14) “The simulation study was conducted for the performance of the z-curve on constrained scenarios including F-tests with df = 1 and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.”

Now the reviewer is talking about the paper that was actually under review. The mistake is natural, because of our (my) error in not making sure that the full set of simulations was included in the supplementary materials. The conditions in question are not unexplored; they are thoroughly explored, and the accuracy of z-curve for large samples is confirmed.

(15+) There are some more comments by the reviewer, but these are strictly about the paper under review, and not about Brunner and Schimmack (2020). So, I will leave any further response to others.

Dear Anonymous Reviewer…

Peer-review is the foundation of science. Peer-reviewers work hard to evaluate manuscripts to see whether they are worthy of being published, especially in old-fashioned journals with strict page limitations. Their hard work often goes unnoticed because peer-reviews remain unpublished. This is a shame. A few journals have recognized that science might benefit from publishing reviews. Not all reviews are worthy of publication, but when a reviewer spends hours, if not days, to write a long and detailed comment, it seems only fair to share the fruits of their labor in public. Unfortunately, I am not able to give credit to Reviewer 1 who was too modest or shy to share their name. This does not undermine the value they created and I hope the reviewer may find the courage to take credit for their work.

Reviewer 1 was asked to review a paper that used z-curve to evaluate the credibility of research published in the leading emotion journals. Yet, going beyond the assigned task, Reviewer 1 gave a detailed and thorough review of the z-curve method that showed the deep flaws of this statistical method that had been missed by reviewers of articles that promoted this dangerous and misleading tool. After a theoretical deep-dive into the ontology of z-curve, Reviewer 1 points out that simulation studies seem to have validated the method. Yet, Reviewer 1 was quick to notice that the simulations were a sham and designed to show that z-curve works rather than to see it fail in applications to more realistic data. Deeply embarrassed, my co-authors, including a Professor of Statistics, are now contacting journals to retract our flawed articles.

Please find the damaging review of z-curve below.

P.S. We are also offering a $200 reward for credible simulation studies that demonstrate that z-curve is crap.

P.P.S. Some readers seem to have missed the sarcasm and taken the criticism by Reviewer 1 seriously. The problem is a lack of expertise to evaluate the conflicting claims. To make it easy, I share an independent paper that validated z-curve with actual replication outcomes. Not sure how Reviewer 1 would explain the positive outcome. Maybe we hacked the replication studies, too?

Röseler, Lukas, 2023. “Predicting Replication Rates with Z-Curve: A Brief Exploratory Validation Study Using the Replication Database,” MetaArXiv ewb2t, Center for Open Science.

ANONYMOUS, July 17, 2024

Referee: 1

Comments to the Author
The manuscript “Credibility of results in emotion science: A z-curve analysis of results in the journals Cognition & Emotion and Emotion” (CEM-DA.24) presents results from a z-curve analysis on reported statistics (t-tests, F-tests, and chi-square tests with df < 6 and 95% confidence intervals) for empirical studies (excluding meta-analysis) published in Cognition & Emotion from 1987 to 2023 and Emotion from 2001 to 2023. The purposes of reporting results from a z-curve analysis are to (a) estimate selection bias in emotion research and (b) predict a success rate in replication studies.

I have strong reservations about the conclusions drawn by the authors that do not seem to be strongly supported by their reported results. Specifically, I am not confident that conclusions from z-curve results justify the statements made in the paper under review. Below, I outline the main concerns that center on the z-curve methodology that unfortunately focuses on providing a review on Brunner and Schimmack (2020) and not so much on the current paper.

VAGUE METHODOLOGY. The authors make strong claims about what conclusions can be drawn from z-curve analyses. Their presentation of z-curve analysis in the present paper is declarative and does not provide the necessary information to describe the assumptions made by the method, how it works, when it fails, etc. The authors cite previous publications on z-curve (Brunner & Schimmack, 2020; Bartos & Schimmack, 2022; Schimmack & Bartos, 2023). Furthermore, this work ignores recent criticism in the literature about such statistical forensics. One example questioning the validity of conclusions by tests of credibility/replicability (e.g., p-curve, Francis’s [2013] consistency test) is in a talk by Richard Morey titled “Statistical games: Flawed thinking of popular methods for assessing reproducibility” (https://www.youtube.com/watch?v=c0G98qp1cf4). The talk was based on Morey (2013). Other authors who have written on this topic include McShane, Böckenholt, and Hansen (2020) and Pek, Hoisington-Shaw, & Wegener (2022).

==
Morey, R. D. (2013). The consistency test does not–and cannot–deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology, 57(5), 180-183. https://doi.org/10.1016/j.jmp.2013.03.004

McShane, B. B., Böckenholt, U., & Hansen, K. T. (2020). Average Power: A Cautionary Note. Advances in Methods and Practices in Psychological Science, 3(2), 185–199. https://doi.org/10.1177/2515245920902370

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153-169. https://doi.org/10.1016/j.jmp.2013.02.003

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2022). Avoiding Questionable Research Practices Surrounding Statistical Power Analysis. In W. O’Donohue, A. Masuda, & S. Lilienfeld (Eds.), Avoiding Questionable Practices in Applied Psychology (pp. 243–267). Springer. https://doi.org/10.1007/978-3-031-04968-2_11
==

In reading Brunner and Schimmack (2020), z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis (whether it be for a single study or for a set of studies; see Pek, Hoisington-Shaw, & Wegener, in press for a treatment of this misconception).

It should be noted that power is *not* a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should *always* be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://can01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdoi.org%2F10.1007%2Fs10654-016-0149-3&data=05%7C02%7Cmaria.soto%40mail.utoronto.ca%7C263f8a3e995646af4a6d08dca5ecb390%7C78aac2262f034b4d9037b46d56c55210%7C0%7C0%7C638567683245588763%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=iKLnBvCg0BGd8l9x%2BZii7O%2BKapURRjoWn8rKZpTxHDw%3D&reserved=0

O’Keefe, D. J. (2007). Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: Sorting out appropriate uses of statistical power analyses. Communication Methods and Measures, 1(4), 291–299. https://doi.org/10.1080/19312450701641375
==

In Brunner and Schimmack (2020), there is a problem with “Theorem 1 states that success rate and mean power are equivalent even if the set of coins is a subset of all coins.” Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size. Methodological papers that deal with power analysis making use of estimated effect sizes show that the uncertainty due to sampling variability is extremely high (e.g., see Anderson et al., 2017; McShane & Böckenholt, 2016); it is worse when effects are random (cf. random effects meta-analysis; see McShane, Böckenholt, & Hansen, 2020; Pek, Pitt, & Wegener, 2024). Accepting that effects are random seems to be more consistent with what we observe in empirical results on the same topic. The extent of uncertainty in power estimates (based on observed effects) is so high that little can be concluded from such imprecise calculations.
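The size of this sampling variability can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not in the original papers: a z-test with known standard error, so the observed test statistic is normal with unit variance around a true noncentrality of 2.0, and "observed power" is computed by plugging each observed statistic back into the power formula. The names `power_from_ncp` and `observed_power_draws` are hypothetical.

```python
import random
from statistics import NormalDist, quantiles

STD = NormalDist()
Z_CRIT = STD.inv_cdf(0.975)  # two-sided test at alpha = .05


def power_from_ncp(ncp):
    """Two-sided power when the test statistic is N(ncp, 1)."""
    return (1 - STD.cdf(Z_CRIT - ncp)) + STD.cdf(-Z_CRIT - ncp)


def observed_power_draws(true_ncp, n_studies, seed=1):
    """Observed power for each simulated study of the same true effect."""
    rng = random.Random(seed)
    return [power_from_ncp(rng.gauss(true_ncp, 1.0)) for _ in range(n_studies)]


true_ncp = 2.0  # assumed true noncentrality, chosen only for illustration
draws = observed_power_draws(true_ncp, 10_000)
deciles = quantiles(draws, n=10)  # nine cut points: 10th, ..., 90th percentile
print(f"true power: {power_from_ncp(true_ncp):.2f}")
print(f"10th to 90th percentile of observed power: "
      f"{deciles[0]:.2f} to {deciles[-1]:.2f}")
```

Even with a fixed true effect, the middle 80% of single-study observed power values spans most of the unit interval in this setup.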

Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution). However, because p-values have sampling variability (and an unknown sampling distribution), one cannot take a significant p-value to deterministically indicate a tally on power (which assumes that an unknown specific effect size is true). Stated differently, a significant p-value can be consistent with a Type I error. Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).
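The null case mentioned above can be checked with a short simulation. This sketch assumes two-sided z-tests of a true null hypothesis; the function name `null_p_values` and the seed are illustrative choices. It shows that p-values are then approximately uniform, so about 5% of them fall below .05 even though no effect exists.

```python
import random
from statistics import NormalDist

STD = NormalDist()


def null_p_values(n_studies, seed=7):
    """Two-sided p-values from z-tests when the null hypothesis is true."""
    rng = random.Random(seed)
    return [2 * (1 - STD.cdf(abs(rng.gauss(0.0, 1.0)))) for _ in range(n_studies)]


p_values = null_p_values(20_000)
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
share_below_half = sum(p < 0.50 for p in p_values) / len(p_values)
print(f"share with p < .05: {share_significant:.3f}")
print(f"share with p < .50: {share_below_half:.3f}")
```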

There seems to be some conceptual slippage on the meaning of power here because what the authors call power does not seem to have the defining features of power.

The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic. There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figures 1 and 2 in Brunner and Schimmack, 2020, respectively). If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution, because the z-score is derived from a normal distribution; it is not the transformation of a p-value that will result in a z-distribution of z-scores. P-values are statistics and follow a sampling distribution; the variance of the sampling distribution is a function of sample size. So, it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution. If the further argument is using a mixture of z-distributions to estimate the distribution of the z-scores, then these z-scores are not technically z-scores in that they are not distributed following the z-distribution. We might estimate the standard error of the mixture of z-distributions to rescale the distribution again to a z-distribution… but to what end? Again, there is some conceptual slippage in what is meant by a z-score. If the distribution of p-values that have been transformed to a z-score is not a z-distribution, and the mixture distribution is then shaped back into a z-distribution (with truncations that seem arbitrary) so that the critical value of 1.96 can be used, I’m not sure what the resulting distribution is of, anymore.

A related point is that we do not yet know whether p-values are transformation invariant (in distribution) under a z-score transformation. Furthermore, the distribution for power invoked in Theorem 1 is not a function of sample size, effect size, or statistical procedure, suggesting that the assumed distribution does not align well with the features that we know influence power. It is unclear how Theorem 2 is related to the z-curve procedure. Again, there seems to be some conceptual slippage involved with p-values being transformed into z-scores that somehow give us an estimate of power (without stating the effect size, sample size, or procedure).

In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant? After all, p-values can be transformed to z-scores and vice-versa in that they carry the same information. But then, there is a problem of p-values having sampling variability and might be consistent with Type I error. A transformation from p to z will not fix sampling variability.

To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error about the meaning of frequentist probability. A thought experiment might help. Suppose I completed a study, and the p-value is .50. I convert this p-value to a z-score for a two-tailed test and get 0.67. Let’s say I collect a bunch of studies and do this and get a distribution of z-scores (that don’t end up being distributed z). I do a bunch of things to make this distribution become a z-distribution. Then, I define power as the proportion of z-scores above the cutoff of 1.96. We are now calling power a collection of z-scores above 1.96 (without controlling for sample size, effect size, and procedure). This newly defined “power” based on the z-distribution does not reflect the original definition of power (area under the curve for a specific effect size, a specific procedure, and a specific sample size, assuming the Type I error is .05). This conceptual slippage is akin to burning a piece of wood, putting the ashes into a mold that looks like wood, and calling the molded ashes wood.
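The conversion used in the thought experiment can be reproduced with the inverse of the standard normal distribution function. This is a minimal sketch of the usual two-tailed mapping between a p-value and an absolute z-score; the helper names `p_to_abs_z` and `abs_z_to_p` are hypothetical and are not taken from the z-curve software.

```python
from statistics import NormalDist

STD = NormalDist()


def p_to_abs_z(p_two_tailed):
    """Absolute z-score whose two-tailed tail area equals the p-value."""
    return STD.inv_cdf(1 - p_two_tailed / 2)


def abs_z_to_p(z):
    """Two-tailed p-value for a given z-score."""
    return 2 * (1 - STD.cdf(abs(z)))


print(round(p_to_abs_z(0.50), 2))  # 0.67, as in the thought experiment
print(round(p_to_abs_z(0.05), 2))  # 1.96, the conventional cutoff
```

The mapping is one-to-one, so the transformed value carries exactly the information in the p-value and nothing about sample size or effect size on its own.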

The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice. Furthermore, recall that power is a property of a procedure and is not a property of completed data (cf. ontological error), thus using observed power to quantify replicability presents replicability as a property of a procedure and not about the robustness of an observed effect. Again, there seems to be some conceptual slippage occurring here on what is meant by replication versus what is quantifying replication (which should not be observed power).

The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science). In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.

Further, the evidence that z-curve performs well is specific to the assumptions within the simulation study. If p-values were generated in a different way to reflect a competing tentative process, the performance of the z-curve would be different. The simulation study was conducted for the performance of the z-curve on constrained scenarios including F-tests with df = 1 and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.

==
Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724

Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33(2), 175–183. https://doi.org/10.1111/j.1469-8986.1996.tb02121.x

McShane, B. B., & Böckenholt, U. (2016). Planning sample sizes when effect sizes are uncertain: The power-calibrated effect size approach. Psychological Methods, 21(1), 47–60. https://doi.org/10.1037/met0000036

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (in press). Uses of uncertain statistical power: Designing future studies, not evaluating completed studies. Psychological Methods. https://www.researchgate.net/publication/368358276_Uses_of_uncertain_statistical_power_Designing_future_studies_not_evaluating_completed_studies

Pek, J., Pitt, M. A., & Wegener, D. T. (2024). Uncertainty limits the use of power analysis. Journal of Experimental Psychology: General, 153(4), 1139–1151. https://doi.org/10.1037/xge0001273
==

INPUT DATA. The authors made use of statistics reported in empirical research published in Cognition & Emotion and Emotion. Often, articles might report several studies, and studies would have several models, and models would contain several tests. Thus, there might be a nested structure of tests nested within models, models nested within studies, and studies nested within articles. It does not seem that this nesting is taken into account to provide a good estimate of selection bias and the expected replication rate. Thus, the estimates provided cannot be deemed unbiased (e.g., estimates would be biased toward articles that tend to report a lot of statistics compared to others).
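The weighting concern can be illustrated with a toy calculation. The records below are entirely hypothetical (three made-up articles, one of which reports many more tests than the others), and the sketch compares only a simple share of significant results under two weighting schemes; it is not the z-curve estimator, and the function names are assumptions.

```python
from collections import defaultdict

# Hypothetical records: (article_id, test_was_significant).
tests = (
    [("A", True)] * 18 + [("A", False)] * 2   # article A reports 20 tests
    + [("B", True), ("B", False)]             # article B reports 2 tests
    + [("C", True), ("C", False)]             # article C reports 2 tests
)


def pooled_rate(records):
    """Share of significant tests, treating every test as independent."""
    return sum(sig for _, sig in records) / len(records)


def article_weighted_rate(records):
    """Mean of per-article shares, so each article counts once."""
    by_article = defaultdict(list)
    for article, sig in records:
        by_article[article].append(sig)
    per_article = [sum(flags) / len(flags) for flags in by_article.values()]
    return sum(per_article) / len(per_article)


print(f"pooled over tests:    {pooled_rate(tests):.2f}")
print(f"averaged by article:  {article_weighted_rate(tests):.2f}")
```

In this made-up example the pooled figure is dominated by the one article that reports the most tests, which is the kind of bias a model of the nesting would be meant to address.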

As the authors admit, there is no separation of statistical tests used for manipulation checks, preliminary analyses, or tests of competing and alternative hypotheses. Given that the sampling of the statistics might not be representative of key findings in emotion research, little confidence can be placed in the accuracy of the estimates reported and the strong claims being made using them (about emotion research in general).

Finally, the authors excluded chi-square tests with degrees of freedom larger than 6. This would mean that tests of independence with designs larger than a 2×2 contingency table would be excluded (or tests of independence with 6 categories). In general, the authors need to be careful about the conditions to which their conclusions apply.

UNSUBSTANTIATED CONCLUSIONS. The key conclusions made by the authors are that there is selection bias in emotion research, and there is a success rate of 70% in replication studies. These conclusions are made from z-curve analysis, whose validity I question. My concerns about the z-curve procedure have to do with ontological errors made about the probability attached to the concept of power, the rationale for z-transformations on p-values (along with strange distributional gymnastics with little justification provided in the original paper), and equating power with replication.

Even if the z-curve is valid, the performance of z-curve should be better evaluated to show that it applies to the conditions of the data used in the current study. Furthermore, improving the quality of the data used in z-curve analysis, in terms of selection criteria (e.g., excluding tests for manipulation checks) and modeling the nested structure inherent in reported results, would go a long way in ensuring that the estimate provided is as unbiased as can be.

Finally, it seems odd to conclude selection bias based on data with selection bias. There might be some tautology going on within the argument. An analogy about missing data might help. Given a set of data that we assume has undergone selection (i.e., part of the distribution is missing), how can we know from the data what is missing? The only way to talk about the missing part of the distribution is to assume a distribution for the “full” data that subsumes the observed data distribution. But who can say that the assumed distribution is the correct one that would have generated the full data? Our selected data do not have the features to let us infer what the full distribution should be. How can we know that what we observe has undergone selection bias without knowledge of the selection process (cf. distribution of the full data), unless some implicit assumption is made? We are not given the assumption and therefore cannot evaluate whether the assumption is valid. I cannot tell what assumptions z-curve makes about selection.