
Willful Incompetence: Repeating False Claims Does Not Make them True

“Windmills are evil” (Don Quixote cited by Trump)

“Zcurve is made by the devil” (Pek et al., 2024)

Preamble

Ideal conceptions of science have a set of rules that help to distinguish beliefs from knowledge. Actual science is a game with few rules. Anything goes (Feyerabend), if you can sell it to an editor of a peer-reviewed journal. US American psychologists also conflate the meaning of freedom in “freedom of speech” and “academic freedom,” assuming that there are no standards for truth in science, just as there are none in American politics. The game is to get more publications, citations, views, and clicks, and truth is decided by the winner of popularity contests. Well, not to be outdone in this war, I am posting yet another blog post about Pek’s quixotic attacks on z-curve.

For context, Pek has already received an F from statistics professor Jerry Brunner for her nonsensical attacks on a statistical method (Brunner, 2024), but even criticism by a professor of statistics has not deterred her from repeating misinformation about z-curve. I call this willful incompetence: the inability to listen to feedback and to wonder whether somebody else could have more expertise than yourself. This is not to be confused with the Dunning-Kruger effect, where people have no feedback about their failures. Here, failures are repeated again and again, despite strong feedback that errors are being made.

Context

One of the editors of Cognition and Emotion, Sander Koole, has been following our work and encouraged us to submit our work on the credibility of emotion research as an article to Cognition & Emotion. We were happy to do so. The manuscript was handled by the other editor, Klaus Rothermund. In the first round of reviews, we received a factually incorrect and hostile review by an anonymous reviewer. We were able to address these false criticisms of z-curve and resubmitted the manuscript. In a new round of reviews, the hostile reviewer came up with simulation studies that showed z-curve fails. We showed that this is indeed the case in simulations of studies with N = 3 and 2 degrees of freedom. The problem here is not z-curve, but the transformation of t-values into z-values. When degrees of freedom are like those in the published literature we examined, this is not a problem. The article was finally accepted, but the hostile reviewer was allowed to write a commentary. At least, it was now clear that the hostile reviewer was Pek.

I found out that the commentary had apparently been accepted for publication when somebody sent me the link to it on ResearchGate along with a friendly offer to help with a rebuttal. However, I could not wait and drafted a rebuttal with the help of ChatGPT. Importantly, I use ChatGPT to fact-check claims and control my emotions, not to write for me. Below you can find a clear, point-by-point response to all the factually incorrect claims about z-curve made by Pek et al. that passed whatever counts as human peer review at Cognition and Emotion.

Rebuttal

Abstract

1. What is the Expected Discovery Rate?

“EDR also lacks a clear interpretation in relation to credibility because it reflects both the average pre-data power of tests and the estimated average population effect size for studied effects.”

This sentence is unclear and introduces several poorly defined or conflated concepts. In particular, it confuses the meaning of the Expected Discovery Rate (EDR) and misrepresents what z-curve is designed to estimate.

A clear and correct definition of the Expected Discovery Rate (EDR) is that it is an estimate of the average true power of a set of studies. Each empirical study has an unknown population effect size and is subject to sampling error. The observed effect size is therefore a function of these two components. In standard null-hypothesis significance testing, the observed effect size is converted into a test statistic and a p-value, and the null hypothesis is rejected when the p-value falls below a prespecified criterion, typically α = .05.

Hypothetically, if the population effect size were known, one could specify the sampling distribution of the test statistic and compute the probability that the study would yield a statistically significant result—that is, its power (Cohen, 1988). The difficulty, of course, is that the true population effect size is unknown. However, when one considers a large set of studies, the distribution of observed p-values (or equivalently, z-values) provides information about the average true power of those studies. This is the quantity that z-curve seeks to estimate.

Average true power predicts the proportion of statistically significant results that should be observed in an actual body of studies (Brunner & Schimmack, 2020), in much the same way that the probability of heads predicts the proportion of heads in a long series of coin flips. The realized outcome will deviate from this expectation due to sampling error—for example, flipping a fair coin 100 times will rarely yield exactly 50 heads—but large deviations from the expected proportion would indicate that the assumed probability is incorrect. Analogously, if a set of studies has an average true power of 80%, the observed discovery rate should be close to 80%. Substantially lower rates imply that the true power of the studies is lower than assumed.
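To make the coin-flip logic concrete, here is a minimal simulation of a heterogeneous set of two-sample studies (the effect sizes and sample size are illustrative, not taken from any real literature); the proportion of significant results closely tracks the average true power of the set:

```python
import numpy as np
from scipy import stats

# Hypothetical heterogeneous literature of two-sample t-tests
# (true effects d and per-group n are illustrative, not from any real data).
k, n = 10_000, 50
d = np.repeat([0.0, 0.2, 0.5, 0.8], k // 4)   # true standardized effects
ncp = d * np.sqrt(n / 2)                      # noncentrality of each t-test
df = 2 * n - 2
crit = stats.t.ppf(0.975, df)                 # two-sided alpha = .05

# True power of each study, and the average true power of the whole set
true_power = stats.nct.sf(crit, df, ncp) + stats.nct.cdf(-crit, df, ncp)
print(f"average true power:      {true_power.mean():.3f}")

# Run the studies once and count how many reach significance
t_obs = stats.nct.rvs(df, ncp, size=k, random_state=7)
print(f"observed discovery rate: {(np.abs(t_obs) > crit).mean():.3f}")
```

With ten thousand studies, the observed discovery rate lands within about a percentage point of the average true power, which is exactly the relationship z-curve exploits.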

Crucially, true power has nothing to do with pre-study (or pre-data) power, contrary to the claim made by Pek et al. Pre-study power is a hypothetical quantity based on researchers’ assumptions—often optimistic or wishful—about population effect sizes. These beliefs can influence study design decisions, such as planned sample size, but they cannot influence the outcome of a study. Study outcomes are determined by the actual population effect size and sampling variability, not by researchers’ expectations.

Pek et al. therefore conflate hypothetical pre-study power with true power in their description of EDR. This conflation is a fundamental conceptual error. Hypothetical power is irrelevant for interpreting observed results or evaluating their credibility. What matters for assessing the credibility of a body of empirical findings is the true power of the studies to produce statistically significant results, and EDR is explicitly designed to estimate that quantity.

Pek et al.’s misunderstanding of the z-curve estimands (i.e., the parameters the method is designed to estimate) undermines their more specific criticisms. If a critique misidentifies the target quantity, then objections about bias, consistency, or interpretability are no longer diagnostics of the method as defined; they are diagnostics of a different construct.

The situation is analogous to Bayesian critiques of NHST that proceed from an incorrect description of what p-values or Type I error rates mean. In that case, the criticism may sound principled, but it does not actually engage the inferential object used in NHST. Likewise here, Pek et al.’s argument rests on a category error about “power,” conflating hypothetical pre-study power (a design-stage quantity based on assumed effect sizes) with true power (the long-run success probability implied by the actual population effects and the study designs). Because z-curve’s EDR is an estimand tied to the latter, not the former, their critique is anchored in conceptual rather than empirical disagreement.

2. Z-Curve Does Not Follow the Law of Large Numbers

“simulation results further demonstrate that z-curve estimators can often be biased and inconsistent (i.e., they fail to follow the Law of Large Numbers), leading to potentially misleading conclusions.”

This statement is scientifically improper as written, for three reasons.

First, it generalizes from a limited set of simulation conditions to z-curve as a method in general. A simulation can establish that an estimator performs poorly under the specific data-generating process that was simulated, but it cannot justify a blanket claim about “z-curve estimators” across applications unless the simulated conditions represent the method’s intended model and cover the relevant range of plausible selection mechanisms. Pek et al. do not make those limitations explicit in the abstract, where readers typically take broad claims at face value.

Second, the statement is presented as if Pek et al.’s simulations settle the question, while omitting that z-curve has already been evaluated in extensive prior simulation work. That omission is not neutral: it creates the impression that the authors’ results are uniquely diagnostic, rather than one contribution within an existing validation literature. Because this point has been raised previously, continuing to omit it is not a minor oversight; it materially misleads readers about the evidentiary base for the method.

Third, and most importantly, their claim that z-curve estimates “fail to follow the Law of Large Numbers” is incorrect. Z-curve estimates are subject to ordinary sampling error, just like any other estimator based on finite data. A simple analogy is coin flipping: flipping a fair coin 10 times can, by chance, produce 10 heads, but flipping it 10,000 times will not produce 10,000 heads by chance. The same logic applies to z-curve. With a small number of studies, the estimated EDR can deviate substantially from its population value due to sampling variability; as the number of studies increases, those random deviations shrink. This is exactly why z-curve confidence intervals narrow as the number of included studies grows: sampling error decreases as the amount of information increases. Nothing about z-curve exempts it from this basic statistical principle. Suggesting otherwise implies that z-curve is somehow unique in how sampling error operates, when in fact it is a standard statistical model that estimates population parameters from observed data and, accordingly, becomes more precise as the sample size increases.
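The point is easy to verify with a few lines of code (a sketch with a hypothetical average true power of 40%; the specific value does not matter):

```python
import numpy as np

rng = np.random.default_rng(0)
true_power = 0.40  # hypothetical average true power of a set of studies

# Spread of the observed discovery rate for different numbers of studies K
for k in (25, 100, 1_000, 10_000):
    odr = rng.binomial(k, true_power, size=5_000) / k
    print(f"K = {k:>6}: SD of observed discovery rate = {odr.std():.3f}")
# Sampling error shrinks toward zero as K grows, exactly as the law of large
# numbers requires; z-curve confidence intervals narrow for the same reason.
```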

3. Sweeping Conclusion Not Supported by Evidence

“Accordingly, we do not recommend using 𝑍-curve to evaluate research findings.”

Based on these misrepresentations of z-curve, Pek et al. make a sweeping recommendation that z-curve estimates provide no useful information for evaluating published research and should be ignored. This recommendation is not only disproportionate to the issues they raise; it is also misaligned with the practical needs of emotion researchers. Researchers in this area have a legitimate interest in whether their literature resembles domains with comparatively strong replication performance or domains where replication has been markedly weaker. For example, a reasonable applied question is whether the published record in emotion research looks more like areas of cognitive psychology, where about 50% of results replicate, or more like social psychology, where about 25% replicate (Open Science Collaboration, 2015).

Z-curve is not a crystal ball capable of predicting the outcome of any particular future replication with certainty. Rather, the appropriate claim is more modest and more useful: z-curve provides model-based estimates that can help distinguish bodies of evidence that are broadly consistent with high average evidential strength from those that are more consistent with low average evidential strength and substantial selection. Used in that way, z-curve can assist emotion researchers in critically appraising decades of published results without requiring the field to replicate every study individually.

4. Ignoring the Replication Crisis That Led to The Development of Z-curve

“We advocate for traditional meta-analytic methods, which have a well-established history of producing appropriate and reliable statistical conclusions regarding focal research findings”

This statement ignores the fact that traditional meta-analyses ignore publication bias and have produced dramatically inflated effect size estimates. The authors ignore the need to take these biases into account to separate true findings from false ones.
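A minimal simulation illustrates the inflation mechanism that uncorrected meta-analyses inherit (the true effect size, sample size, and number of studies below are purely hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical scenario: a true effect of d = 0.2 studied with n = 20 per group.
d, n, k = 0.2, 20, 100_000
se = np.sqrt(2 / n)                               # approximate SE of the observed d
d_obs = rng.normal(d, se, size=k)                 # observed standardized effects
significant = d_obs / se > stats.norm.isf(0.025)  # two-sided test, positive direction

print(f"mean observed d, all studies:      {d_obs.mean():.2f}")
print(f"mean observed d, significant only: {d_obs[significant].mean():.2f}")
# Averaging only the significant results yields an estimate several times larger
# than the true effect -- the bias that traditional meta-analysis ignores.
```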

Article

5. False definition of EDR (again)

EDR (cf. statistical power) is described as “the long-run success rate in a series of exact replication studies” (Brunner & Schimmack, 2020, p. 1)”.

This quotation describes statistical power in Brunner and Schimmack (2020), not the Expected Discovery Rate (EDR). The EDR was introduced later in Bartoš and Schimmack (2022) as part of z-curve 2.0, and, as described above, the EDR is an estimate of average true power. While the power of a single study can be defined in terms of the expected long-run frequency of significant results (Cohen, 1988), it can also be defined as the probability of obtaining a significant result in a single study. This is the typical use of power in a priori power calculations to plan a specific study. More importantly, the EDR is defined as the average true power of a set of unique studies and does not assume that these studies are exact replications.

Thus, the error is not merely a misplaced citation, but a substantive misrepresentation of what EDR is intended to estimate. Pek et al. import language used to motivate the concept of power in Brunner and Schimmack (2020) and incorrectly present it as a defining interpretation of EDR. This move obscures the fact that EDR is a summary parameter of a heterogeneous literature, not a prediction about repeated replications of a single experiment.

6. Confusing Observed Data with Unobserved Population Parameters (Ontological Error)

Because z-curve analysis infers EDR from observed p-values, EDR can be understood as a measure of average observed power.

This statement is incorrect. To clarify the issue without delving into technical statistical terminology, consider a simple coin-toss example. Suppose we flip a coin that, unknown to us, is biased and produces heads 60% of the time, and we toss it 100 times. We observe 55 heads. In this situation, we have an observed outcome (55 heads), an unknown population parameter (the true probability of heads, 60%), and an unknown expected value (60 heads in 100 tosses). Based on the observed data, we attempt to estimate the true probability of heads or to test the hypothesis that the coin is fair (i.e., that the expected number of heads is 50 out of 100). Importantly, we do not confuse the observed outcome with the true probability; rather, we use the observed outcome as noisy information about an underlying parameter. That is, we treat 55% as a reasonable estimate of the true probability of heads and use a confidence interval to see whether it includes 50%. If it does not, we can reject the hypothesis that the coin is fair.

Estimating average true power works in exactly the same way. If 100 honestly reported studies yield 36 statistically significant results, the best estimate of the average true power of these studies is 36%, and we would expect a similar discovery rate if the same 100 studies were repeated under identical conditions (Open Science Collaboration, 2015). Of course, we recognize that the observed rate of 36% is influenced by sampling error and that a replication might yield, for example, 35 or 37 significant results. The observed outcome is therefore treated as an estimate of an unknown parameter, not as the parameter itself. The true average power is probably not 36%, but it is somewhere around this estimate and not 80%.
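To make the arithmetic explicit, a simple binomial confidence interval already rules out an average true power of 80% in this hypothetical example (a sketch using a normal approximation):

```python
import numpy as np
from scipy import stats

k, n = 36, 100                          # significant results out of 100 honest studies
p_hat = k / n                           # observed discovery rate = estimate of average true power
se = np.sqrt(p_hat * (1 - p_hat) / n)   # normal-approximation standard error
z975 = stats.norm.ppf(0.975)
lo, hi = p_hat - z975 * se, p_hat + z975 * se
print(f"estimate: {p_hat:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# The interval (roughly .27 to .45) excludes .80: an average true power of 80%
# is not compatible with observing only 36 significant results in 100 studies.
```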

The problem with so-called “observed power” calculations arises precisely when this distinction is ignored—when estimates derived from noisy data are mistaken for true underlying parameters. This is the issue discussed by Hoenig and Heisey (2001). There is nothing inherently wrong with computing power using effect-size estimates from a study (see, e.g., Yuan & Maxwell, 200x); the problem arises when sampling error is ignored and estimated quantities are treated as if they were known population values. In a single study, observed power could be 36% while true power is 80%, but in a reasonably large set of studies such large discrepancies will not occur.

Z-curve explicitly treats average true power as an unknown population parameter and uses the distribution of observed p-values to estimate it. Moreover, z-curve quantifies the uncertainty of this estimate by providing confidence intervals, and correct interpretations of z-curve results explicitly take this uncertainty into account. Thus, the alleged ontological error attributed to z-curve reflects a misunderstanding of basic statistical inference rather than a flaw in the method itself.

7. Modeling Sampling Error of Z-Values

z-curve analysis assumes independence among the K analyzed p-values, making the inclusion criteria for p-values critical to defining the population of inference…. Including multiple 𝑝-values from the same sampling unit (e.g., an article) violates the independence assumption, as 𝑝-values within a sampling unit are often correlated. Such dependence can introduce bias, especially because the 𝑍-curve does not account for unequal numbers of 𝑝-values across sampling units or within-unit correlations.

It is true that z-curve assumes that sampling error for a specific result converted into a z-value follows the standard normal distribution with a variance of 1. Correlations among results can lead to violations of this assumption. However, this does not imply that z-curve “fails” in the presence of any dependence, nor does it justify treating this point as a decisive objection to our application. Rather, it means that analysts should take reasonable steps to limit dependence or to use inference procedures that are robust to clustering of results within studies or articles.

A conservative way to meet the independence assumption is to select only one test per study or one test per article in multiple-study articles where the origin of results is not clear. It is also possible to use more than one result per study by computing bootstrap confidence intervals in which each bootstrap sample draws articles with replacement and selects one result at random from each sampled article. This is closely related to standard practices in meta-analysis for handling multiple dependent effects per study, where uncertainty is estimated with resampling or hierarchical approaches rather than by treating every effect size as independent. The practical impact of dependence also depends on the extent of clustering. In z-curve applications with large sets of articles (e.g., all articles in Cognition and Emotion), the influence of modest dependence is typically limited, and in our application we obtain similar estimates whether we treat results as independent or use clustered bootstrapping to compute uncertainty. Thus, even if Pek et al.’s point is granted in principle, it does not materially change the interpretation of our empirical findings about the emotion literature. Although we pointed this out in our previous review, the authors continue to misrepresent how our z-curve analyses addressed non-independence among p-values (e.g., by using clustered bootstrapping and/or one-test-per-study rules).
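To illustrate the general idea, here is a sketch of such a clustered bootstrap (an illustration of the principle, not the code used in our analyses; the estimator argument is a placeholder for whatever z-curve fitting function is applied):

```python
import numpy as np

def clustered_bootstrap_ci(z_by_article, estimator, n_boot=1000, seed=0):
    """Resample articles (the clustering unit) with replacement and draw one
    significant z-value at random from each sampled article, so that dependent
    results within an article are never treated as independent observations."""
    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(z_by_article), size=len(z_by_article))
        zs = np.array([rng.choice(z_by_article[i]) for i in idx])
        boot.append(estimator(zs))           # e.g., an EDR estimate for this sample
    return np.percentile(boot, [2.5, 97.5])  # percentile confidence interval
```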

8. Automatic Extraction of Test Statistics

Unsurprisingly, automated text mining methods for extracting test statistics has been criticized for its inability to reliably identify 𝑝-values suitable for forensic meta-analysis, such as 𝑍-curve analysis.

This statement fails to take into account the advantages and disadvantages of automatically extracting results from articles. The advantage is that we have nearly population-level data for research in the top two emotion journals. This makes it possible to examine time trends (Did power increase? Did selection bias decrease?). The main drawback is that automatic extraction does not, by itself, distinguish between focal tests (i.e., tests that bear directly on an article’s key claim) and non-focal tests. We are explicit about this limitation and also included analyses of hand-coded focal analyses to supplement the results based on automatically extracted test statistics. Importantly, our conclusion that z-curve estimates are similar across these coding approaches is consistent with an often-overlooked feature of Cohen’s (1962) classic assessment of statistical power: Cohen explicitly distinguished between focal and non-focal tests and reported that this distinction did not materially change his inferences about typical power. In this respect, our hand-coded focal analyses suggest that the inclusion of non-focal tests in large-scale automated extraction is not necessarily a fatal limitation for estimating average evidential strength at the level of a literature, although it remains essential to be transparent about what is being sampled and to supplement automated extraction with focal coding when possible.

Pek et al. accurately describe our automated extraction procedure as relying on reported test statistics (e.g., t, F), which are then converted into z-values for z-curve analysis. However, their subsequent criticism shifts to objections that apply specifically to analyses based on scraped p-values, such as concerns about rounded or imprecise information about p-values (e.g., p < .05) and their suitability for forensic meta-analysis. This criticism is valid, but it is also the reason why we do not use p-values for z-curve analysis when better information is available.
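A sketch of the kind of conversion involved, assuming two-sided tests (this is an illustration of the principle, not the exact code used in SS2024):

```python
from scipy import stats

def t_to_z(t, df):
    # |z| that corresponds to the same two-sided p-value as the t-statistic
    p = 2 * stats.t.sf(abs(t), df)
    return stats.norm.isf(p / 2)

def f_to_z(f, df1, df2):
    # |z| corresponding to the p-value of the F-test (sign is not recovered)
    p = stats.f.sf(f, df1, df2)
    return stats.norm.isf(p / 2)

print(round(t_to_z(2.10, df=38), 2))  # ~2.03: t(38) is already close to z
print(round(f_to_z(4.41, 1, 38), 2))  # same value: F with 1 numerator df is a squared t
```

Because the extracted test statistics are converted directly, the analysis does not depend on rounded or imprecise p-values such as “p < .05.”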

9. Pek et al.’s Simulation Study: What it really shows

Pek et al.’s description of their simulation study is confusing. They call one condition “no bias” and the other “bias.” The problem here is that “no bias” refers to a simulation in which selection bias is present: α = .05 serves as the selection mechanism. That is, studies are selected based on statistical significance, but there is no additional selection among statistically significant results. Most importantly, it is assumed that there is no further selection based on effect sizes.

Pek et al.’s simulation of “bias” instead implies that researchers would not publish a result if d = .2, but would publish it if d = .5, consistent with a selection mechanism that favors larger observed effects among statistically significant results. Importantly, their simulation does not generalize to other violations of the assumptions underlying z-curve. In particular, it represents only one specific form of within-significance selection and does not address alternative selection mechanisms that have been widely discussed in the literature.

For example, a major concern about the credibility of psychological research is p-hacking, where researchers use flexibility in data analysis to obtain statistically significant results from studies with low power. P-hacking has the opposite effect of Pek et al.’s simulated bias. Rather than boosting the representation of studies with high power, studies with low power are over-represented among the statistically significant results.

Pek et al. are correct that z-curve estimates depend on assumptions about the selection mechanism, but this is not a fundamental problem. All selection models necessarily rely on assumptions about how studies enter the published literature, and different models make different assumptions (e.g., selection on significance thresholds, on p-value intervals, or on effect sizes). Because the specific practices that generate bias in published results are unknown, no selection model can avoid such assumptions, and z-curve’s assumptions are neither unique nor unusually restrictive.

Pek et al.’s simulations are also confusing because they include scenarios in which all p-values are reported and analyzed. These conditions are not relevant for standard applications of z-curve that assume and usually find evidence of bias. Accordingly, we focus on the simulations that match the usual publication environment, in which z-curve is fitted to the distribution of statistically significant z-values.

Pek et al.’s figures are also easy to misinterpret because the y-axis is restricted to a very narrow range of values. Although EDR estimates can in principle range from alpha (5%) to 100%, the y-axis in Figure 1a spans only approximately 60% to 85%. This makes estimation errors look visually large even though they are numerically small.

In the relevant condition, the true EDR is 72.5%. For small sets of studies (e.g., K = 100), the estimated EDR falls roughly 10 percentage points below this value, a deviation that is visually exaggerated by the truncated y-axis. As the number of studies increases, the point estimate approaches the true value. In short, Pek et al.’s simulation reproduces Bartoš and Schimmack’s results that z-curve estimates are fairly accurate when bias is simply selection for significance.

The simulation based on selection by strength of evidence leads to an overestimation of the EDR. Here, smaller sets of studies appear more accurate because they underestimate the EDR and the two biases cancel out. More relevant is that with large samples, z-curve overestimates true average power by about 10 percentage points. This value is limited to one specific simulation of bias and could be larger or smaller. The main point of this simulation is to show that z-curve estimates depend on the type of selection bias in a set of studies. The simulation does not tell us the nature of actual selection biases or the amount of bias in z-curve estimates that violations of the selection assumption introduce.

From a practical point of view, an overestimation by 10 percentage points is not fatal. If the EDR estimate is 80% and the true average power is only 70%, the literature is still credible. The problem is bigger for literatures that already have low EDRs, like experimental social psychology. With an EDR of 21%, a 10-percentage-point correction would reduce the EDR to 11%, and the lower bound of the CI would include 5% (Schimmack, 2020), implying that all significant results could be false positives. Thus, Pek et al.’s simulation suggests that z-curve estimates may be overly optimistic. In fact, z-curve overestimates replicability compared to actual replication outcomes in the reproducibility project (Open Science Collaboration, 2015). Pek et al.’s simulations suggest that selection for effect sizes could be one reason, but other reasons cannot be ruled out.

Simulation results for the False Discovery Risk and bias (Observed Discovery Rate minus Expected Discovery Rate) are the same because these quantities are a direct function of the EDR. The Expected Replication Rate (ERR), the average true power of the significant results, is a different parameter, but shows the same pattern.
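For readers unfamiliar with this dependency, the False Discovery Risk is an upper bound computed from the EDR (following Sorić, 1989):

$$\text{FDR}_{\max} \;=\; \frac{1-\text{EDR}}{\text{EDR}} \cdot \frac{\alpha}{1-\alpha}$$

With α = .05, an EDR of 5% implies a maximum FDR of 100%, which is why a confidence interval that reaches down to 5% means that all significant results could be false positives.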

In short, Pek et al.’s simulations show that z-curve estimates depend on the actual selection processes that are unknown, but that does not invalidate z-curve estimates. Especially important is that z-curve evaluations of credibility are asymmetrical (Schimmack, 2012). Low values raise concerns about a literature, but high values do not ensure credibility (Soto & Schimmack, 2024).

Specific Criticism of the Z-Curve Results in the Emotion Literature

10. Automatic Extraction (Again)

“Based on our discussion on the importance of determining independent sampling units, formulating a well-defined research question, establishing rigorous inclusion and exclusion criteria for 𝑝-values, and conducting thorough quality checks on selected 𝑝-values, we have strong reservations about the methods used in SS2024.” (Pek et al.)

As already mentioned, the population of all statistical hypothesis tests reported in a literature is meaningful for researchers in this area. Concerns about low replicability and high false positive rates have undermined the credibility of the empirical foundations of psychological research. We examined this question empirically using all available statistical test results. This defines a clearly specified population of reported results and a well-defined research question. The key limitation remains that automatic extraction does not distinguish focal and non-focal results. We believe that information for all tests is still important. After all, why are they reported if they are entirely useless? Does it not matter whether a manipulation check was important or whether a predicted result was moderated by gender? Moreover, it is well known that focality is often determined only after results are known in order to construct a compelling narrative (Kerr, 1998). A prominent illustration is provided by Cesario, Plaks, and Higgins (2006), where a failure to replicate the original main effect was nonetheless presented as a successful conceptual replication based on a significant moderator effect.

Pek et al. further argue that analyzing all reported tests violates the independence assumption. However, our inference relied on bootstrapping with articles as the clustering unit, which is the appropriate approach when multiple test statistics are nested within articles and directly addresses the dependence they emphasize. In addition, SS2024 reports z-curve analyses based on hand-coded focal tests that are not subject to these objections; these results are not discussed in Pek et al.’s critique.

11. No Bias in Psychology

“Even if the 𝑍-curve estimates and their CIs are unbiased and exhibit proper coverage, SS2024’s claim of selection bias in emotional research – based on observing that EDR for both journals were not contained within their respective 95% CIs for ODR – is dubious”.

It is striking that Pek et al. question z-curve evidence of publication bias. Even setting aside z-curve entirely, it is difficult to defend the assumption of honest and unbiased reporting in psychology. Sterling (1959) already noted that success rates approaching those observed in the literature are implausible under unbiased reporting, and subsequent surveys have repeatedly documented overwhelmingly high rates of statistically significant findings (Sterling et al., 1995).

To dismiss z-curve evidence of selection bias as “dubious” would therefore require assuming that average true power in psychology is extraordinarily high. This assumption is inconsistent with longstanding evidence that psychological studies are typically underpowered to detect even moderate effect sizes, with average power estimates far below conventional benchmarks (Cohen, 1988). None of these well-established considerations appear to inform Pek et al.’s evaluation of z-curve, which treats its results in isolation from the broader empirical literature on publication bias and research credibility. In this broader context, the combination of extremely high observed discovery rates for focal tests and low EDR estimates—such as the EDR of 27% reported in SS2024—is neither surprising nor dubious, but aligns with conclusions drawn from independent approaches, including large-scale replication efforts (Open Science Collaboration, 2015).

12. Misunderstanding of Estimation

Inference using these estimators in the presence of bias would be misleading because the estimators converge onto an incorrect value.

This statement repeats the fallacy of drawing general conclusions about the interpretability of z-curve from a specific, stylized simulation. In addition, Pek et al.’s argument effectively treats point estimates as the sole inferential output of z-curve analyses while disregarding uncertainty. Point estimates are never exact representations of unknown population parameters. If this standard were applied consistently, virtually all empirical research would have to be dismissed on the grounds that estimates are imperfect. Instead, estimates must be interpreted in light of their associated uncertainty and reasonable assumptions about error.

For the 227 significant hand-coded focal tests, the point estimate of the EDR was 27%, with a confidence interval ranging from 10% to 67%. Even if one were to assume an overestimation of 10 percentage points, as suggested by Pek et al.’s most pessimistic simulation scenario, the adjusted estimate would be 17%, and the lower bound of the confidence interval would include 5%. Under such conditions, it cannot be ruled out that a substantial proportion—or even all—statistically significant focal results in this literature are false positives. Rather than undermining our conclusions, Pek et al.’s simulation therefore reinforces the concern that many focal findings in the emotion literature may lack evidential value. At the same time, the width of the confidence interval also allows for more optimistic scenarios. The appropriate response to this uncertainty is to code and analyze additional studies, not to dismiss z-curve results simply because they do not yield perfect estimates of unknown population parameters.

13. Conclusion Does Not Follow From the Arguments

“𝑍-curve as a tool to index credibility faces fundamental challenges – both at the definitional and interpretational levels as well as in the statistical performance of its estimators.”

This conclusion does not follow from Pek et al.’s analyses. Their critique rests on selective simulations, treats point estimates as decisive while disregarding uncertainty, and evaluates z-curve in isolation from the broader literature on publication bias, statistical power, and replication. Rather than engaging with z-curve’s assumptions, scope, and documented performance under realistic conditions, their argument relies on narrow counterexamples that are then generalized to broad claims about invalidity.

More broadly, the article exemplifies a familiar pattern in which methodological tools are evaluated against unrealistic standards of perfection rather than by their ability to provide informative, uncertainty-qualified evidence under real-world conditions. Such standards would invalidate not only z-curve, but most statistical methods used in empirical science. When competing conclusions are presented about the credibility of a research literature, the appropriate response is not to dismiss imperfect tools, but to weigh the totality of evidence, assumptions, and robustness checks supporting each position.

We can debate whether the average true power of studies in the emotion literature is closer to 5% or 50%, but there is no plausible scenario under which average true power would justify success rates exceeding 90%. We can also debate the appropriate trade-off between false positives and false negatives, but it is equally clear that the standard significance criterion does not warrant the conclusion that no more than 5% of statistically significant results are false positives, especially in the presence of selection bias and low power. One may choose to dismiss z-curve results, but what cannot be justified is a return to uncorrected effect-size meta-analyses that assume unbiased reporting. Such approaches systematically inflate effect-size estimates and can even produce compelling meta-analytic evidence for effects that do not exist, as vividly illustrated by Bem’s (2011) meta-analysis of extrasensory perception findings.

Postscript

Ideally, the Schimmack-Pek controversy will attract some attention from human third parties with sufficient statistical expertise to understand the issues and weigh in on this important matter. As Pek et al. point out, a statistical tool that can distinguish credible from non-credible research is needed. Effect size meta-analyses are also increasingly recognizing the need to correct for bias, and new methods show promise. Z-curve is one tool among others. Rather than dismissing these attempts, we need to improve them, because we cannot go back to the time when psychologists were advised to err on the side of discovery (Bem, 2000).

Guest Post by Jerry Brunner: Response to an Anonymous Reviewer

Introduction

Jerry Brunner recently became an emeritus professor in the Department of Statistics at the University of Toronto Mississauga. Jerry first started in psychology, but was frustrated by the unscientific practices he observed in graduate school. He went on to become a professor of statistics. Thus, he is not only an expert in statistics; he also understands the methodological problems in psychology.

Sometime in the wake of the replication crisis, around 2014/15, I went to his office to talk to him about power and bias detection. Working with Jerry was educational and motivational. Without him, z-curve would not exist. We spent years trying different methods and thinking about the underlying statistical assumptions. Simulations often shattered our intuitions. The Brunner and Schimmack (2020) article summarizes all of this work.

A few years later, the method is being used to examine the credibility of published articles across different research areas. However, not everybody is happy about a tool that can reveal publication bias, the use of questionable research practices, and a high risk of false positive results. An anonymous reviewer dismissed z-curve results based on a long list of criticisms (Post: Dear Anonymous Reviewer). It was funny to see how ChatGPT responds to these criticisms (Comment). However, the quality of ChatGPT responses is difficult to evaluate. Therefore, I am pleased to share Jerry’s response to the reviewer’s comments here. Let’s just say that the reviewer was wise to make their comments anonymously. Posting the review and the response in public also shows why we need open reviews like the ones published in Meta-Psychology by the reviewers of our z-curve article. Hidden and biased reviews are just one more reason why progress in psychology is so slow.

Jerry Brunner’s Response

This is Jerry Brunner, the “Professor of Statistics” mentioned in the post. I am also co-author of Brunner and Schimmack (2020). Since the review Uli posted is mostly an attack on our joint paper (Brunner and Schimmack, 2020), I thought I’d respond.

First of all, z-curve is sort of a moving target. The method described by Brunner and Schimmack is strictly a way of estimating population mean power based on a random sample of tests that have been selected for statistical significance. I’ll call it z-curve 1.0. The algorithm has evolved over time, and the current z-curve R package (available at https://cran.r-project.org/web/packages/zcurve/index.html) implements a variety of diagnostics based on a sample of p-values. The reviewer’s comments apply to z-curve 1.0, and so do my responses. This is good from my perspective, because I was in on the development of z-curve 1.0, and I believe I understand it pretty well. When I refer to z-curve in the material that follows, I mean z-curve 1.0. I do believe z-curve 1.0 has some limitations, but they do not overlap with the ones suggested by the reviewer.

Here are some quotes from the review, followed by my answers.

(1) “… z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis …”

This is not accurate. Post-hoc power analysis is indeed fatally flawed; z-curve is something quite different. For later reference, in the “observed” power method, sample effect size is used to estimate population effect size for a single study. Estimated effect size is combined with observed sample size to produce an estimated non-centrality parameter for the non-central distribution of the test statistic, and estimated power is calculated from that, as an area under the curve of the non-central distribution. So, the observed power method produces an estimated power for an individual study. These estimates have been found to be too noisy for practical use.

The confusion of z-curve with observed power comes up frequently in the reviewer’s comments. To be clear, z-curve does not estimate effect sizes, nor does it produce power estimates for individual studies.

(2) “It should be noted that power is not a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should always be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.”

The reviewer appears to believe that once the results of a study are in, the study no longer has a power. To clear up this misconception, I will describe the model on which z-curve is based.

There is a population of studies, each with its own subject population. One designated significance test will be carried out on the data for each study. Given the subject population, the procedure and design of the study (including sample size), significance level and the statistical test employed, there is a probability of rejecting the null hypothesis. This probability has the usual frequentist interpretation; it’s the long-term relative frequency of rejection based on (hypothetical) repeated sampling from the particular subject population. I will use the term “power” for the probability of rejecting the null hypothesis, whether or not the null hypothesis is exactly true.

Note that the power of the test — again, a member of a population of tests — is a function of the design and procedure of the study, and also of the true state of affairs in the subject population (say, as captured by effect size).

So, every study in the population of studies has a power. It’s the same before any data are collected, and after the data are collected. If the study were replicated exactly with a fresh sample from the same population, the probability of observing significant results would be exactly the power of the study — the true power.

This takes care of the reviewer’s objection, but let me continue describing our model, because the details will be useful later.

For each study in the population of studies, a random sample is drawn from the subject population, and the null hypothesis is tested. The results are either significant, or not. If the results are not significant, they are rejected for publication, or more likely never submitted. They go into the mythical “file drawer,” and are no longer available. The studies that do obtain significant results form a sub-population of the original population of studies. Naturally, each of these studies has a true power value. What z-curve is trying to estimate is the population mean power of the studies with significant results.

So, we draw a random sample from the population of studies with significant results, and use the reported results to estimate population mean power — not of the original population of studies, but only of the subset that obtained significant results. To us, this roughly corresponds to the mean power in a population of published results in a particular field or sub-field.

Note that there are two sources of randomness in the model just described. One arises from the random sampling of studies, and the other from random sampling of subjects within studies. In an appendix containing the theorems, Brunner and Schimmack liken designing a study (and choosing a test) to the manufacture of a biased coin with probability of heads equal to the power. All the coins are tossed, corresponding to running the subjects, collecting the data and carrying out the tests. Then the coins showing tails are discarded. We seek to estimate the mean P(Head) for all the remaining coins.

(3) “In Brunner and Schimmack (2020), there is a problem with ‘Theorem 1 states that success rate and mean power are equivalent …’ Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size.”

There is no problem with Theorem 1. The theorem says that in the coin tossing experiment just described, suppose you (1) randomly select a coin from the population, and (2) toss it — so there are two stages of randomness. Then the probability of observing a head is exactly equal to the mean P(Heads) for the entire set of coins. This is pretty cool if you think about it. The theorem makes no use of the concept of effect size. In fact, it’s not directly about estimation at all; it’s actually a well-known result in pure probability, slightly specialized for this setting. The reviewer says “Focusing on observed power …” But why would he or she focus on observed power? We are talking about true power here.
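In symbols, the theorem is just the law of total probability applied to the two-stage experiment. Let $W$ denote the power of a randomly selected coin (study), so that, conditional on $W$, the test rejects with probability $W$. Then

$$\Pr(\text{reject}) \;=\; \operatorname{E}\big[\Pr(\text{reject}\mid W)\big] \;=\; \operatorname{E}[W],$$

that is, the probability of observing a significant result equals the population mean power, with no reference to effect sizes or to observed power.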

(4) “Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution).”

They said it couldn’t be done. Actually, deriving the distribution of the p-value under the alternative hypothesis is a reasonable homework problem for a masters student in statistics. I could give some hints …
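For instance, for a one-sided z-test with noncentrality $\delta$, the test statistic satisfies $Z \sim N(\delta, 1)$ under the alternative and the p-value is $P = 1 - \Phi(Z)$. Then for $0 < p < 1$,

$$\Pr(P \le p) \;=\; \Pr\big(Z \ge \Phi^{-1}(1-p)\big) \;=\; 1 - \Phi\big(\Phi^{-1}(1-p) - \delta\big),$$

with density

$$f(p) \;=\; \frac{\phi\big(\Phi^{-1}(1-p) - \delta\big)}{\phi\big(\Phi^{-1}(1-p)\big)} \;=\; \exp\!\Big(\delta\,\Phi^{-1}(1-p) - \tfrac{\delta^{2}}{2}\Big),$$

which reduces to the uniform density when $\delta = 0$ and is strictly decreasing in $p$ when $\delta > 0$.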

(5) “Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).”

Indeed, z-curve uses only p-values, from which useful estimates of effect size cannot be recovered. As previously stated, z-curve does not estimate power for individual studies. However, the reviewer is aware that p-values have a probability distribution. Intuitively, shouldn’t the distribution of p-values and the distribution of power values be connected in some way? For example, if all the null hypotheses in a population of tests were true so that all power values were equal to 0.05, then the distribution of p-values would be uniform on the interval from zero to one. When the null hypothesis of a test is false, the distribution of the p-value is right skewed and strictly decreasing (except in pathological artificial cases), with more of the probability piling up near zero. If average power were very high, one might expect a distribution with a lot of very small p-values. The point of this is just that the distribution of p-values surely contains some information about the distribution of power values. What z-curve does is to massage a sample of significant p-values to produce an estimate, not of the entire distribution of power after selection, but just of its population mean. It’s not an unreasonable enterprise, in spite of what the reviewer thinks. Also, it works well for large samples of studies. This is confirmed in the simulation studies reported by Brunner and Schimmack.

(6) “The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic.”

Okay, no problem. As described above, every study in the population of studies has its own test statistic, its own true (not estimated) effect size, its own sample size — and therefore its own true power. The relative frequency histogram of these numbers is the true population distribution of power.

(7) “There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figure 1 & 2 in Brunner and Schimmack, 2000, respectively).”

Right. These were examples, illustrating the distribution of power before versus after selection for significance — as given in Theorem 2. Theorem 2 applies to any distribution of true power values.

(8) “If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores … it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution …”

The reviewer is objecting to Step 1 of constructing a z-curve estimate, given on page 6 of Brunner and Schimmack (2020). We start with a sample of significant p-values, arising from a variety of statistical tests, various F-tests, chi-squared tests, whatever — all with different sample sizes. Then we pretend that all the tests were actually two-sided z-tests with the results in the predicted direction, equivalent to one-sided z-tests with significance level 0.025. Then we transform the p-values to obtain the z statistics that would have generated them, had they actually been z-tests. Then we do some other stuff to the z statistics.

But as the reviewer notes, most of the tests probably are not z-tests. The distributions of their p-values, which depend on the non-central distributions of their test statistics, are different from one another, and also different from the distribution for genuine z-tests. Our paper describes it as an approximation, but why should it be a good approximation? I honestly don’t know, and I have given it a lot of thought. I certainly would not have come up with this idea myself, and when Uli proposed it, I did not think it would work. We both came up with a lot of estimation methods that did not work when we tested them out. But when we tested this one, it was successful. Call it a brilliant leap of intuition on Uli’s part. That’s how I think of it.

Uli’s comment.
It helps to know your history. Well before psychologists focused on effect sizes for meta-analysis, Fisher already had a method to meta-analyze p-values. P-curve is just a meta-analysis of p-values with a selection model. However, p-values have ugly distributions, and Stouffer proposed the transformation of p-values into z-scores to conduct meta-analyses. This method was used by Rosenthal to compute the fail-safe N, one of the earliest methods to evaluate the credibility of published results (Fail-Safe-N). Ironically, even the p-curve app started using this transformation (p-curve changes). Thus, p-curve is really a version of z-curve. The problem with p-curve is that it has only one parameter and cannot model heterogeneity in true power. This is the key advantage of z-curve 1.0 over p-curve (Brunner & Schimmack, 2020). P-curve is even biased when all studies have the same population effect size but different sample sizes, which leads to heterogeneity in power (Brunner, 2018).

Such things are fairly common in statistics. An idea is proposed, and it seems to work. There’s a “proof,” or at least an argument for the method, but the proof does not hold up. Later on, somebody figures out how to fill in the missing technical details. A good example is Cox’s proportional hazards regression model in survival analysis. It worked great in a large number of simulation studies, and was widely used in practice. Cox’s mathematical justification was weak. The justification starts out being intuitively reasonable but not quite rigorous, and then deteriorates. I have taught this material, and it’s not a pleasant experience. People used the method anyway. Then decades after it was proposed by Cox, somebody else (Aalen and others) proved everything using a very different and advanced set of mathematical tools. The clean justification was too advanced for my students.

Another example (from mathematics) is Fermat’s last theorem, which took over 300 years to prove. I’m not saying that z-curve is in the same league as Fermat’s last theorem, just that statistical methods can be successful and essentially correct before anyone has been able to provide a rigorous justification.

Still, this is one place where the reviewer is not completely mixed up.

Another Uli comment
Undergraduate students are often taught different test statistics and distributions as if they were totally different. However, most tests in psychology are practically z-tests. Just look at a t-distribution with N = 40 (df = 38) and try to see the difference from a standard normal distribution. The difference is tiny and becomes invisible when you increase sample sizes above 40! And F-tests? F-values with 1 numerator degree of freedom are just squared t-values, so the square root of these is practically a z-test. But what about chi-square? Well, with 1 df, chi-square is just a squared z-score, so we can use the square root and have a z-score. But what if we don’t have two groups and instead compute correlations or regressions? Well, the statistical significance test uses the t-distribution, and sample sizes are often well above 40. So, t and z are practically identical. It is therefore not surprising to me that results from different test statistics can be approximated with the standard normal distribution. We could make teaching statistics so much easier, instead of confusing students with F-distributions. The only exceptions are complex designs with 3 x 4 x 5 ANOVAs, but they don’t really test anything and are just used to p-hack. A quick numerical check of the t-versus-z point follows below; then, rant over and back to Jerry.
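A sketch using scipy (the df values are arbitrary examples):

```python
from scipy import stats

# Two-sided critical values at alpha = .05: t with various df versus the standard normal
for df in (10, 38, 100):
    print(f"df = {df:>3}: t critical = {stats.t.ppf(0.975, df):.3f}  vs  z critical = {stats.norm.ppf(0.975):.3f}")
# With df = 38 the critical values are 2.024 vs 1.960; the gap keeps shrinking as df grows.
```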

(9) “It is unclear how Theorem 2 is related to the z-curve procedure.”

Theorem 2 is about how selection for significance affects the probability distribution of true power values. Z-curve estimates are based only on studies that have achieved significant results; the others are hidden, by a process that can be called publication bias. There is a fundamental distinction between the original population of power values and the sub-population belonging to studies that produce significant results. The theorems in the appendix are intended to clarify that distinction. The reviewer believes that once significance has been observed, the studies in question no longer even have true power values. So, clarification would seem to be necessary.

(10) “In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant?”

If there were no selection for significance, this is what a reasonable person would do. But the point of the paper, and what makes the estimation problem challenging, is that all we can observe are statistics from studies with p < 0.05. Publication bias is real, and z-curve is designed to allow for it.

(11) “To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error …”

Wrong. Power is a feature of the design of a study, the significance test, and the subject population. All of these features still exist after data have been collected and the test is carried out.

Uli and Jerry comment:
Whenever a psychologist uses the word “ontological,” be very skeptical. Most psychologists who use the word understand philosophy as well as this reviewer understands statistics.

(12) “The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice.”

Observed power (a bad estimate of true power) is not the probability of significance upon exact replication. True power is the probability of significance upon exact replication. It’s based on true effect size, not estimated effect size. We were talking about true power, and we mistakenly thought that was obvious.

(13) “The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science.) In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.”

This criticism would have been somewhat justified if the simulations had used p-values from a bunch of z-tests. However, they did not. The simulations reported in the paper are all F-tests with one numerator degree of freedom, and denominator degrees of freedom depending on the sample size. This covers all the tests of individual regression coefficients in multiple regression, as well as comparisons of two means using two-sample (and even matched) t-tests. Brunner and Schimmack say (p. 8):

“Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.”

So I was going to refer the reader (and the anonymous reviewer, who is probably not reading this post anyway) to the supplementary materials. Fortunately I checked first, and found that the supplementary materials include a bunch of OSF stuff like the letter submitting the article for publication, and the reviewers’ comments and so on — but not the full set of simulations. Oops.

All the code and the full set of simulation results are posted at

https://www.utstat.utoronto.ca/brunner/zcurve2018

You can download all the material in a single file at

https://www.utstat.utoronto.ca/brunner/zcurve2018.zip

After expanding, just open index.html in a browser.

Actually, we ran many more simulation studies than this, but you have to draw the line somewhere. The point is that z-curve performs well for large numbers of studies with chi-squared test statistics as well as F statistics, all with varying degrees of freedom.
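
For readers who want a feel for what such simulations look like, here is a sketch of the general setup: heterogeneous noncentral F(1, df) tests whose significant p-values are converted to absolute z-scores, the input z-curve works with. This is an illustration under assumed distributions of degrees of freedom and noncentrality, not the posted code.

```python
# Sketch of the general kind of setup described (not the posted simulation code).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
k = 20_000
df2 = rng.choice([18, 38, 98, 198], size=k)               # assumed error degrees of freedom
lam = rng.gamma(shape=2.0, scale=2.0, size=k)             # assumed noncentrality parameters
F = stats.ncf.rvs(1, df2, lam, size=k, random_state=rng)  # one F(1, df2) statistic per study
p = stats.f.sf(F, 1, df2)                                 # reported p-values
z = stats.norm.isf(p / 2)                                 # two-sided p converted to |z|

crit = stats.f.isf(0.05, 1, df2)
true_power = stats.ncf.sf(crit, 1, df2, lam)              # each study's true power
sig = p < 0.05
print("mean true power of the significant studies:", round(true_power[sig].mean(), 3))
print("first selected z-scores z-curve would analyze:", np.round(z[sig][:5], 2))
```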

(14) “The simulation study was conducted for the performance of the z-curve on constrained scenarios including F-tests with df = 1 and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.”

Now the reviewer is talking about the paper that was actually under review. The mistake is natural, because of our (my) error in not making sure that the full set of simulations was included in the supplementary materials. The conditions in question are not unexplored; they are thoroughly explored, and the accuracy of z-curve for large samples is confirmed.

(15+) There are some more comments by the reviewer, but these are strictly about the paper under review, and not about Brunner and Schimmack (2020). So, I will leave any further response to others.

Dear Anonymous Reviewer…

Peer-review is the foundation of science. Peer-reviewers work hard to evaluate manuscripts to see whether they are worthy of being published, especially in old-fashioned journals with strict page limitations. Their hard work often goes unnoticed because peer-reviews remain unpublished. This is a shame. A few journals have recognized that science might benefit from publishing reviews. Not all reviews are worthy of publication, but when a reviewer spends hours, if not days, writing a long and detailed comment, it seems only fair to share the fruits of their labor in public. Unfortunately, I am not able to give credit to Reviewer 1, who was too modest or shy to share their name. This does not undermine the value they created, and I hope the reviewer may find the courage to take credit for their work.

Reviewer 1 was asked to review a paper that used z-curve to evaluate the credibility of research published in the leading emotion journals. Yet, going beyond the assigned task, Reviewer 1 gave a detailed and thorough review of the z-curve method that showed the deep flaws of this statistical method that had been missed by reviewers of articles that promoted this dangerous and misleading tool. After a theoretical deep-dive into the ontology of z-curve, Reviewer 1 points out that simulation studies seem to have validated the method. Yet, Reviewer 1 was quick to notice that the simulations were a sham, designed to show that z-curve works rather than to see it fail in applications to more realistic data. Deeply embarrassed, my co-authors, including a Professor of Statistics, are now contacting journals to retract our flawed articles.

Please find the damaging review of z-curve below.

P.S. We are also offering a $200 reward for credible simulation studies that demonstrate that z-curve is crap.

P.P.S. Some readers seem to have missed the sarcasm and taken the criticism by Reviewer 1 seriously. The problem is a lack of expertise to evaluate the conflicting claims. To make it easy, I share an independent paper that validated z-curve with actual replication outcomes. Not sure how Reviewer 1 would explain the positive outcome. Maybe we hacked the replication studies, too?

Röseler, L. (2023). Predicting replication rates with z-curve: A brief exploratory validation study using the Replication Database. MetaArXiv ewb2t, Center for Open Science.

ANONYMOUS, July 17, 2024

Referee: 1

Comments to the Author
The manuscript “Credibility of results in emotion science: A z-curve analysis of results in the journals Cognition & Emotion and Emotion” (CEM-DA.24) presents results from a z-curve analysis on reported statistics (t-tests, F-tests, and chi-square tests with df < 6 and 95% confidence intervals) for empirical studies (excluding meta-analysis) published in Cognition & Emotion from 1987 to 2023 and Emotion from 2001 to 2023. The purposes of reporting results from a z-curve analysis are to (a) estimate selection bias in emotion research and (b) predict a success rate in replication studies.

I have strong reservations about the conclusions drawn by the authors that do not seem to be strongly supported by their reported results. Specifically, I am not confident that conclusions from z-curve results justify the statements made in the paper under review. Below, I outline the main concerns that center on the z-curve methodology that unfortunately focuses on providing a review on Brunner and Schimmack (2020) and not so much on the current paper.

VAGUE METHODOLOGY. The authors make strong claims about what conclusions can be drawn from z-curve analyses. Their presentation of z-curve analysis in the present paper is declarative and does not provide the necessary information to describe the assumptions made by the method, how it works, when it fails, etc. The authors cite previous publications on z-curve (Brunner & Schimmack, 2020; Bartos & Schimmack, 2022; Schimmack & Bartos, 2023). Furthermore, this work ignores recent criticism in the literature about such statistical forensics. One example questioning the validity of conclusions by tests of credibility/replicability (e.g., p-curve, Francis’s [2013] consistency test) is in a talk by Richard Morey titled “Statistical games: Flawed thinking of popular methods for assessing reproducibility” (https://www.youtube.com/watch?v=c0G98qp1cf4). The talk was based on Morey (2013). Other authors who have written on this topic include McShane, Böckenholt, and Hansen (2020) and Pek, Hoisington-Shaw, & Wegener (2022).

==
Morey, R. D. (2013). The consistency test does not–and cannot–deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology, 57(5), 180–183. https://doi.org/10.1016/j.jmp.2013.03.004

McShane, B. B., Böckenholt, U., & Hansen, K. T. (2020). Average Power: A Cautionary Note. Advances in Methods and Practices in Psychological Science, 3(2), 185–199. https://doi.org/10.1177/2515245920902370

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153–169. https://doi.org/10.1016/j.jmp.2013.02.003

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2022). Avoiding Questionable Research Practices Surrounding Statistical Power Analysis. In W. O’Donohue, A. Masuda, & S. Lilienfeld (Eds.), Avoiding Questionable Practices in Applied Psychology (pp. 243–267). Springer. https://doi.org/10.1007/978-3-031-04968-2_11
==

In reading Brunner and Schimmack (2020), z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis (whether it be for a single study or for a set of studies; see Pek, Hoisington-Shaw, & Wegener, in press for a treatment of this misconception).

It should be noted that power is *not* a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should *always* be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

O’Keefe, D. J. (2007). Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: Sorting out appropriate uses of statistical power analyses. Communication Methods and Measures, 1(4), 291–299. https://doi.org/10.1080/19312450701641375
==

In Brunner and Schimmack (2020), there is a problem with “Theorem 1 states that success rate and mean power are equivalent even if the set of coins is a subset of all coins.” Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size. Methodological papers that deal with power analysis making use of estimated effect size show that the uncertainty due to sampling variability is extremely high (e.g., see Anderson et al., 2017; McShane & Böckenholt, 2016); it is worse when effects are random (cf. random effects meta-analysis; see McShane, Böckenholt, & Hansen, 2020; Pek, Pitt, & Wegener, 2024). Accepting that effects are random seems to be more consistent with what we observe in empirical results of the same topic. The extent of uncertainty in power estimates (based on observed effects) is so high that much cannot be concluded with such imprecise calculations.

Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution). However, because p-values have sampling variability (and an unknown sampling distribution), one cannot take a significant p-value to deterministically indicate a tally on power (which assumes that an unknown specific effect size is true). Stated differently, a significant p-value can be consistent with a Type I error. Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).

There seems to be some conceptual slippage on the meaning of power here because what the authors call power does not seem to have the defining features of power.

The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic. There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figure 1 & 2 in Brunner and Schimmack, 2000, respectively). If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores. P-values are statistics and follow a sampling distribution; the variance of the sampling distribution is a function of sample size. So, it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution. If the further argument is using a mixture of z-distributions to estimate the distribution of the z-scores, then these z-scores are not technically z-scores in that they are nor distributed following the z-distribution. We might estimate the standard error of the mixture of z-distributions to rescale the distribution again to a z-distribution… but to what end? Again, there is some conceptual slippage in what is meant by a z-score. If the distribution of p-values that have been transformed to a z-score is not a z-distribution and then the mixture distribution is then shaped back into a z-distribution (with truncations that seem arbitrary) so that the critical value of 1.96 can be used – I’m not sure what the resulting distribution is of, anymore. A related point is that we do not yet know whether p-values are transformation invariant (in distribution) under a z-score transformation. Furthermore, the distribution for power invoked in Theorem 1 is not a function of sample size, effect size, or statistical procedure, suggesting that the assumed distribution does not align well with the features that we know influence power. It is unclear how Theorem 2 is related to the z-curve procedure. Again, there seems to be some conceptual slippage involved with p-values being transformed into z-scores that somehow give us an estimate of power (without stating the effect size, sample size, or procedure).

In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant? After all, p-values can be transformed to z-scores and vice-versa in that they carry the same information. But then, there is a problem of p-values having sampling variability and might be consistent with Type I error. A transformation from p to z will not fix sampling variability.

To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error about the meaning of frequentist probability. A thought experiment might help. Suppose I completed a study, and the p-value is .50. I convert this p-value to a z-score for a two-tailed test and get 0.67. Let’s say I collect a bunch of studies and do this and get a distribution of z-scores (that don’t end up being distributed z). I do a bunch of things to make this distribution become a z-distribution. Then, I define power as the proportion of z-scores above the cutoff of 1.96. We are now calling power a collection of z-scores above 1.96 (without controlling for sample size, effect size, and procedure). This newly defined “power” based on the z-distribution does not reflect the original definition of power (area under the curve for a specific effect size, a specific procedure, and a specific sample size, assuming the Type I error is .05). This conceptual slippage is akin to burning a piece of wood, putting the ashes into a mold that looks like wood, and calling the molded ashes wood.

The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice. Furthermore, recall that power is a property of a procedure and is not a property of completed data (cf. ontological error), thus using observed power to quantify replicability presents replicability as a property of a procedure and not about the robustness of an observed effect. Again, there seems to be some conceptual slippage occurring here on what is meant by replication versus what is quantifying replication (which should not be observed power).

The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science.) In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.

Further, the evidence that z-curve performs well is specific to the assumptions within the simulation study. If p-values were generated in a different way to reflect a competing tentative process, the performance of the z-curve would be different.  The simulation study was conducted for the performance of the z-curve on constrained scenarios including F-tests with df = 1 and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.

==
Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724

Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33(2), 175–183. https://doi.org/10.1111/j.1469-8986.1996.tb02121.x

McShane, B. B., & Böckenholt, U. (2016). Planning Sample Sizes When Effect Sizes Are Uncertain: The Power-Calibrated Effect Size Approach. Psychological Methods, 21(1), 47–60. https://doi.org/10.1037/met0000036

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (in press). Uses of uncertain statistical power: Designing future studies, not evaluating completed studies. Psychological Methods. https://www.researchgate.net/publication/368358276_Uses_of_uncertain_statistical_power_Designing_future_studies_not_evaluating_completed_studies

Pek, J., Pitt, M. A., & Wegener, D. T. (2024). Uncertainty limits the use of power analysis. Journal of Experimental Psychology: General, 153(4), 1139–1151. https://doi.org/10.1037/xge0001273
==

INPUT DATA. The authors made use of statistics reported in empirical research published in Cognition & Emotion and Emotion. Often, articles might report several studies, and studies would have several models, and models would contain several tests. Thus, there might be a nested structure of tests nested within models, models nested within studies, and studies nested within articles. It does not seem that this nesting is taken into account to provide a good estimate of selection bias and the expected replication rate. Thus, the estimates provided cannot be deemed unbiased (e.g., estimates would be biased toward articles that tend to report a lot of statistics compared to others).

As the authors admit, there is no separation of statistical tests used for manipulation checks, preliminary analyses, or tests of competing and alternative hypothesis. Given that the sampling of the statistics might not be representative of key findings in emotion research, little confidence can be placed in the accuracy of the estimates reported and the strong claims being made using them (about emotion research in general). 
Finally, the authors excluded chi-square tests with degrees of freedom larger than 6. This would mean that tests of independence with designs larger than a 2×2 contingency table would be excluded (or tests of independence with 6 categories). In general, the authors need to be careful on what conditions their conclusions apply to.

UNSUBSTANTIATED CONCLUSIONS. The key conclusions made by the authors are that there is selection bias in emotion research, and there is a success rate of 70% in replication studies. These conclusions are made from z-curve analysis, in which I question the validity of. My concerns of the z-curve procedure has to do with ontological errors made about the probability attached to the concept of power, the rationale for z-transformations on p-values (along with strange distributional gymnastics with little justification provided in the original paper), and equating power with replication.

Even if the z-curve is valid, the performance of z-curve should be better evaluated to show that they apply to the conditions of the data used in the current study. Furthermore, data quality used in z-curve analysis in terms of selection criteria (e.g., excluding tests for manipulation checks, etc.) and modeling the nested structure inherent in reported results would go a long way in ensuring that the estimate provided is as unbiased as can be.

Finally, it seems odd to conclude selection bias based on data with selection bias. There might be some tautology going on within the argument. An analogy about missing data might help. Given a set of data in which we assume had undergone selection (i.e., part of the distribution is missing), how can we know from the data what is missing? The only way to talk about the missing part of the distribution is to assume a distribution for the “full” data that subsumes the observed data distribution. But who can say that the assumed distribution is the correct one that would have generated the full data? Our selected data does not have the features to let us infer what the full distribution should be. How can we know what we observe has undergone selection bias without knowledge of the selection process (cf. distribution of the full data) unless some implicit assumption is made. We are not given the assumption and therefore cannot evaluate whether the assumption is valid. I cannot tell what assumptions z-curve makes about selection.