Pre-publication peer review is the biggest scam in the history of science. Researchers have to pass an evaluation by “peers” before an editor (often a mindless vote-counter of the opinions shared in reviews) decides whether the article deserves to be published. Arbitrary limits on the number of publications give reviewers the power to censor by writing extremely negative reviews and recommending rejection. This faulty system of quality control is maintained by publishers so they can resell publicly funded research to publicly funded universities at fat profit margins. In short, we would be better off publishing everything and letting open post-publication reviews decide what is good and what is not. Until then, researchers have to battle anonymous reviewers in a Kafkaesque, closed and hidden process, at the mercy of dark forces that decide their fate.
I have encountered this many times myself, but right now I am writing to help researchers who encounter Reviewer 2 when they use z-curve to examine the credibility of research in their field. Interestingly, I am never asked to review z-curve articles, although I am the expert on this method, but Reviewer 2 is asked to comment on the method and is allowed to make outrageously false claims. If this happens to you, please feel free to contact me to write a rebuttal, or just point the editor to this blog post.
You can also ask AI to evaluate the quality of Reviewer 2’s arguments and my rebuttals. I did so myself, and both ChatGPT and Perplexity concluded that my counterarguments are scientifically accurate and valid. But don’t trust me: trust is good, fact-checking is better.
Reviewer 2
1. The authors primarily conceptualize replicability in terms of retrospective (or post-hoc) average power (also known as the “expected discovery rate” / EDR).
Rebuttal:
This statement is false and shows this reviewer’s limited knowledge of the method they criticize. Z-curve estimates two percentages. The first is the percentage of significant results that would be expected if the studies used in the meta-analysis were reproduced exactly and analyzed exactly the same way with the same sample sizes. This is the Expected REPLICATION Rate (ERR). The second is the percentage of significant results that would be expected among all studies that were conducted, including the non-significant results that may or may not have been reported. This is the Expected Discovery Rate (EDR; discovery = p < .05). The reviewer confuses ERR and EDR.
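To make the distinction concrete, here is a minimal simulation sketch (my own toy example, not the z-curve algorithm itself; the distribution of true effects and the number of studies are hypothetical). It shows that the EDR is the average true power of all conducted studies, whereas the ERR is the average true power of the subset that produced significant results, which is why ERR ≥ EDR once there is selection for significance.

```python
# Toy sketch of the EDR/ERR distinction (not the z-curve estimation algorithm).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
crit = norm.ppf(0.975)                                 # two-sided alpha = .05 -> |z| > 1.96

# Hypothetical population of studies with heterogeneous true non-centrality values.
ncp = rng.gamma(shape=2.0, scale=1.0, size=100_000)
power = norm.sf(crit - ncp) + norm.cdf(-crit - ncp)    # true P(|z| > 1.96) for each study

edr = power.mean()                                     # mean power over ALL conducted studies
sig = rng.uniform(size=ncp.size) < power               # which studies produce p < .05 ("discoveries")
err = power[sig].mean()                                # mean power of the studies with significant results

print(f"EDR (all studies):         {edr:.2f}")
print(f"ERR (significant studies): {err:.2f}")         # ERR >= EDR: selection favors high-power studies
```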
2. Average power is a meta-analytic analogue of single study post hoc power. Single study post hoc power has been greatly lampooned for many decades now (Hoenig & Heisey, 2001; Yuan & Maxwell, 2005). For example, Greenland (2012) writes that post hoc power computed from completed studies is: “Irrelevan[t]: Power refers only to future studies done on populations that look exactly like our sample with respect to the estimates from the sample used in the power calculation; for a study as completed (observed), it is analogous to giving odds on a horse race after seeing the outcome.” In addition, average power is not relevant to the replicability of actual prospective replication studies. As McShane, Bockenholt, and Hansen (2020) write: “Average power is relevant to replicability if and only if replication is defined in terms of statistical significance within the classical frequentist repeated sampling framework. As this framework is both purely hypothetical and ontologically impossible, average power is not relevant to the replicability of actual prospective replication studies.”
Rebuttal:
All of these comments are irrelevant and rest on confusion about the term power. The classic definition treats power as the probability of obtaining a significant result given a hypothesized effect size (the alternative hypothesis). This definition is irrelevant for the ERR and EDR, which are determined by the true population effect sizes of the studies (and sampling error), not by hypothetical values that are no longer relevant once actual data are available.
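In symbols (my own notation, not the manuscript’s), the contrast can be sketched as follows, where pi_k denotes the true power of study k and the ERR weights each study by its probability of producing a significant result:

```latex
% Classical power: defined for a hypothesized effect size, before data exist.
\mathrm{Power}(\theta_{\mathrm{hyp}}) = \Pr\big(p < .05 \mid \theta = \theta_{\mathrm{hyp}}\big)

% EDR and ERR: averages of the true power values \pi_k = \Pr(p_k < .05 \mid \theta_k)
% of the K studies that were actually conducted.
\mathrm{EDR} = \frac{1}{K}\sum_{k=1}^{K} \pi_k
\qquad
\mathrm{ERR} = \frac{\sum_{k=1}^{K} \pi_k^{2}}{\sum_{k=1}^{K} \pi_k}
```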
The criticism of post-hoc power is also irrelevant because it concerns the interpretation of results in a single study, not a meta-analysis of many studies.
Finally, McShane et al.’s article makes two mistakes. First, it uses the term power for empirical estimates, although power is defined in terms of hypothetical values. Second, the article relied on sets of 30 studies to claim that estimates are imprecise, but precision increases with the number of studies. This article had over 100 studies, and the precision of the estimates is clearly specified with a 95% confidence interval. Thus, the uncertainty of the results can and should be evaluated with the actual results and not on the basis of an article that did not examine z-curve estimates.
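The point that precision grows with the number of studies can be illustrated with a toy simulation (this is not z-curve itself, which quantifies uncertainty with bootstrapped confidence intervals; the true rate of 50% is a hypothetical value):

```python
# Toy illustration of why estimates based on K = 30 studies are much noisier
# than estimates based on K = 100+ studies: the spread of a rate estimate
# shrinks roughly like 1/sqrt(K).
import numpy as np

rng = np.random.default_rng(7)
true_rate = 0.5                        # hypothetical true discovery rate

for K in (30, 100, 300):
    # 5,000 simulated meta-analyses, each observing K studies
    estimates = rng.binomial(K, true_rate, size=5_000) / K
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    print(f"K = {K:>3}: 95% of estimates fall in [{lo:.2f}, {hi:.2f}]")
```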
3. Pek et al (2022) also note ontological concerns with average power. Pek et al (2024) further note that (as per the present authors’ approach) “using power for evaluating completed studies can be counterproductive.”
Rebuttal
Pek et al.’s criticism is about studies that compute post-hoc power based on the definition of power as a hypothetical construct. This criticism does not apply to z-curve, which estimates expected values based on true population effect sizes, not statistical power as defined by Pek et al. Pek et al. also did not discuss z-curve as a method to estimate expected discovery rates or expected replication rates. So, the article is irrelevant to the evaluation of z-curve.
4. While I have thus far focused on the primary manner in which the authors conceptualize replicability (i.e., average power / EDR), exactly the same concerns apply to the secondary manner (i.e., the “expected replication rate” / ERR).
Rebuttal
The same rebuttal holds for the ERR. It is not an estimate of average power as defined by Pek et al., because it estimates the true probability of significant results in exact replication studies, whereas Pek et al. define power as a hypothetical construct. Estimating the ERR is not wrong; calling it power is. The terms EDR and ERR therefore make it clear that these estimates are not estimates of average power in the classic sense of statistical power. So, this criticism does not address z-curve estimates and their validity.
5. Rosenthal was a pioneer studying replication in psychology. Drawing on his work dating from the 1960s, Rosenthal (1990) dismissed evaluations of replicability that are dichotomous and based on significance testing as “the traditional, not very useful view of replication” and advocated evaluations of replicability that are continuous and based on effect sizes as “the newer, more useful view of replication.” The authors’ approach in this paper is dichotomous and based on significance testing and thus falls squarely in what Rosenthal thirty-five years ago today already termed “the traditional, not very useful view of replication.”
Rebuttal
Rosenthal made contributions to effect size meta-analysis. They are useful and important when researchers want to combine results from several close or direct replications to estimate the population effect size. The aim of this article is different. Science-wide estimates of EDR and ERR can provide useful information for the interpretation of individual studies that lack multiple replications that could be meta-analyzed. They can also quantify the typical amount of publication bias in a literature and inform the planning of future studies. In short, effect-size meta-analysis is important. So is knowing the amount of publication bias, the replicability, and the false positive risk in a field of studies. Effect size meta-analyses do not provide this information.
Rosenthal was also responsible for a faulty way to assess publication bias in meta-analysis (fail-safe N) that suggested publication bias is not a big problem in meta-analyses. Z-curve, however, can estimate the actual amount of publication bias in a literature and has shown massive publication bias and a high false positive risk in literatures with hundreds of studies. For example, z-curve showed that a Nobel Laureate had picked priming studies for his bestseller “Thinking, Fast and Slow” that had a false positive risk of 100%. He openly distanced himself from the researchers who had published these results and were unwilling to back up their claims with actual replication studies. In this example, the average effect size of these different studies was not important. What mattered was that the studies failed to provide credible evidence that social priming works.
6. “It is therefore not surprising that a common finding among replication projects is that unbiased replication studies with larger sample sizes produce much smaller effect sizes. For instance, the ### replication project found that 88% of the replication effect sizes were severely inflated in comparison to the original effect sizes, with a median percentage decrease of 75%.” As can be seen, the ### replication project takes a continuous quantitative view based on effect sizes, reporting that the median decrease in the effect size estimates was 75% and going on to characterize the full distribution of effect size differentials in Figures 1 and 2 of that paper. I do not find the present authors’ retrospective and dichotomous approach based on significance testing to be an advance over the ### replication project’s prospective and continuous approach based on effect sizes. Indeed, I view it as retrograde.
Rebuttal
Reviewer 2 does not stop for one second to explain why effect size estimates shrank by about 75%. The z-curve analysis shows why: the original studies reported inflated effect size estimates because studies with large sampling error require large estimates to produce significant results. The actual replication results cannot show that this is the reason, but the z-curve analysis of the original studies can, because it estimates how replicable these studies are in the hypothetical scenario that they are replicated exactly with a new sample. The argument also ignores that effect size estimates are rarely used to interpret results. Most of the time, the key claim is a rejection of the null hypothesis in a specific direction. This conclusion is not altered by smaller effect sizes, but it is altered when the result is no longer significant; then the original conclusion no longer holds.
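The selection mechanism is easy to demonstrate with a minimal simulation sketch (my own illustration with hypothetical numbers, not an analysis of the studies in this article): when the true effect is small and samples are small, only over-estimates clear the significance threshold, so the published estimates are inflated.

```python
# Minimal sketch of effect-size inflation under selection for significance.
# All numbers (true effect, sample size) are hypothetical.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
true_d, n_per_group = 0.2, 20                  # small true effect, small samples
se = np.sqrt(2 / n_per_group)                  # approximate standard error of Cohen's d
crit = norm.ppf(0.975) * se                    # two-sided .05 threshold on the d scale

d_hat = rng.normal(true_d, se, size=100_000)   # estimates from many hypothetical studies
published = d_hat[d_hat > crit]                # only significant results in the predicted direction are reported

print(f"true effect:             {true_d:.2f}")
print(f"mean published estimate: {published.mean():.2f}")
print(f"inflation factor:        {published.mean() / true_d:.1f}x")
```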
7. Even for those who prefer a dichotomous approach based on significance testing, when such is applied to the sports science replication project, we get a result similar to the present authors’ result (see middle of page 12 of their manuscript). Therefore, in a very important sense, the present authors’ result is already known (or at least cannot be said to be novel).
Rebuttal
This comment by the reviewer shows once more their lack of understanding of science and a lack of awareness of the methodological discussion about the importance of replication studies, which by definition lack novelty. Ironically, they applaud a replication project and then criticize a replication study for being unoriginal. The proper comparison of the actual replications and the z-curve analysis is this: the two projects used different methods on different sets of studies and produced consistent results. The novel finding is that in this literature both estimates converge on the same conclusion. When two different methods with different data show consistent results, it provides evidence that the results are not driven by sampling error (e.g., actual replication studies picked studies with easy and cheap designs) or methodological biases (e.g., replication studies produce weaker effects because the replication researchers are not experts in that field). In short, consistent results provide valuable information. Novelty is important for original studies, not for meta-analyses that assess how many of the novel original findings are actually findings and how many may be false positives.
8. The authors’ use the forensic Z-curve meta-analytic procedure of Brunner & Schimmack (2020) and Bartos & Schimmack (2022). On page 3 of their manuscript, they note that they could use the forensic P-curve meta-analytic procedure of Simonsohn, Nelson, and Simmons instead. In a forthcoming Journal of the American Statistical Association paper, Morey and Davis-Stober provide a formal analysis that proves that the P-curve has poor statistical properties. For example, they prove that the P-curve produces inconsistent estimates of average power / EDR. One might question the relevance of this to the Z-curve and thus the present manuscript. I quote the final paragraph of Morey and Davis-Stober:
“As a final point, we suggest that meta-scientists be more skeptical of procedures like the P-curve in the meta-scientific literature. Papers introducing them are often light on statistical exposition, using metaphors [and] a few simulations to make sweeping arguments. Simulation is a powerful tool and can help build intuition, but it is not a substitute for formal analysis. Simulation may provide hints of problems with a procedure, but only if the simulator’s formal knowledge helps guide the choice of simulations. A simulator might quit after running a few simulations that tell them what they think is true while problems remain uncovered. Given the implications of poor forensic procedures for science, all such procedures demand deeper formal scrutiny.”
This forthcoming paper is extremely relevant to the present manuscript because the very paragraph above could be written about the Z-curve.
Rebuttal
In a legal trial, this witness would be held in contempt. They are simply lying. Brunner and Schimmack (2020) directly compared p-curve and z-curve and showed that p-curve fails when data are heterogeneous, as they typically are and as they are in this article (heterogeneity: ERR > EDR; homogeneity: ERR = EDR). Schimmack and Brunner have also written several subsequent criticisms of p-curve. Morey and Davis-Stober’s article adds to these criticisms, and the p-curve authors have not defended their method against them. So, yes, p-curve was an attempt to estimate the true power of a set of studies, but it failed.
It is ridiculous to imply that we can just take any criticism of p-curve and apply it to a fundamentally different method. P-curve was not evaluated with simulation studies. Z-curve has been evaluated with hundreds of simulation studies and performs well with typical data sets, including data like those in this article. The convergence between the results of the actual replication project and the z-curve predictions, which the Reviewer used to claim a “lack of novelty,” is also relevant here. If z-curve were flawed, why does it produce estimates that are validated with actual replication outcomes?
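The heterogeneity diagnostic mentioned above (heterogeneity: ERR > EDR; homogeneity: ERR = EDR) can be checked with a small toy simulation (my own sketch with hypothetical distributions, not the Brunner & Schimmack analyses), using the standard weighting of each study’s true power by its probability of producing a significant result:

```python
# Toy check of the diagnostic: with homogeneous power ERR ~ EDR,
# with heterogeneous power ERR > EDR. Distributions are hypothetical.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
crit = norm.ppf(0.975)

def edr_err(ncp):
    power = norm.sf(crit - ncp) + norm.cdf(-crit - ncp)  # true power of each study
    edr = power.mean()                                   # average power before selection
    err = (power ** 2).sum() / power.sum()               # average power after selection for significance
    return edr, err

homogeneous   = np.full(50_000, 1.5)                     # every study has the same non-centrality
heterogeneous = rng.gamma(2.0, 1.0, size=50_000)         # non-centrality varies across studies

for label, ncp in [("homogeneous", homogeneous), ("heterogeneous", heterogeneous)]:
    edr, err = edr_err(ncp)
    print(f"{label:>13}: EDR = {edr:.2f}, ERR = {err:.2f}")
```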
9. Turning back to this manuscript and its use of the Z-curve, in short, we at present know next to nothing about the statistical properties of the Z-curve (just as we knew next to nothing about the statistical properties of the P-curve until Morey and Davis-Stober came along). The statistical properties of the Z-curve may be as poor or worse than those of the P-curve. Or they may be solid. We simply cannot say. Morey and Davis-Stober write: “Given the stated purpose of the P-curve—evaluating the trustworthiness of scientific literatures—the stakes are too high to use tests with such poor, or poorly-understood, properties.” The same applies to the Z-curve which has the same stated purpose. As a consequence, I remain very skeptical of any use of the Z-curve until its properties have been investigated formally and shown not to be wanting—especially given the very high stakes involved.
Rebuttal
I had an email discussion with Davis-Stober; he was not aware of z-curve and knows nothing about it. He simply does not think it is useful to estimate publication bias, but that is his personal opinion, not a criticism of a method that estimates it.
10. You refer to these four quantities as “parameters” but they are not parameters. The word parameter has a formal definition within the context of a statistical model and these do not qualify. These are outputs or estimands but not parameters.
No Rebuttal
That is correct. EDR and ERR are estimates of population parameters, not parameters themselves. “Estimands” is a fancy new word that few psychologists use; the word estimates is good enough. ODR, EDR, ERR, and FDR are estimates of population parameters. Correcting this mistake does not change anything substantial about the results.
11. You assert (arguably rather blithely) that the Z-curve’s independence assumption is met in your analysis because only one p-value per study is included in the analysis. This is of course not necessarily true. If, for example, the 269 studies share authors or sets of authors, that could induce dependence. There are of course many additional sources of possible dependence. One simply cannot say.
Rebuttal
This is simply false. The independence assumption is about the sampling errors of studies, and each new sample has a new sampling error. If all studies used z-tests and had the same effect size and sample size, the sampling error of the z-values would have a standard deviation of 1. When studies are heterogeneous, there is additional variation due to real differences in the non-centrality parameters (the location of the normal distribution on the x-axis of z-values) that describe the sampling distributions, but that is irrelevant for z-curve because it makes no assumptions about the distribution of non-centrality parameters. Some studies from one author may be close to z = 0 and those of other authors may be close to z = 3. That is heterogeneity, not dependence of sampling errors. Dependence of sampling errors only arises when analyses are based on the same data set (e.g., correlated dependent variables).
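This distinction can be illustrated with a toy simulation (my own sketch with hypothetical numbers): even when studies cluster by author, the author clusters only shift the non-centrality parameters; the sampling errors remain independent draws with a standard deviation of 1.

```python
# Toy sketch: z = ncp + sampling error, with a fresh error ~ N(0, 1) for each study.
# Authors cluster on similar non-centrality parameters (heterogeneity), but the
# sampling errors stay independent. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(42)
n_authors, studies_per_author = 500, 10

author_ncp = rng.uniform(0.0, 3.0, size=n_authors)     # each author's typical non-centrality
ncp = np.repeat(author_ncp, studies_per_author)
errors = rng.normal(0.0, 1.0, size=ncp.size)           # fresh sampling error per study
z = ncp + errors

print(f"SD of z-values:       {z.std():.2f}   (> 1: heterogeneity)")
print(f"SD of sampling error: {errors.std():.2f}   (= 1: what z-curve assumes)")

# Errors of studies that share an author are uncorrelated: correlate the two
# halves of each author's studies.
half = studies_per_author // 2
e = errors.reshape(n_authors, studies_per_author)
r = np.corrcoef(e[:, :half].mean(axis=1), e[:, half:].mean(axis=1))[0, 1]
print(f"author-level correlation of sampling errors: {r:.2f}   (about 0)")
```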
12. The authors discuss many subjective choices or value judgments as if they were objective. An example that recurs throughout the manuscript is the discussion and use of alpha = 0.05 and power = 0.80. As is well known, any choice of alpha and power reflects a particular tradeoff between the relative costs of Type I errors versus Type II errors. Except in very narrow circumstance where these relative costs can be objectively quantified (e.g. industrial quality control), these relative costs reflect a particular subjective utility (or loss) function. This subjective function will in turn vary by context or even by different people working within the same context (Neyman, 1977). This is why some have made calls for researchers to “justify their alpha” and power in light of their subjective preferences and idiosyncratic research contexts (see, for example, Lakens et al, 2018). It would be helpful if the authors discussed a range of possible (alpha, power) pairs. Alternatively, if they believe (alpha = 0.05, power = 0.80) are objectively justified in their setting, please state that and argue in favor of it. This comment applies more broadly to other quantities that the authors tend to suggest are objective (e.g., the percentage of studies with “statistically significant” results, the replication rate, etc.): either recognize the subjectivity involved or justify the values of these quantities that you believe are objectively optimal.
Rebuttal
I do not see how this is related to z-curve. Blaming the authors of this article for the mindless use of alpha = .05, a convention that has been in place since Fisher published his first book with tables that allowed researchers to claim significance at that level, is just another strange and unhinged comment by this Reviewer. In the end, the review revealed nothing but willful ignorance, except for the comment about parameters.