Peer review is the foundation of science. Peer reviewers work hard to evaluate manuscripts and decide whether they are worthy of publication, especially in old-fashioned journals with strict page limitations. Their hard work often goes unnoticed because peer reviews remain unpublished. This is a shame. A few journals have recognized that science might benefit from publishing reviews. Not all reviews are worthy of publication, but when a reviewer spends hours, if not days, writing a long and detailed commentary, it seems only fair to share the fruits of their labor in public. Unfortunately, I am not able to give credit to Reviewer 1, who was too modest or shy to share their name. This does not diminish the value they created, and I hope the reviewer may find the courage to take credit for their work.
Reviewer 1 was asked to review a paper that used z-curve to evaluate the credibility of research published in the leading emotion journals. Yet, going beyond the assigned task, Reviewer 1 provided a detailed and thorough review of the z-curve method itself, exposing the deep flaws of this statistical method that had been missed by the reviewers of the articles that promoted this dangerous and misleading tool. After a theoretical deep dive into the ontology of z-curve, Reviewer 1 acknowledges that simulation studies seem to have validated the method. Yet, Reviewer 1 was quick to notice that the simulations were a sham, designed to show that z-curve works rather than to see it fail in applications to more realistic data. Deeply embarrassed, my co-authors, including a Professor of Statistics, are now contacting journals to retract our flawed articles.
Please find the damaging review of z-curve below.
P.S. We are also offering a $200 reward for credible simulation studies that demonstrate that z-curve is crap.
P.P.S. Some readers seem to have missed the sarcasm and taken the criticism by Reviewer 1 seriously. The problem is a lack of expertise to evaluate the conflicting claims. To make it easy, I share an independent paper that validated z-curve against actual replication outcomes. I am not sure how Reviewer 1 would explain the positive outcome. Maybe we hacked the replication studies, too?
Röseler, L. (2023). Predicting replication rates with z-curve: A brief exploratory validation study using the Replication Database. MetaArXiv, ewb2t. Center for Open Science.
ANONYMOUS, July 17, 2024
Referee: 1
Comments to the Author
The manuscript “Credibility of results in emotion science: A z-curve analysis of results in the journals Cognition & Emotion and Emotion” (CEM-DA.24) presents results from a z-curve analysis of reported statistics (t-tests, F-tests, and chi-square tests with df < 6, and 95% confidence intervals) for empirical studies (excluding meta-analyses) published in Cognition & Emotion from 1987 to 2023 and Emotion from 2001 to 2023. The purposes of reporting results from a z-curve analysis are to (a) estimate selection bias in emotion research and (b) predict a success rate in replication studies.
I have strong reservations about the conclusions drawn by the authors, which do not seem to be strongly supported by their reported results. Specifically, I am not confident that conclusions from z-curve results justify the statements made in the paper under review. Below, I outline my main concerns, which center on the z-curve methodology; unfortunately, this makes the review more a commentary on Brunner and Schimmack (2020) than on the current paper itself.
VAGUE METHODOLOGY. The authors make strong claims about what conclusions can be drawn from z-curve analyses. Their presentation of z-curve analysis in the present paper is declarative and does not provide the information necessary to describe the assumptions the method makes, how it works, when it fails, etc. The authors cite previous publications on z-curve (Brunner & Schimmack, 2020; Bartos & Schimmack, 2022; Schimmack & Bartos, 2023). Furthermore, this work ignores recent criticism in the literature about such statistical forensics. One example questioning the validity of conclusions from tests of credibility/replicability (e.g., p-curve, Francis’s [2013] consistency test) is a talk by Richard Morey titled “Statistical games: Flawed thinking of popular methods for assessing reproducibility” (https://www.youtube.com/watch?v=c0G98qp1cf4). The talk was based on Morey (2013). Other authors who have written on this topic include McShane, Böckenholt, and Hansen (2020) and Pek, Hoisington-Shaw, and Wegener (2022).
==
Morey, R. D. (2013). The consistency test does not–and cannot–deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology, 57(5), 180–183. https://doi.org/10.1016/j.jmp.2013.03.004
McShane, B. B., Böckenholt, U., & Hansen, K. T. (2020). Average power: A cautionary note. Advances in Methods and Practices in Psychological Science, 3(2), 185–199. https://doi.org/10.1177/2515245920902370
Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153–169. https://doi.org/10.1016/j.jmp.2013.02.003
Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2022). Avoiding questionable research practices surrounding statistical power analysis. In W. O’Donohue, A. Masuda, & S. Lilienfeld (Eds.), Avoiding Questionable Practices in Applied Psychology (pp. 243–267). Springer. https://doi.org/10.1007/978-3-031-04968-2_11
==
In reading Brunner and Schimmack (2020), z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis (whether it be for a single study or for a set of studies; see Pek, Hoisington-Shaw, & Wegener, in press for a treatment of this misconception).
It should be noted that power is *not* a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data), represented by a sampling distribution. Thus, what one estimates from a completed study is not really “power” with the properties of a frequentist probability, even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should *always* be about the design of future studies, because power is the probability that a test rejects the null hypothesis over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which calls into question the validity of z-curve analysis results.
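To make the distinction concrete, here is a minimal sketch in Python (scipy); the effect size, sample size, and alpha are illustrative assumptions, not values from the paper. Power here is a prospective property of a design: it is fully determined before any data are collected.

```python
import numpy as np
from scipy import stats

def two_sample_t_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test for an assumed Cohen's d.

    Power depends only on design inputs (d, n, alpha), not on observed data.
    """
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    # probability of landing in either rejection region under the alternative
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# planning a future study: assumed d = 0.5, n = 50 per group
print(round(two_sample_t_power(0.5, 50), 2))  # roughly 0.70
```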
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
O’Keefe, D. J. (2007). Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: Sorting out appropriate uses of statistical power analyses. Communication Methods and Measures, 1(4), 291–299. https://doi.org/10.1080/19312450701641375
==
In Brunner and Schimmack (2020), there is a problem with Theorem 1, which “states that success rate and mean power are equivalent even if the set of coins is a subset of all coins.” Here, a coin flip with a binary outcome is the process used to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size. Methodological papers on power analysis using estimated effect sizes show that the uncertainty due to sampling variability is extremely high (e.g., see Anderson et al., 2017; McShane & Böckenholt, 2016); it is worse when effects are random (cf. random-effects meta-analysis; see McShane, Böckenholt, & Hansen, 2020; Pek, Pitt, & Wegener, 2024). Accepting that effects are random seems more consistent with what we observe in empirical results on the same topic. The extent of uncertainty in power estimates (based on observed effects) is so high that little can be concluded from such imprecise calculations.
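The point about uncertainty can be illustrated with a small simulation sketch (Python/scipy; the true effect size and sample size are arbitrary assumptions). Plugging the observed effect size from a single completed study back into the power formula yields an “observed power” that varies enormously across replications of the same design.

```python
import numpy as np
from scipy import stats

def two_sample_t_power(d, n_per_group, alpha=0.05):
    """Design-based power of a two-sided, two-sample t-test for Cohen's d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

rng = np.random.default_rng(1)
true_d, n = 0.3, 50                      # assumed population effect and group size
observed_power = []
for _ in range(5_000):                   # 5,000 hypothetical completed studies
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(true_d, 1.0, n)
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    d_hat = (y.mean() - x.mean()) / pooled_sd
    observed_power.append(two_sample_t_power(d_hat, n))

print("true power:", round(two_sample_t_power(true_d, n), 2))
print("2.5% / 50% / 97.5% observed power:",
      np.round(np.quantile(observed_power, [0.025, 0.5, 0.975]), 2))
```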
Coming back to p-values, these statistics have their own distribution (which cannot be derived unless the effect size is null, in which case the p-value follows a uniform distribution). However, because p-values have sampling variability (and an unknown sampling distribution), one cannot take a significant p-value to deterministically indicate a tally toward power (which assumes that an unknown specific effect size is true). Stated differently, a significant p-value can be consistent with a Type I error. Now, if the counterargument is that z-curve does not require an effect size input to calculate power, then I am not sure what z-curve calculates, because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).
There seems to be some conceptual slippage on the meaning of power here because what the authors call power does not seem to have the defining features of power.
The problem with Theorem 2 in Brunner and Schimmack (2020) is the assumption of some distribution of power (across all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic. There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figures 1 and 2 in Brunner and Schimmack, 2020, respectively).

If the counterargument here is that these issues are avoided by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution, because a z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores. P-values are statistics and follow a sampling distribution; the variance of the sampling distribution is a function of sample size. So it is odd to assume that p-values transformed to z-scores would have a standard error of 1 according to the z-distribution. If the further argument is that a mixture of z-distributions is used to estimate the distribution of the z-scores, then these z-scores are not technically z-scores, in that they are not distributed following the z-distribution. We might estimate the standard error of the mixture of z-distributions to rescale the distribution again to a z-distribution… but to what end? Again, there is some conceptual slippage in what is meant by a z-score. If the distribution of p-values that have been transformed to z-scores is not a z-distribution, and the mixture distribution is then shaped back into a z-distribution (with truncations that seem arbitrary) so that the critical value of 1.96 can be used – I am not sure what the resulting distribution is of, anymore.

A related point is that we do not yet know whether p-values are transformation invariant (in distribution) under a z-score transformation. Furthermore, the distribution of power invoked in Theorem 1 is not a function of sample size, effect size, or statistical procedure, suggesting that the assumed distribution does not align well with the features that we know influence power. It is unclear how Theorem 2 is related to the z-curve procedure. Again, there seems to be some conceptual slippage involved with p-values being transformed into z-scores that somehow give us an estimate of power (without stating the effect size, sample size, or procedure).
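For readers who want to see the transformation at issue, here is a minimal simulation sketch (Python/scipy) under assumptions chosen purely for illustration: a one-sample z-test with known variance, the simplest possible case. Two-sided p-values are mapped to absolute z-scores via z = Φ⁻¹(1 − p/2), and one can then inspect the distribution of those transformed values under repeated sampling of the same design; for t or chi-square statistics the corresponding behavior is only approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, mu, sigma = 50, 0.4, 1.0            # assumed design: one-sample z-test, known sigma
n_studies = 100_000
ncp = mu * np.sqrt(n) / sigma          # noncentrality of the test statistic

# simulate the same design many times and record two-sided p-values
z_stat = rng.normal(ncp, 1.0, n_studies)          # test statistic ~ N(ncp, 1)
p = 2 * stats.norm.sf(np.abs(z_stat))

# p-to-z transformation: two-sided p-value -> absolute z-score
z_from_p = stats.norm.isf(p / 2)

print("noncentrality:", round(ncp, 2))
print("mean of transformed z-scores:", round(z_from_p.mean(), 2))
print("SD of transformed z-scores:", round(z_from_p.std(), 2))   # near 1 in this simple case
```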
In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion that are significant? After all, p-values can be transformed to z-scores and vice versa, in that they carry the same information. But then there is the problem that p-values have sampling variability and might be consistent with Type I errors. A transformation from p to z will not fix sampling variability.
To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error about the meaning of frequentist probability. A thought experiment might help. Suppose I complete a study and the p-value is .50. I convert this p-value to a z-score for a two-tailed test and get 0.67. Let’s say I collect a bunch of studies, do this, and get a distribution of z-scores (that do not end up being distributed as z). I do a bunch of things to make this distribution become a z-distribution. Then I define power as the proportion of z-scores above the cutoff of 1.96. We are now calling “power” a collection of z-scores above 1.96 (without controlling for sample size, effect size, or procedure). This newly defined “power” based on the z-distribution does not reflect the original definition of power (the area under the curve for a specific effect size, a specific procedure, and a specific sample size, assuming a Type I error rate of .05). This conceptual slippage is akin to burning a piece of wood, putting the ashes into a mold that looks like wood, and calling the molded ashes wood.
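The conversion used in the thought experiment can be checked directly; a minimal sketch in Python (scipy):

```python
from scipy import stats

p = 0.50
z = stats.norm.isf(p / 2)                  # two-tailed p-value to absolute z-score
print(round(z, 2))                         # 0.67, as in the thought experiment
print(round(stats.norm.isf(0.05 / 2), 2))  # 1.96, the significance cutoff
```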
The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state that “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size, which they admit cannot be met in practice. Furthermore, recall that power is a property of a procedure and not a property of completed data (cf. the ontological error); thus, using observed power to quantify replicability presents replicability as a property of a procedure rather than as a statement about the robustness of an observed effect. Again, there seems to be some conceptual slippage here between what is meant by replication and what quantifies replication (which should not be observed power).
The basis for supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed in the simulation and does not allow the procedure to be refuted in any way (cf. Popper’s idea of refutation as the basis of science). In a simulation study, one assumes that the underlying process generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value-generating process assumed in the simulation study matches that of empirical data. Stated differently, models of phenomena are fallible, and so we seek evidence to refute and corroborate these models. The simulation in support of z-curve does not put z-curve to the test but uses a model consistent with z-curve (absent empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, based on data simulated from model A.
Further, the evidence that z-curve performs well is specific to the assumptions of the simulation study. If p-values were generated in a different way, reflecting a competing tentative process, the performance of z-curve would differ. The simulation study examined the performance of z-curve in constrained scenarios, including F-tests with df = 1, and not for the combination of t-tests and chi-square tests applied in the current study. I am not sure what to make of z-curve’s performance for the data used in the current paper, because the simulation study provides no evidence of its performance under these unexplored conditions.
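For readers unfamiliar with how different test statistics end up on a common scale, here is a small sketch (Python/scipy; the statistic values and degrees of freedom are made up) of one way to route t and chi-square results through their p-values to absolute z-scores, the kind of conversion the reviewer is referring to:

```python
from scipy import stats

# hypothetical reported results
p_t   = 2 * stats.t.sf(2.30, df=28)    # t(28) = 2.30, two-sided p-value
p_chi = stats.chi2.sf(9.20, df=3)      # chi-square(3) = 9.20

for label, p in [("t-test", p_t), ("chi-square", p_chi)]:
    z = stats.norm.isf(p / 2)          # common absolute z-score scale
    print(f"{label}: p = {p:.4f}, z = {z:.2f}")
```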
==
Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724
Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33(2), 175–183. https://doi.org/10.1111/j.1469-8986.1996.tb02121.x
McShane, B. B., & Böckenholt, U. (2016). Planning sample sizes when effect sizes are uncertain: The power-calibrated effect size approach. Psychological Methods, 21(1), 47–60. https://doi.org/10.1037/met0000036
Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (in press). Uses of uncertain statistical power: Designing future studies, not evaluating completed studies. Psychological Methods. https://www.researchgate.net/publication/368358276_Uses_of_uncertain_statistical_power_Designing_future_studies_not_evaluating_completed_studies
Pek, J., Pitt, M. A., & Wegener, D. T. (2024). Uncertainty limits the use of power analysis. Journal of Experimental Psychology: General, 153(4), 1139–1151. https://doi.org/10.1037/xge0001273
==
INPUT DATA. The authors made use of statistics reported in empirical research published in Cognition & Emotion and Emotion. Articles often report several studies, studies contain several models, and models contain several tests. Thus, there is a nested structure: tests nested within models, models nested within studies, and studies nested within articles. It does not seem that this nesting is taken into account to provide a good estimate of selection bias and the expected replication rate. Thus, the estimates provided cannot be deemed unbiased (e.g., estimates would be biased toward articles that report many statistics compared to others).
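As a purely hypothetical illustration of the nesting concern (not a description of the authors' procedure), one simple option would be to keep a single randomly chosen test per article so that statistic-heavy articles do not dominate the estimate; a sketch with made-up data:

```python
import pandas as pd

# made-up coded data: one row per reported test, tagged with its source article
tests = pd.DataFrame({
    "article_id": ["a1", "a1", "a1", "a2", "a3", "a3"],
    "z":          [2.10, 3.40, 2.00, 2.50, 1.98, 4.20],
})

# keep one randomly selected test per article to blunt the nesting problem
one_per_article = (
    tests.sample(frac=1, random_state=0)   # shuffle rows
         .drop_duplicates("article_id")    # first (random) test per article
         .sort_values("article_id")
)
print(one_per_article)
```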
As the authors admit, there is no separation of statistical tests used for manipulation checks, preliminary analyses, or tests of competing and alternative hypotheses. Given that the sampling of statistics might not be representative of key findings in emotion research, little confidence can be placed in the accuracy of the reported estimates or in the strong claims made from them (about emotion research in general).
Finally, the authors excluded chi-square tests with degrees of freedom larger than 6. This would mean that tests of independence with designs larger than a 2×2 contingency table would be excluded (or tests of independence with six categories). In general, the authors need to be careful about the conditions to which their conclusions apply.
UNSUBSTANTIATED CONCLUSIONS. The key conclusions made by the authors are that there is selection bias in emotion research and that there is a 70% success rate in replication studies. These conclusions are drawn from a z-curve analysis, the validity of which I question. My concerns about the z-curve procedure have to do with ontological errors about the probability attached to the concept of power, the rationale for z-transformations of p-values (along with strange distributional gymnastics with little justification provided in the original paper), and the equating of power with replication.
Even if z-curve is valid, its performance should be better evaluated to show that it applies to the conditions of the data used in the current study. Furthermore, attention to the quality of the data used in the z-curve analysis, in terms of selection criteria (e.g., excluding tests for manipulation checks) and modeling the nested structure inherent in the reported results, would go a long way toward ensuring that the estimates provided are as unbiased as possible.
Finally, it seems odd to conclude that there is selection bias based on data with selection bias. There might be some tautology within the argument. An analogy to missing data might help. Given a set of data that we assume has undergone selection (i.e., part of the distribution is missing), how can we know from the data what is missing? The only way to talk about the missing part of the distribution is to assume a distribution for the “full” data that subsumes the observed data distribution. But who can say that the assumed distribution is the correct one that would have generated the full data? Our selected data do not have the features to let us infer what the full distribution should be. How can we know that what we observe has undergone selection bias without knowledge of the selection process (cf. the distribution of the full data), unless some implicit assumption is made? We are not given the assumption and therefore cannot evaluate whether it is valid. I cannot tell what assumptions z-curve makes about selection.
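The missing-data analogy can be made concrete with a small simulation sketch (Python/scipy; the “full” distribution, its mean, and the 1.96 cutoff are assumptions of the illustration). Once only values above a cutoff survive, the share that is missing can only be recovered by positing a parametric family for the full data, which is exactly the kind of assumption the reviewer is asking about.

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

rng = np.random.default_rng(3)
full = rng.normal(loc=1.0, scale=1.0, size=100_000)   # hypothetical "full" distribution
observed = full[full > 1.96]                          # only values above the cutoff survive

def truncated_mean(mu, cut=1.96, sigma=1.0):
    """Mean of a Normal(mu, sigma) truncated below at `cut`."""
    a = (cut - mu) / sigma
    return mu + sigma * stats.norm.pdf(a) / stats.norm.sf(a)

# ASSUMPTION: the full data are normal with sigma = 1; without some such
# parametric assumption the missing portion is not identifiable from the tail.
mu_hat = brentq(lambda m: truncated_mean(m) - observed.mean(), -5.0, 5.0)
print("estimated mean of the full distribution:", round(mu_hat, 2))       # near the true 1.0
print("implied share below the cutoff:", round(stats.norm.cdf(1.96, loc=mu_hat), 2))
```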
I asked ChatGPT to write a “harsh critique” of z-curve. It made some fundamental mistakes, but it is no worse than the BS by Reviewer 1.
Question: Can you write a harsh critique of Brunner and Schimmack’s z-curve method to estimate post-hoc power?
Critique of Brunner and Schimmack’s Z-Curve Method
The z-curve method, proposed by Brunner and Schimmack, claims to estimate post-hoc statistical power based on the distribution of significant p-values in published research. While innovative in its approach, the method is plagued by several critical shortcomings that undermine its validity and reliability.
Reliance on Published Data: The z-curve method fundamentally depends on the distribution of published p-values, which are notoriously subject to publication bias. Studies with non-significant results are less likely to be published, leading to an overrepresentation of significant p-values. This skewed dataset renders the z-curve estimates inherently unreliable as they do not accurately reflect the true distribution of study results.
Assumption of Homogeneity: The method assumes that the studies included in the analysis are homogeneous in terms of their true effect sizes and power. This is a highly unrealistic assumption, as research studies often vary significantly in their design, sample sizes, and underlying true effects. Such heterogeneity can severely distort the z-curve estimates, making them meaningless.
Complexity and Transparency: The z-curve method involves complex statistical modeling that is not easily understandable or transparent to most researchers. This lack of transparency makes it difficult to scrutinize the methodology, verify the results, or apply the method correctly without advanced statistical expertise.
Sensitivity to P-Hacking: Given the prevalence of questionable research practices such as p-hacking, the distribution of p-values used in the z-curve method may be artificially manipulated. Researchers often engage in practices that inflate the number of significant results, further biasing the post-hoc power estimates. The z-curve method does not adequately address or correct for these pervasive issues.
Misleading Precision: The z-curve method provides point estimates of post-hoc power with confidence intervals, giving a false impression of precision and accuracy. However, the aforementioned biases and assumptions seriously undermine these estimates, potentially misleading researchers and policymakers about the actual power of the studies analyzed.
Overemphasis on Statistical Significance: By focusing solely on significant p-values, the z-curve method perpetuates the problematic emphasis on statistical significance rather than the practical or theoretical importance of research findings. This approach risks promoting a narrow view of research quality, ignoring the broader context and implications of the studies.
In conclusion, while Brunner and Schimmack’s z-curve method represents an ambitious attempt to estimate post-hoc power, it suffers from critical flaws that significantly limit its utility. The reliance on biased published data, unrealistic assumptions of homogeneity, lack of transparency, vulnerability to p-hacking, misleading precision, and overemphasis on statistical significance all contribute to a method that is more likely to mislead than to provide meaningful insights into the power of scientific studies. Researchers and practitioners should exercise extreme caution when considering the application of the z-curve method and seek more robust and transparent alternatives for assessing research quality and power.
And here is ChatGPT’s response to its own criticism. Sounds pretty convincing to me, but what is the right answer?
Can you write a response to criticisms of Brunner and Schimmack’s z-curve method to estimate post-hoc power of studies with significant results?
Response to Criticisms of Brunner and Schimmack’s Z-Curve Method
The z-curve method developed by Brunner and Schimmack has attracted considerable attention and, naturally, some criticism. While it is important to critically evaluate new methodologies, many of the critiques leveled against the z-curve method either misunderstand its purpose or fail to recognize its strengths and contributions to the field. Here, we address several common criticisms:
Reliance on Published Data:
Critics argue that the z-curve method is undermined by publication bias, given its reliance on published p-values. However, the method explicitly accounts for this bias by modeling the distribution of significant results. The z-curve does not merely rely on the observed p-values but uses them to estimate the underlying distribution, offering a more nuanced view than traditional meta-analyses. This approach helps to correct for the overrepresentation of significant findings, providing a more accurate estimate of true effect sizes and power.
Assumption of Homogeneity:
While some critics claim that the z-curve method assumes homogeneity among studies, this is a misunderstanding. The z-curve does not require all studies to have identical effect sizes or power. Instead, it models a distribution of z-scores, allowing for heterogeneity in the data. The method’s flexibility in handling diverse studies is one of its strengths, accommodating variations in study designs, sample sizes, and effect sizes.
Complexity and Transparency:
It is true that the z-curve method involves sophisticated statistical techniques, which may appear complex. However, complexity should not be conflated with opacity. The z-curve methodology is well-documented, and the statistical models used are accessible to those with advanced training in statistics. Additionally, tools and software implementations have been developed to aid researchers in applying the method correctly, enhancing transparency and usability.
Sensitivity to P-Hacking:
The concern about p-hacking is valid, but it is not unique to the z-curve method. The z-curve actually provides a robust framework to detect and account for such biases. By estimating the distribution of z-scores and assessing the extent of publication bias, the method can reveal patterns consistent with p-hacking and other questionable research practices. This capability is a significant advantage over traditional power estimation methods, which often overlook these issues.
Misleading Precision:
Critics argue that the z-curve provides misleading precision in its estimates. However, the z-curve method offers confidence intervals around its estimates, reflecting the uncertainty inherent in any statistical analysis. These intervals provide a range within which the true power is likely to fall, offering a realistic assessment rather than an illusion of exact precision. This feature enhances the method’s reliability by explicitly acknowledging the uncertainty in power estimation.
Overemphasis on Statistical Significance:
While the z-curve focuses on significant p-values, this focus is intentional and methodologically justified. The method aims to evaluate the post-hoc power of studies that have reported significant results, addressing a specific and important question in the literature. By concentrating on these studies, the z-curve provides insights into the reliability and robustness of reported findings, helping to identify potential overestimation of effects due to selective reporting.
In conclusion, Brunner and Schimmack’s z-curve method represents a significant advancement in the estimation of post-hoc power. It addresses critical issues such as publication bias and p-hacking, offering a more accurate and nuanced assessment of study power than traditional methods. While no method is without limitations, the z-curve’s strengths and innovative approach make it a valuable tool for researchers. Rather than dismissing it based on misunderstood or overstated criticisms, the research community should recognize its potential and apply it judiciously to enhance the robustness and transparency of scientific findings.