Preliminary Rating by ChatGPT: 9/10 (ChatGPT is American and overly positive)
Summary of Article
Summary of Carter et al. (2019): “Correcting for Bias in Psychology: A Comparison of Meta-Analytic Methods”
Carter et al. (2019) conducted a comprehensive simulation study to evaluate how well various meta-analytic methods perform under conditions common in psychological research, including publication bias and questionable research practices (QRPs). They compared seven estimators: traditional random-effects (RE) meta-analysis, trim-and-fill, WAAP-WLS, PET-PEESE, p-curve, p-uniform, and the three-parameter selection model (3PSM), across 432 simulated scenarios that varied in effect size, heterogeneity, number of studies, and severity of bias.
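To make the setup concrete, here is a minimal sketch (in Python, not the authors' code) of what one such simulation condition might look like under a simple publication-bias rule, with a DerSimonian-Laird random-effects estimate as the naive benchmark. The effect size, heterogeneity, study count, and selection rule are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
delta, tau, k, n = 0.2, 0.2, 30, 40   # true effect, heterogeneity, studies, n per group

def simulate_study():
    """One two-group study: standardized mean difference d and its variance."""
    theta_i = rng.normal(delta, tau)                    # study-specific true effect
    g1 = rng.normal(0.0, 1.0, n)
    g2 = rng.normal(theta_i, 1.0, n)
    sd_pooled = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)
    d = (g2.mean() - g1.mean()) / sd_pooled
    v = 2 / n + d**2 / (4 * n)                          # approximate sampling variance of d
    p = 2 * stats.norm.sf(abs(d) / np.sqrt(v))          # two-sided p (normal approximation)
    return d, v, p

# Crude publication-bias rule: keep significant studies, plus 10% of the rest.
studies = []
while len(studies) < k:
    d, v, p = simulate_study()
    if p < 0.05 or rng.random() < 0.10:
        studies.append((d, v))
d_vals, v_vals = map(np.array, zip(*studies))

# DerSimonian-Laird random-effects estimate (the standard RE benchmark).
w = 1 / v_vals
mu_fixed = np.sum(w * d_vals) / np.sum(w)
q = np.sum(w * (d_vals - mu_fixed) ** 2)
tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
w_re = 1 / (v_vals + tau2)
mu_re = np.sum(w_re * d_vals) / np.sum(w_re)
print(f"true delta = {delta}, RE estimate from the biased sample = {mu_re:.2f}")
```

With selection like this, the pooled RE estimate typically lands well above the true effect, which is the inflation pattern the paper documents for RE and trim-and-fill.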
Their key finding is that no method performs well under all conditions, and each has vulnerabilities depending on the presence and nature of bias and heterogeneity. Standard RE meta-analysis, trim-and-fill, and WAAP-WLS often show severe upward bias and high false-positive rates when publication bias is present. P-curve and p-uniform are unbiased under homogeneity but become increasingly biased under heterogeneity. PET-PEESE and 3PSM generally have better Type I error control and reduced bias, though they may suffer from lower power and occasional underestimation in the presence of QRPs.
Carter et al. do not recommend any single method. Instead, they argue for sensitivity analysis informed by a method performance check, where analysts compare results from multiple estimators but weigh them based on their expected performance under plausible research conditions. They also stress the limitations of meta-analysis in biased literatures and urge increased reliance on preregistered, high-powered primary studies.
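To illustrate what such a sensitivity analysis might look like in practice, the sketch below (illustrative Python with made-up effect sizes and variances, not the authors' code) contrasts a naive inverse-variance pooled mean with PET and PEESE intercepts. Divergence among the estimates is the kind of signal the authors would have analysts interpret in light of each method's expected performance under plausible conditions.

```python
import numpy as np

# Hypothetical observed effects and sampling variances from one literature.
d_vals = np.array([0.45, 0.38, 0.52, 0.30, 0.61, 0.25, 0.48, 0.55, 0.33, 0.41])
v_vals = np.array([0.050, 0.030, 0.080, 0.020, 0.100, 0.015, 0.060, 0.090, 0.025, 0.040])

def wls_intercept(x, y, w):
    """Weighted least-squares regression of y on x; returns the intercept."""
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]

w = 1 / v_vals
naive = np.sum(w * d_vals) / np.sum(w)                  # inverse-variance pooled mean
pet = wls_intercept(np.sqrt(v_vals), d_vals, w)         # PET: regress d on its standard error
peese = wls_intercept(v_vals, d_vals, w)                # PEESE: regress d on its variance
print(f"naive = {naive:.2f}, PET = {pet:.2f}, PEESE = {peese:.2f}")
```

When small-study effects are present (larger effects in noisier studies, as in these made-up data), the regression-based intercepts fall below the naive pooled mean; the question Carter et al. raise is how much weight each estimate deserves given the heterogeneity and bias believed to operate in that literature.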
Their results support a shift from viewing meta-analysis as definitive toward a more cautious, multi-method strategy for synthesizing evidence in psychology.
ChatGPT Review
Summary
This article by Carter et al. (2019) presents a comprehensive and methodologically rigorous simulation study comparing the performance of seven meta-analytic methods under varying conditions of bias and heterogeneity. The central contribution is an even-handed evaluation of random-effects (RE) meta-analysis, trim-and-fill, WAAP-WLS, PET-PEESE, p-curve, p-uniform, and the three-parameter selection model (3PSM) across 432 simulation conditions defined by combinations of true effect size, heterogeneity, number of studies, publication bias, and questionable research practices (QRPs).
Strengths
- Comprehensive Simulation Design: The simulation covers a wide range of plausible research scenarios, making it more generalizable than most prior comparisons of bias-correction methods.
- Nuanced Evaluation Criteria: The use of multiple performance metrics (bias, RMSE, Type I/II error, coverage) provides a multi-faceted picture of method strengths and weaknesses (a sketch of these metrics follows this list).
- Practical Guidance: Rather than naming a best method, the authors advocate for sensitivity analysis, combining multiple estimators with method performance checks based on plausible assumptions.
- Transparency and Open Science: Code and interactive figures are provided, aligning with current norms for reproducibility.
- Thoughtful Discussion of Limits: The authors openly acknowledge assumptions in the data-generating model (e.g., 2-group designs, specific QRP implementations) and suggest how researchers can adapt the framework to new contexts.
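As a concrete gloss on the evaluation-criteria point above, the following sketch shows how bias, RMSE, and confidence-interval coverage could be computed from hypothetical simulation replications; Type I/II error rates would be computed analogously as rejection rates. The numbers are placeholders, not values from the paper.

```python
import numpy as np

def performance(estimates, ci_low, ci_high, true_delta):
    """Bias, RMSE, and CI coverage of one method across simulation replications."""
    estimates = np.asarray(estimates)
    bias = estimates.mean() - true_delta
    rmse = np.sqrt(np.mean((estimates - true_delta) ** 2))
    coverage = np.mean((np.asarray(ci_low) <= true_delta) & (true_delta <= np.asarray(ci_high)))
    return bias, rmse, coverage

# Hypothetical replications for one condition with true delta = 0.2.
est = np.array([0.31, 0.28, 0.35, 0.22, 0.40])
print(performance(est, est - 0.10, est + 0.10, true_delta=0.2))
```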
Weaknesses
- Overlong and Dense: The article is extremely long and technically detailed, which may deter non-methodologists. A clearer separation of main findings and technical appendices might help.
- Limited Handling of p-hacking Complexity: Although their QRP implementation is more elaborate than most, it still doesn’t capture all real-world practices (e.g., undisclosed covariate tinkering or HARKing).
- P-curve and p-uniform Ambiguity: The treatment of p-curve and p-uniform as estimators of δ may be controversial, especially given their original purpose as tests of evidential value rather than estimators of population effect sizes.
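For readers unfamiliar with how p-uniform turns a set of significant results into an estimate of δ, here is a minimal, normal-approximation sketch of the idea behind that last point: conditional on significance, the probability of the observed effect is uniform at the true δ, so δ is chosen to make the conditional probabilities behave uniformly. The data are hypothetical and the implementation is simplified relative to the published method.

```python
import numpy as np
from scipy import stats, optimize

d_obs = np.array([0.45, 0.52, 0.61, 0.48, 0.55])   # significant effects only
se = np.array([0.20, 0.22, 0.25, 0.21, 0.23])
d_crit = stats.norm.ppf(0.975) * se                 # significance cutoffs

def fisher_statistic(delta):
    # q_i = P(D > d_obs | D > d_crit) under D ~ N(delta, se^2)
    q = stats.norm.cdf((delta - d_obs) / se) / stats.norm.cdf((delta - d_crit) / se)
    # Sum of -ln(q_i) has expectation K when the q_i are uniform (i.e., at the true delta).
    return np.sum(-np.log(q)) - len(d_obs)

delta_hat = optimize.brentq(fisher_statistic, -1.0, 2.0)
print(f"p-uniform style estimate of delta ≈ {delta_hat:.2f}")
```

Whether a procedure like this should be read as an estimator of a population effect, rather than as a test of evidential value, is exactly the ambiguity flagged above.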
Contribution to the Field
This article is among the most careful and exhaustive evaluations of bias-correction methods in meta-analysis available in the psychology literature. Its key contribution is not only its findings about method performance (e.g., RE and trim-and-fill dramatically inflate false positives under bias), but also its principled recommendation for sensitivity analysis weighted by method performance under plausible conditions. This approach is both pragmatic and epistemically humble—a strong counterweight to naïve meta-analytic inference.