In this series of blog posts, I am reexamining Carter et al.’s (2019) simulation studies that tested various statistical tools for conducting meta-analyses. The reexamination has three purposes. First, I examine how well methods that are designed to detect bias actually do so. Second, I examine how well methods that estimate heterogeneity recover the true amount of heterogeneity. Third, I examine how well different methods estimate the average effect size for three different sets of effect sizes, namely (a) all studies, including effects in the direction opposite to the predicted one, (b) all studies with positive effect sizes, and (c) all studies with positive and significant effect sizes.
The reason to estimate effect sizes for different sets of studies is that meta-analyses can have different purposes. If all studies are direct replications of each other, we would not want to exclude studies that show negative effects. For example, we would want to know whether a drug has negative effects in a specific population. However, many meta-analyses in psychology rely on conceptual replications with different interventions and dependent variables. Moreover, many studies are selected to show significant results. In these meta-analyses, the purpose is to find subsets of studies that show the predicted effect with notable effect sizes. In this context, it is not relevant whether researchers tried many other manipulations that did not work. The aim is to find the ones that did work, and studies with significant results are the most likely ones to reflect real effects. Selection for significance, however, can lead to overestimation of effect sizes and underestimation of heterogeneity. Thus, methods that correct for this bias are likely to outperform methods that do not in this setting.
I focus on the selection model implemented in the weightr package in R because it is the only tool that both tests for the presence of bias and estimates heterogeneity. I examine its ability to detect bias when bias is present and to provide good estimates of heterogeneity when heterogeneity is present. I also use the model results to compute estimates of the average effect size for all three sets of studies (all, only positive, only positive & significant).
The second method is PET-PEESE because it is widely used. PET-PEESE does not provide a conclusive test of bias, but a positive relationship between effect sizes and sampling errors suggests that bias is present. It does not provide an estimate of heterogeneity. It is also not clear which set of studies is the target of this method. Presumably, it is the average effect size of all studies, but if very few negative results are available, the estimate overestimates this average. I therefore also examine whether it produces a more reasonable estimate of the average of the positive results.
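The conditional logic of PET-PEESE is easy to express in R. The sketch below is a hedged illustration rather than the code used in these simulations; the vectors d (observed effect sizes) and se (their standard errors) are placeholders, and the decision rule (switch to PEESE when the one-sided PET test of the intercept is significant at .05) is one common convention.

```r
# Minimal PET-PEESE sketch (an illustration, not the simulation code).
# d = observed effect sizes, se = their standard errors (placeholders).
pet_peese <- function(d, se) {
  v <- se^2
  pet <- lm(d ~ se, weights = 1 / v)          # PET: regress d on standard error
  b0  <- summary(pet)$coefficients["(Intercept)", ]
  if (b0["Estimate"] > 0 && b0["Pr(>|t|)"] / 2 < .05) {
    # PET intercept significant (one-sided): report the PEESE intercept instead
    peese <- lm(d ~ v, weights = 1 / v)       # PEESE: regress d on variance
    unname(coef(peese)["(Intercept)"])
  } else {
    unname(b0["Estimate"])
  }
}
```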
P-uniform is included because it is similar to the widely used p-curve method, but it has an R package and produces slightly superior estimates. I am using the LN1MINP method, which is less prone to inflated estimates when heterogeneity is present. P-uniform assumes selection bias and provides a test of bias. It does not test heterogeneity. Importantly, p-uniform uses only positive and significant results. Its performance therefore has to be evaluated against the true average effect size of this subset of studies rather than of all available studies. It does not provide estimates for the set of only positive results (including non-significant ones) or for the set of all studies, including negative results.
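A hedged sketch of such an analysis with the puniform package is shown below; the yi and vi vectors are placeholders, and the argument names should be checked against the package documentation before relying on them.

```r
# Hedged sketch of a p-uniform analysis (argument names as I recall them from
# the puniform package; check ?puniform::puniform for the exact interface).
library(puniform)

# yi = observed effect sizes, vi = their sampling variances (placeholders);
# side = "right" because positive effects are predicted
res <- puniform(yi = yi, vi = vi, side = "right", method = "LN1MINP")
res  # prints the effect size estimate, its CI, and the publication bias test
```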
Finally, I am using these simulations to examine the performance of z-curve 2.0 (Bartos & Schimmack, 2022). Using exactly the simulations that Carter et al. (2019) used prevents me from simulation hacking; that is, from testing only situations that show favorable results for my own method. I am testing the ability of z-curve to estimate the average power of studies with positive and significant results and the average power of all positive studies. I then use these power estimates to estimate the average effect size for these two sets of studies. I also examine the presence of publication bias and p-hacking with z-curve.
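A minimal z-curve sketch with the zcurve R package follows; the vector z of test statistics is a placeholder, and the comments paraphrase what the summary output reports.

```r
# Minimal z-curve sketch with the zcurve package (Bartos & Schimmack, 2022);
# z is a placeholder vector of absolute z-statistics (e.g., qnorm(1 - p/2)).
library(zcurve)

fit <- zcurve(z)
summary(fit)  # expected replication rate (average power of significant results)
              # and expected discovery rate (average power of all tests)
plot(fit)     # fitted z-curve against the observed distribution of z-statistics
```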
The present blog post examines a scenario with a small effect size, d = .2, in all studies. That is, there is no heterogeneity, tau = 0. This condition is of interest because the selection model did very well with a small effect size and high heterogeneity (tau = .4), but struggled with a null effect, d = 0, and homogeneity, tau = 0. One reason for problems with homogeneity is that the default specification of the selection model (3PSM) that was used by Carter et al. (2019) assumes heterogeneity and sometimes fails to converge when the data are homogeneous. This problem can be solved by specifying a fixed effects model. Here, I examine whether the fixed effects model produces good estimates of the population effect size, d = .2, in the condition with high selection bias and high p-hacking (simulation #136).
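A hedged sketch of how such a model can be fit with weightr is shown below; the effect and v vectors are placeholders, and the choice of cut points is my own illustration rather than the exact specification used in these simulations.

```r
# Hedged sketch of the selection model with the weightr package (not the exact
# code used here). 'effect' holds the observed effect sizes and 'v' their
# sampling variances; both are placeholders.
library(weightr)

# Cut points are one-sided p-values: .005 and .025 create separate selection
# weights for just-significant results (two-sided p between .05 and .01) and
# for non-significant results. fe = TRUE requests the fixed effects version
# that avoids the convergence problems of the default 3PSM when tau = 0.
fit <- weightfunct(effect = effect, v = v, steps = c(.005, .025, 1), fe = TRUE)
fit  # adjusted estimate, selection weights, and a likelihood-ratio test of bias
```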
Simulation
I focus on a sample size of k = 100 studies to examine the properties of confidence intervals with a large, but not unreasonable, set of studies. The effect size is d = .2 and there is no heterogeneity, tau = 0. This corresponds to Carter et al.’s (2019) condition of high p-hacking and no selection bias.
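The sketch below illustrates the data-generating logic in simplified form; it is not Carter et al.’s (2019) p-hacking algorithm, and the per-group sample size and the specific p-hacking strategy (testing up to three dependent variables and stopping at the first significant one) are my own assumptions.

```r
# Simplified sketch of the scenario (d = .2, tau = 0, k = 100 studies); this is
# NOT Carter et al.'s (2019) p-hacking procedure, only a simple illustration.
set.seed(136)
k <- 100; d <- .2; n <- 50           # n per group is an assumption

simulate_study <- function() {
  for (dv in 1:3) {                  # try up to three dependent variables
    x <- rnorm(n, mean = d); y <- rnorm(n, mean = 0)
    p <- t.test(x, y, alternative = "greater")$p.value
    if (p < .05) break               # stop as soon as one DV "works"
  }
  c(d_obs = (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2), p = p)
}

studies <- t(replicate(k, simulate_study()))
head(studies)                        # observed (p-hacked) effect sizes and p-values
```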
Figure 1 shows the distribution of the effect size estimates (extreme values below d = -1.2 and above d = 2 are excluded). P-hacking …
Selection Model (weightr)
The most important question is whether the selection model underestimates the small effect size and might fail to reject the false null hypothesis in this scenario. P-hacking did reduce the effect size estimate, average d = .12, average 95%CI = .06 to .17, and only 20% of confidence intervals included the true value of d = .20. However, 90% of the confidence intervals excluded 0 and thus rejected the null hypothesis that the effect size is zero.
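For readers who want to reproduce such summaries, coverage and rejection rates can be computed across simulation runs as in the sketch below, where ci_lb and ci_ub are placeholder vectors holding the confidence limits from each simulated meta-analysis.

```r
# Hedged sketch: summarizing confidence intervals across simulation runs.
# ci_lb and ci_ub are placeholder vectors of lower and upper 95% CI bounds.
true_d    <- .20
coverage  <- mean(ci_lb <= true_d & ci_ub >= true_d)  # share of CIs covering d = .20
rejection <- mean(ci_lb > 0)                          # share of CIs excluding zero
```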
All simulations had sufficient non-significant results to estimate selection bias. The average selection weight for non-significant results was .16, average 95%CI = .01 to .31, and all confidence intervals excluded a value of 1. Thus, the model correctly noticed that there were too few non-significant results.
The model also tested p-hacking by examining the selection weight for just significant results with p-values between .05 and .01 (two-sided). The average weight was 1.56, average 95%CI = 1.40 to 1.71, and none of the confidence intervals included 1. Thus, the model detected p-hacking, but it was not able to fully correct the effect size estimate accordingly.
In short, the selection model performed relatively well, but it underestimated the true effect size by d = .08 and failed to reject the false null hypothesis in 10% of the simulations. Thus, other models have a chance to outperform the selection model in this scenario.
PET-PEESE
PET regresses effect sizes on the corresponding sampling errors. The average estimate was d = .11, average 95%CI = .04 to .19, and only 37% of confidence intervals included the true value of d = .20. Moreover, only 75% of confidence intervals did not include 0. Thus, 25% of the confidence intervals failed to reject the null hypothesis.
In the 75% of simulations in which PET rejected the null hypothesis, underestimation by PET is not a problem because the PET-PEESE approach then uses the estimate from the PEESE regression of effect sizes on the sampling variances as the effect size estimate. The average PEESE estimate was d = .17, average 95%CI = .13 to .22, and 80% of confidence intervals included the correct value.
In sum, PET-PEESE outperforms the selection model in terms of effect size estimation because the PEESE estimates are closer to the true value, but the selection model outperforms PET-PEESE in terms of statistical power because it rejects the false null hypothesis more often (90% vs. 75%).
Despite these differences, it is also noteworthy that the two models show very similar results. Thus, p-hacking with small effect sizes cannot explain why Carter et al.’s (2019) meta-analysis of the ego-depletion literature showed very different results for the selection model (d = .3) and PET-PEESE (d = -.3).
P-Uniform
P-uniform detected bias in only 87% of the simulations. Thus, it was not as sensitive as the selection model, and it cannot distinguish between p-hacking and selection bias.
Like the other models, p-uniform underestimated the true effect size, average d = .06, average 95%CI = -.20 to .21. Only 64% of the confidence intervals included the true value of d = .2. More problematic was the finding that only 13% of the confidence intervals did not include zero. Thus, p-hacking and small effect sizes lead to many false negative results. Clearly, p-uniform is not a useful method in this scenario.
Z-Curve
The estimated average power of studies with positive and significant results was 14%, average 95% CI = 4% to 30%, and 84% of confidence intervals included the true value of 22%. The results show that p-hacking leads to an underestimation of the true power.
The estimated average power of studies with positive results was 8%, average 95%CI = 5% to 20%, and only 49% of confidence intervals included the true value of 20%. Thus, p-hacking has an even stronger downward bias for this estimate.
Despite the bias in the power estimate, the average effect size estimate for studies with positive and significant results was unbiased, average d = .19, average 95%CI = .02 to .34, and 99% of the confidence intervals included the true value of d = .20.
In contrast, the estimate for the set of studies with positive results had a downward bias, average d = .10, average 95%CI = .00 to .26, and only 77% of confidence intervals included the true value of d = .2.
Like the selection model and p-uniform, z-curve also detected bias; it did so in all simulations. It was not able to test for p-hacking because there were too few z-scores greater than 2.6.
Conclusion
The main conclusion from this specific simulation study is that the selection model performed well. Although the PET-PEESE approach produced better effect size estimates when PET rejected the null hypothesis and the PEESE estimates were used, it had a higher risk of false negative results when PET failed to reject the null hypothesis. This is an important finding because Carter et al. (2019) favored the PET result over the selection model to conclude that the true ego-depletion effect is zero. The present results suggest that a non-significant result with PET and a significant positive estimate with the selection model might be the result of p-hacking with small effect sizes and underestimation of the average effect by PET.
Carter et al.’s (2019) simulation of “high” p-hacking is also not very extreme. As Carter et al. (2019) noted, more extreme p-hacking may lead to different results. The present simulation suggests that more extreme p-hacking would produce a more severe downward bias and a high rate of false negative results. They recommended further research, but, to my knowledge, such simulations are still lacking.