Carter et al. (2019): Simulation #136

In this series of blog posts, I am reexamining Carter et al.’s (2019) simulation studies that tested various statistical tools for meta-analysis. The reexamination has three purposes. First, I examine how well methods that are designed to detect bias actually do so. Second, I examine the ability of methods that estimate heterogeneity to provide good estimates of it. Third, I examine the ability of different methods to estimate the average effect size for three different sets of effect sizes, namely (a) all studies, including effects in the opposite direction than expected, (b) all positive effect sizes, and (c) all positive and significant effect sizes.

The reason to estimate effect sizes for different sets of studies is that meta-analyses can have different purposes. If all studies are direct replications of each other, we would not want to exclude studies that show negative effects. For example, we want to know whether a drug has negative effects in a specific population. However, many meta-analyses in psychology rely on conceptual replications with different interventions and dependent variables. Moreover, many studies are selected to show significant results. In these meta-analyses, the purpose is to find subsets of studies that show the predicted effect with notable effect sizes. In this context, it is not relevant whether researchers tried many other manipulations that did not work. The aim is to find the ones that did work, and studies with significant results are the most likely ones to have real effects. Selection for significance, however, can lead to overestimation of effect sizes and underestimation of heterogeneity. Thus, methods that correct for bias are likely to outperform methods that do not in this setting.

I focus on the selection model implemented in the weightr package in R because it is the only tool that tests for the presence of bias and estimates heterogeneity. I examine the ability of these tests to detect bias when it is present and to provide good estimates of heterogeneity when heterogeneity is present. I also use the model result to compute estimates of the average effect size for all three sets of studies (all, only positive, only positive & significant).

The second method is PET-PEESE because it is widely used. PET-PEESE does not provide conclusive evidence of bias, but a positive relationship between effect sizes and sampling errors suggests that bias is present. It does not provide an estimate of heterogeneity. It is also not clear which set of studies is the target of this method. Presumably, it is the average effect size of all studies, but if very few negative results are available, the estimate overestimates this average. I examine whether it produces a more reasonable estimate of the average of the positive results.

P-uniform is included because it is similar to the widely used p-curve method, but has an R package and produces slightly superior estimates. I am using the LN1MINP method, which is less prone to inflated estimates when heterogeneity is present. P-uniform assumes selection bias and provides a test of bias. It does not test heterogeneity. Importantly, p-uniform uses only positive and significant results. Its performance therefore has to be evaluated against the true average effect size for this subset of studies rather than for all available studies. It does not provide estimates for the set of only positive results (including non-significant ones) or for the set of all studies, including negative results.

Finally, I am using these simulations to examine the performance of z-curve 2.0 (Bartos & Schimmack, 2022). Using exactly the simulations that Carter et al. (2019) used prevents me from simulation hacking; that is, from testing only situations that show favorable results for my own method. I am testing the performance of z-curve in estimating the average power of studies with positive and significant results and the average power of all positive studies. I then use these power estimates to estimate the average effect size for these two sets of studies. I also examine the presence of publication bias and p-hacking with z-curve.

The present blog post examines a scenario with a small effect size, d = .2, in all studies. That is, there is no heterogeneity, tau = 0. This condition is of interest because the selection model did very well with a small effect size and high heterogeneity (tau = .4), but struggled with no effect, d = 0, and homogeneity, tau = 0. One reason for problems with homogeneity is that the default specification of the selection model (3PSM) that was used by Carter et al. (2019) assumes heterogeneity and sometimes fails to converge when the data are homogeneous. This problem can be solved by specifying a fixed-effects model. Here, I examine whether the fixed-effects model produces good estimates of the population effect size, d = .2, in the condition with high selection bias and high p-hacking (simulation #136).

Simulation

I focus on a sample size of k = 100 studies to examine the properties of confidence intervals with a large, but not unreasonable set of studies. The effect size is d = .2 and there is no heterogeneity, tau = 0.

Figure 1 shows the distribution of the effect size estimates (extreme values below d = 1.2 and above d = 2 are excluded). Selection bias and p-hacking moves the distribution of effect size estimates to the right. A meta-analysis that ignores bias would overestimate the average population effect size.
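The effect of selection on a naive meta-analytic estimate can be illustrated with a minimal sketch. This is not Carter et al.’s simulation code, and it models only selection for significance (not p-hacking); the per-group sample size and number of studies are hypothetical values:

```python
import random
import math

random.seed(1)
d_true, n_per_group, k = 0.20, 50, 50_000    # hypothetical values
se = math.sqrt(2 / n_per_group)              # approximate SE of d

# tau = 0: every study estimates the same population effect, d = .2
d_obs = [random.gauss(d_true, se) for _ in range(k)]

# Keep only "positive and significant" studies (z = d/se > 1.96)
d_sig = [d for d in d_obs if d / se > 1.96]

naive_all = sum(d_obs) / len(d_obs)          # close to the true .20
naive_sig = sum(d_sig) / len(d_sig)          # inflated by selection
print(round(naive_all, 2), round(naive_sig, 2))
```

Averaging all studies recovers the true value, whereas averaging only the significant ones produces a much larger estimate, which is the rightward shift visible in the figure.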


Selection Model (weightr)

The most important question is whether the selection model underestimates a small effect size and might fail to reject the false null hypothesis in this scenario. That was not the case. The average effect size estimate was d = .20, average 95%CI = .18 to .23, and 93% of the confidence intervals included the true value of d = .20. None of the confidence intervals included zero. Thus, the false null hypothesis was rejected in all simulations.

When non-significant results were available to estimate the selection weight for non-significant results, the model always showed evidence for selection bias; that is, the 95% confidence interval did not include a value of 1.

The model also correctly identified p-hacking. The average selection weight for just significant results with p-values between .05 and .01 (two-sided) was 3.38 with an average 95% confidence interval ranging from 2.15 to 4.62.

In short, the selection model performed well and was able to detect a small effect size in this simulation without bias.

PET-PEESE

PET regresses effect sizes on the corresponding sampling errors. The average estimate was d = .09, average 95%CI = .07 to .12. This shows a bias to underestimate the true value, and none of the confidence intervals included the true parameter, d = .2. However, all confidence intervals rejected the null hypothesis. In this case, the PET-PEESE approach recommends using the PEESE results to avoid underestimation.

As all confidence intervals rejected the null hypothesis, the PET-PEESE approach recommends the use of the PEESE regression of effect sizes on sampling variances. The average PEESE estimate was close to the true value, d = .22, average 95%CI = .20 to .24, but only 44% of the confidence intervals included the true value of d = .2. However, all confidence intervals correctly rejected the false null hypothesis.
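The two regressions can be sketched as weighted least-squares fits with inverse-variance weights. This is a toy illustration, not the implementation used in the simulations; the sample sizes, seed, and number of studies are made up:

```python
import random
import math

def wls_intercept(xs, ys, ws):
    """Weighted least-squares regression of y on x; returns the intercept."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    b1 = (sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
          / sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs)))
    return ybar - b1 * xbar

# Hypothetical biased sample: only significant studies survive selection
random.seed(2)
studies = []
for _ in range(500):
    n = random.randint(20, 200)              # hypothetical per-group n
    se = math.sqrt(2 / n)
    d = random.gauss(0.20, se)               # true d = .2, tau = 0
    if d / se > 1.96:                        # selection for significance
        studies.append((d, se))

d, se = zip(*studies)
w = [1 / s ** 2 for s in se]
pet = wls_intercept(se, d, w)                        # PET: regress d on SE
peese = wls_intercept([s ** 2 for s in se], d, w)    # PEESE: regress d on SE^2
print(round(pet, 2), round(peese, 2))
```

Both intercepts fall below the inflated naive average of the selected studies; PET extrapolates along the sampling error and PEESE along the sampling variance, which is why PEESE typically corrects less aggressively.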

In short, PET-PEESE works well in this simulation, but it did not perform better than the selection model. The simulation also did not reproduce the notable difference between the two methods that was observed in Carter et al.’s (2019) meta-analysis of ego depletion, where the average effect size estimates were -.27 for PET-PEESE and .33 for the selection model.

P-Uniform

P-uniform detected bias in 100% of the simulations, just like the selection model. However, p-uniform was not as good at estimating the true effect size. The average estimate was d = .07, average 95%CI = -.11 to .19, and only 44% of the confidence intervals included the true value. This poor performance can be explained by the influence of p-hacking.
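The logic of p-uniform can be sketched with a simple moment-based variant: under the true effect size, the conditional p-values of the significant studies are uniformly distributed, so one can search for the delta that makes their mean equal .5. This is a toy estimator, not the LN1MINP method used here, and all sample sizes are hypothetical:

```python
import random
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cond_p(z, mu, z_crit=1.959964):
    # Conditional p-value of a significant z-score, given truncation at z_crit
    return (1 - Phi(z - mu)) / (1 - Phi(z_crit - mu))

def puniform_estimate(zs, ses, lo=0.0, hi=1.0):
    # Bisection for the delta at which conditional p-values average .5
    # (search range [0, 1] assumes a non-negative true effect)
    def gap(delta):
        return sum(cond_p(z, delta / se) for z, se in zip(zs, ses)) - len(zs) / 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Pure selection (no p-hacking): the estimator recovers d close to .2
random.seed(3)
zs, ses = [], []
while len(zs) < 300:
    n = random.randint(20, 200)              # hypothetical per-group n
    se = math.sqrt(2 / n)
    d = random.gauss(0.20, se)               # true d = .2, tau = 0
    if d / se > 1.959964:
        zs.append(d / se)
        ses.append(se)
print(round(puniform_estimate(zs, ses), 2))
```

Under p-hacking, just-significant z-scores are overrepresented, which makes the conditional p-values cluster near 1 and pulls this kind of estimate downward, consistent with the underestimation reported above.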

Z-Curve

The estimated average power of studies with positive and significant results was 15%, average 95%CI = 4% to 28%, and 76% of confidence intervals included the true value of 23%. The results show that p-hacking leads to an underestimation of the true power.

The estimated average power of studies with positive results was 7%, average 95%CI = 5% to 18%, and only 19% of confidence intervals included the true value of 23%. Thus, p-hacking has an even stronger downward bias for this estimate.

Despite the bias in the power estimate, the average effect size estimate for studies with positive and significant results was unbiased, d = .21, average 95%CI = .03 to .32, and 99% of the confidence intervals included the true value of d = .20.
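The link between power and effect size that this conversion relies on can be sketched by inverting the power function of a two-sample z-test. This is a simplified one-study sketch with a hypothetical per-group n, not the actual z-curve conversion, which aggregates over the full distribution of power across studies:

```python
from statistics import NormalDist

nd = NormalDist()

def d_from_power(power, n_per_group, alpha=0.05):
    # Invert power = 1 - Phi(z_crit - d/se) for a two-sample z-test:
    # d = se * (z_crit + Phi^{-1}(power))
    z_crit = nd.inv_cdf(1 - alpha / 2)       # 1.96 for alpha = .05
    se = (2 / n_per_group) ** 0.5            # approximate SE of d
    return se * (z_crit + nd.inv_cdf(power))

# With n = 50 per group, a power of about 17% maps back to d = .2
print(round(d_from_power(0.1685, 50), 2))
```

The mapping also shows why a downward-biased power estimate need not translate into an equally biased effect size estimate: near the significance threshold, large changes in power correspond to relatively small changes in d.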

In contrast, the estimate for the set of studies with positive results had a downward bias, average d = .09, average 95%CI = .00 to .25.

Like the selection model and p-uniform, z-curve also detected bias in all simulations. It also detected p-hacking in 47% of all simulations, but this performance is not as good as the performance of the selection model that identified p-hacking in all simulations.

Conclusion

The main conclusion from this specific simulation study is that the selection model performed well and outperformed all other methods. While PET-PEESE produced a good average estimate, the 95% confidence interval included the true value only 44% of the time. Thus, this simulation offers no justification for the use of PET-PEESE over the selection model. The selection model also has the advantage that it provides an estimate of heterogeneity and correctly detected selection and p-hacking. PET (but not PET-PEESE), p-uniform, and z-curve also showed a downward bias that is explained by the assumption of these models that excessive significant results are due to selection bias. When excessive significant results are produced by p-hacking, just significant results are overrepresented and lead to an underestimation of the average effect size. The effect is mild in these simulations, but Carter et al.’s (2019) high p-hacking condition still assumes rather mild p-hacking. It is therefore important to examine the performance of these models with more intense p-hacking.
