Z-Curve.3.0 Tutorial: Introduction

Links to Additional Resources and Answers to Frequently Asked Questions

Chapters

This post is Chapter 1. The R-code for this chapter can be found on my github:
zcurve3.0/Tutorial.R.Script.Chapter1.R at main · UlrichSchimmack/zcurve3.0
(the picture for this post shows a “finger plot”; you can make your own with the code)

Chapter 2 shows the use of z-curve.3.0 with the Open Science Collaboration Reproducibility Project (Science, 2015) p-values of the original studies.
zcurve3.0/Tutorial.R.Script.Chapter2.R at main · UlrichSchimmack/zcurve3.0

Chapter 3 shows the use of z-curve.3.0 with the Open Science Collaboration Reproducibility Project (Science, 2015) p-values of the replication studies.
zcurve3.0/Tutorial.R.Script.Chapter3.R at main · UlrichSchimmack/zcurve3.0

Chapter 4 shows how you can run simulation studies to evaluate the performance of z-curve for yourself.
zcurve3.0/Tutorial.R.Script.Chapter4.R at main · UlrichSchimmack/zcurve3.0

Chapter 5 uses the simulation from Chapter 4 to compare the performance of z-curve with p-curve, another method that aims to estimate the average power of only the significant results, the quantity z-curve uses to estimate the expected replication rate.
zcurve3.0/Tutorial.R.Script.Chapter5.R at main · UlrichSchimmack/zcurve3.0

Chapter 6 uses the simulation from Chapter 4 to compare the performance of the default z-curve method with a z-curve that assumes a normal distribution of population effect sizes. The simulation highlights the problem of making distribution assumptions. One of the strengths of z-curve is that it does not make an assumption about the distribution of power.
zcurve3.0/Tutorial.R.Script.Chapter6.R at main · UlrichSchimmack/zcurve3.0

Chapter 7 uses the simulation from Chapter 4 to compare the performance of z-curve to a Bayesian mixture model (bacon). The aim of bacon is different, but it also fits a mixture model to a set of z-values. The simulation results show that z-curve performs better than the Bayesian mixture model.
zcurve3.0/Tutorial.R.Script.Chapter7.R at main · UlrichSchimmack/zcurve3.0

Chapter 8 uses the simulation from Chapter 4 to examine the performance of z-curve with t-values from small studies (N = 30). It introduces a new transformation method that performs better than the default method from z-curve.2.0 and it introduces the t-curve option to analyze t-values from small studies with t-distributions.
zcurve3.0/Tutorial.R.Script.Chapter8.R at main · UlrichSchimmack/zcurve3.0

Chapter 9 simulates p-hacking by combining small samples with favorable trends into a larger sample with a significant result (patchwork samples). The simulation covers between-subject two-group designs with varying means and SDs of effect sizes and sample sizes. It also examines the ability of z-curve to detect p-hacking and compares the performance of the default z-curve, which does not make assumptions about the distribution of power, with a z-curve model that assumes a normal distribution of power.
zcurve3.0/Tutorial.R.Script.Chapter9.R at main · UlrichSchimmack/zcurve3.0


Brief ChatGPT Generated Summary of Key Points

What Is Z-Curve?

Z-curve is a statistical tool used in meta-analysis, especially for large sets of studies (e.g., more than 100). It can also be used with smaller sets (as few as 10 significant results), but the estimates become less precise.

There are several types of meta-analysis:

  • Direct replication: Studies that test the same hypothesis with the same methods.
    Example: Several studies testing whether aspirin lowers blood pressure.
  • Conceptual replication: Studies that test a similar hypothesis using different procedures or measures.
    Example: Different studies exploring how stress affects memory using different tasks and memory measures.

In direct replications, we expect low variability in the true effect sizes. In conceptual replications, variability is higher due to different designs.

Z-curve was primarily developed for a third type of meta-analysis: reviewing many studies that ask different questions but share a common feature, like being published in the same journal or during the same time period. In these cases, estimating an average effect size isn’t very meaningful because effects vary so much. Instead, z-curve focuses on statistical integrity, especially the concept of statistical power.

What Is Statistical Power?

I define statistical power as the probability that a study will produce a statistically significant result (usually p < .05).

To understand this, we need to review null hypothesis significance testing (NHST):

  1. Researchers test a hypothesis (like exercise increasing lifespan) by conducting a study.
  2. They calculate the effect size (e.g., exercise increases the average lifespan by 2 years) and divide it by the standard error to get a test statistic (e.g., a z-score).
  3. Higher test statistics imply that the observed result would be improbable if the null hypothesis were true. The null hypothesis is that there is no effect. If this probability is below the conventional criterion of 5%, the finding is interpreted as evidence of an effect.

Power is the probability of obtaining a significant result, p < .05.
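The arithmetic in these three steps can be sketched in a few lines (the tutorial's own scripts are in R; Python is used here purely for illustration):

```python
from statistics import NormalDist

def z_test(effect, se):
    """Return the test statistic and two-sided p-value for an effect estimate."""
    z = effect / se                          # step 2: effect size / standard error
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # step 3: two-sided p-value
    return z, p

# The exercise example: a 2-year gain with a standard error of 1 year.
z, p = z_test(2, 1)
print(round(z, 2), round(p, 3))  # z = 2.0, p ~ 0.046, significant at p < .05
```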

Hypothetical vs. Observed Power

Textbooks often describe power in hypothetical terms. For example, before collecting data, a researcher might assume an effect size and calculate how many participants are needed for 80% power.

But z-curve does something different. It estimates the average true power of a set of studies. It is only possible to estimate average true power for sets of studies because power estimates based on a single study are typically too imprecise to be useful. Z-curve provides estimates of the average true power of a set of studies and the uncertainty in these estimates.

Populations of Studies

Brunner and Schimmack (2020) introduced an important distinction:

  • All studies ever conducted (regardless of whether results were published).
  • Only published studies, which are often biased toward significant results.

If we had access to all studies, we could simply calculate power by looking at the proportion of significant results. For example, if 50% of all studies show p < .05, then the average power is 50%.

In reality, we only see a biased sample, mostly significant results that made it into journals. This is called selection bias (or publication bias), and it can mislead us.

What Z-Curve Does

Z-curve helps us correct for this bias by:

  1. Using the p-values from published studies.
  2. Converting them to z-scores (e.g., p = .05 → z ≈ 1.96).
  3. Modeling the distribution of these z-scores to estimate:
    • The power of the studies we see,
    • The likely number of missing studies,
    • And the amount of bias.
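The conversion in step 2 is a one-liner (a Python sketch; in R the same conversion is qnorm(1 - p / 2)):

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a two-sided p-value to the corresponding absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 2))   # 1.96
print(round(p_to_z(0.005), 2))  # 2.81
```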

Key Terms in Z-Curve

  • ODR (Observed Discovery Rate): the percentage of studies that report significant results.
  • EDR (Expected Discovery Rate): the estimated percentage of significant results we’d expect if there were no selection bias.
  • ERR (Expected Replication Rate): the estimated percentage of significant studies that would replicate if repeated exactly.
  • FDR (False Discovery Rate): the estimated percentage of significant results that are false positives.

Understanding the Z-Curve Plot

Figure 1. Histogram of z-scores from 1,984 significant tests. The solid red line shows the model’s estimated distribution of observed z-values. The dashed line shows what we’d expect without selection bias. The Observed Discovery Rate (ODR) is 100%, meaning all studies shown are significant. However, the Expected Discovery Rate (EDR) is only 40%, suggesting many non-significant results were omitted. The Expected Replication Rate (ERR) is also 40%, indicating that only 40% of these significant results would likely replicate. The False Discovery Rate (FDR) is estimated at 8%.

Notice how the histogram spikes just above z = 2 (i.e., just significant) and drops off below. This pattern signals selection for significance, which is unlikely to occur due to chance alone.


Homogeneity vs. Heterogeneity of Power

Sometimes all studies in a set have similar power (called homogeneity). In that case, the power of significant and non-significant studies is similar.

However, z-curve allows for heterogeneity, where studies have different power levels. This flexibility makes it better suited to real-world data than methods that assume all studies are equally powered.

When power varies, high-power studies are more likely to produce significant results. That’s why, under heterogeneity, the ERR (for significant studies) is often higher than the EDR (for all studies).


Summary of Key Concepts

  • Meta-analysis = Statistical summary of multiple studies.
  • Statistical significance = p < .05.
  • Power = Probability of finding a significant result.
  • Selection bias = Overrepresentation of significant results in the literature.
  • ODR = Observed rate of p < .05.
  • EDR = Expected rate of p < .05 without bias.
  • ERR = Estimated replication success rate of significant results.

Full Introduction

Z-curve is a statistical tool for meta-analysis of larger sets of studies (k > 100). Although it can be used with smaller sets of studies (k > 10 significant results), confidence intervals are likely to be very wide. There are also different types of meta-analysis. The core application of meta-analysis is to combine information from direct replication studies, that is, studies that test the same hypothesis (e.g., the effect of aspirin on blood pressure). The most widely used meta-analytic tools aim to estimate the average effect size for a set of studies with the same research question. A second application is to quantitatively review studies on a specific research topic. These studies are called conceptual replication studies. They test the same or a related hypothesis, but with different experimental procedures (paradigms). The main difference between meta-analyses of direct and conceptual replication studies is that we would expect less variability in population effect sizes (not the estimates in specific samples) in direct replications, whereas variability is expected to be higher in conceptual replication studies with different experimental manipulations and dependent variables.

Z-curve can be applied to meta-analysis of conceptual replication studies, but it was mainly developed for a third type of meta-analysis. These meta-analyses examine sets of studies with different hypotheses and research designs. Usually, these studies share a common feature. For example, they may be published in the same journal, belong to a specific scientific discipline or sub-discipline, or come from a specific time period. The main question of interest here is not the average effect size, which is likely to vary widely from study to study. The purpose of a z-curve analysis is to examine the credibility or statistical integrity of a set of studies. The term credibility is a broad term that covers many features of a study. Z-curve focuses on statistical power as one criterion for the credibility of a study. To use z-curve and to interpret z-curve results, it is therefore important to understand the concept of statistical power. Unfortunately, statistical power is still not part of the standard education in psychology. Thus, I will provide a brief introduction to statistical power.

Statistical Power

Like many other concepts in statistics, statistical power (henceforth power, the only power that does not corrupt) is a probability. To understand power, it is necessary to understand the basics of null-hypothesis significance testing (NHST). When resources are insufficient to estimate effect sizes precisely, researchers often have to settle for the more modest goal of examining whether a predicted positive effect is indeed positive (exercise increases longevity) or a predicted negative effect is indeed negative (aspirin lowers blood pressure). The common approach is to estimate the effect size in a sample, estimate the sampling error, compute the ratio of the two, and then compute the probability that the observed effect size or an even bigger one could have been obtained without an effect; that is, with a true effect size of 0. Say the effect of exercise on longevity is an extra 2 years, the sampling error is 1 year, and the test statistic is 2/1 = 2. This value corresponds to a p-value of roughly .05 for the hypothesis that the true effect is positive (not 2 years, but greater than 0). P-values below .05 are conventionally used to decide against the null hypothesis and to infer that the true effect size is positive if the estimate is positive, or that the true effect is negative if the estimate is negative. Now we can define power. Power is the probability of obtaining a significant result, which typically means a p-value below .05. In short,

Power is the probability of obtaining a statistically significant result.

This definition of power differs from the textbook definition of power because we need to distinguish between different types of powers or power calculations. The most common use of power calculations relies on hypothetical population effect sizes. For example, let’s say we want to conduct a study of exercise and longevity without any prior studies. Therefore, we do not know whether exercise has an effect or how big the effect is. This does not stop us from calculating power because we can just make assumptions about the effect size. Let’s say we assume the effect is two years. The main reason to compute hypothetical power is to plan sample sizes of studies. For example, we have information about the standard deviation of people’s life span and can compute power for hypothetical sample sizes. A common recommendation is to plan studies with 80% power to obtain a significant result with the correct sign.
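The sample-size planning described above can be sketched with the normal approximation (a Python illustration; exact t-based calculations give slightly larger numbers):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Normal-approximation sample size per group for a two-group comparison."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    z_beta = nd.inv_cdf(power)            # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# A medium effect needs ~63 per group; a small effect needs ~393 per group.
print(n_per_group(0.5), n_per_group(0.2))
```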

It would be silly to compute the hypothetical power for an effect size of zero. First, we know that the probability of a significant result without a real effect is set by the researcher. When researchers use p < .05 as the rule to determine significance, the probability of obtaining a significant result without a real effect is 5%. If they use p < .01, it is 1%. No calculations are needed. Second, researchers conduct power analysis to find evidence for an effect. So, it would make no sense to do the power calculation with a value of zero. This is the null hypothesis that researchers want to reject, and they want a reasonable sample size to do so.

All of this means that hypothetical power calculations assume a non-zero effect size, and power is defined as the conditional probability of obtaining a significant result for a specified non-zero effect size. Z-curve is used to compute a different type of power. The goal is to estimate the average true power of a set of studies. This average can be made up of a mix of studies in which the null hypothesis is true or false. Therefore, z-curve estimates are no longer conditional on a true effect. When the null hypothesis is true, power is set by the significance criterion. When there is an effect, power is a function of the size of the effect. All of this discussion of conditional probability is needed only to understand the distinction between the definition of power in hypothetical power calculations and in empirical estimates of power with z-curve. The short and simple definition of power is the probability that a study produces a significant result.

Populations of Studies

Brunner and Schimmack (2020) introduced another distinction between power estimates that is important for the understanding of z-curve. One population of studies is all studies that have been conducted, independent of the significance criterion. Let’s assume researchers’ computers were hooked up to the internet and whenever they conduct a statistical analysis, the results are stored in a giant database. The database will contain millions of p-values, some above .05 and others below .05. We could now examine the science-wide average power of null hypothesis significance tests. In fact, it would be very easy to do so. Remember, power is defined as the probability of obtaining a significant result. We can therefore just compute the percentage of significant results to estimate average power. This is no different from averaging the results of 100,000 roulette games to see how often a table produces “red” or “black” as an outcome. If the table is biased and has more power to get “red” results, you could win a lot of money with that knowledge. In short,

The percentage of significant results in a set of studies provides an estimate of the average power of the set of studies that was conducted.
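The "giant database" thought experiment is easy to simulate (a Python sketch for illustration; the tutorial itself uses R):

```python
import random
from statistics import NormalDist

random.seed(1)
crit = NormalDist().inv_cdf(0.975)   # significance criterion, z ~ 1.96

# Imagine the giant database: 100,000 tests, each run with 50% true power,
# i.e., the expected z-value of each study equals the critical value.
n_studies = 100_000
significant = sum(random.gauss(crit, 1) > crit for _ in range(n_studies))

print(round(significant / n_studies, 2))  # ~0.50: counting hits recovers average power
```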

We would not need a tool like z-curve if power estimation were that easy. The reason why we need z-curve is that we do not have access to all statistical tests that were conducted in science, psychology, or even a single lab. Although data sharing is becoming more common, we only see the fraction of results that are published in journal articles or preprints on the web. The published set of results is akin to the proverbial tip of the iceberg, and many results remain unreported and are not available for meta-analysis. This means we only have a sample of studies.

Whenever statisticians draw conclusions about populations from samples, it is necessary to worry about sampling bias. In meta-analyses, this bias is known as publication bias, but a better term for it is selection bias. Scientific journals, especially in psychology, prefer to publish statistically significant results (exercise increases longevity) over non-significant results (exercise may or may not increase longevity). Concerns about selection bias are as old as meta-analyses, but actual meta-analyses have often ignored the risk of selection bias. Z-curve is one of the few tools that can be used to detect selection bias and quantify the amount of selection bias (the other tool is the selection model for effect size estimation).

To examine selection bias, we need a second approach to estimating average power, other than computing the percentage of significant results. The second approach is to use the exact p-values of the studies (e.g., p = .17, .05, .005) and to convert them into z-values (e.g., z = 1, 2, 2.8). These z-values are a function of the true power of a study (e.g., a study with 50% power has an expected z-value of ~ 2) and sampling error. Z-curve uses this information to obtain a second estimate of the average power of a set of studies. If there is no selection bias, the two estimates should be similar, especially in reasonably large sets of studies. However, often the percentage of significant results (power estimate 1) is higher than the z-curve estimate (power estimate 2). This pattern of results suggests selection for significance.

In conclusion, there are two ways to estimate the average power of a set of studies. Without selection bias, the two estimates will be similar. With selection bias, the estimate based on counting significant results will be higher than the estimate based on the exact p-values.

Figure 1 illustrates the extreme scenario that the true power of studies was just 40%, but selection bias filtered out all non-significant results.


Figure 1. Histogram of z-scores from 1,984 significant tests (based on a simulation of 5,000 studies with 40% power). The solid red line represents the z-curve fit to the distribution of observed z-values. The dashed red line shows the expected distribution without selection bias. The vertical red line shows the significance criterion, p < .05 (two-sided, z ~ 2). ODR = Observed Discovery Rate, EDR = Expected Discovery Rate, ERR = Expected Replication Rate. FDR = False Positive Risk, not relevant for the Introduction.


The figure shows a z-curve plot. Understanding this plot is important for the use of z-curve. First, the plot is a histogram of absolute z-values. Absolute z-values are used because in field-wide meta-analyses the sign has no meaning. In one study, researchers predicted a negative result (aspirin decreases blood pressure) and in another study they predicted a positive result (exercise increases longevity). What matters is that the significant result was used to reject the null hypothesis in either direction. Z-values above 6 are not shown because they are very strong and imply nearly 100% power. The critical range of z-scores is between 2 (p = .05, just significant) and 4 (~ p = .0001).

The z-curve plot makes it easy to spot selection for significance: there are many just-significant results (z > 2) but no just-not-significant results. The latter are often called marginally significant results because publications use them to reject the null hypothesis with a relaxed criterion. A plot like this cannot be produced by sampling error alone.

In a z-curve plot, the percentage of significant results is called the observed discovery rate. Discovery is a term used in statistics for a significant result. It does not mean a breaking-news discovery. It just means p < .05. The ODR is 100% because all results are significant. This would imply that all studies tested a true hypothesis with 100% power. However, we know that this is not the case. Z-curve uses the distribution of significant z-scores to estimate power, but there are two populations for which power can be estimated. One population is all studies, including the missing non-significant results. I will explain later how z-curve estimates power. Here it is only important that the estimate is 40%. This estimate is called the expected discovery rate. That is, if we could get access to all missing studies, we would see that only 40% of the studies were significant. Expected therefore means without selection bias and with open access to all studies. The difference between the ODR and EDR quantifies the amount of selection bias. Here, selection bias inflates the ODR from 40% to 100%.
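The gap between ODR and EDR is easy to reproduce in a small simulation (a hypothetical Python reconstruction of the Figure 1 scenario; the exact settings of the original R simulation may differ):

```python
import random
from statistics import NormalDist

random.seed(42)
nd = NormalDist()
crit = nd.inv_cdf(0.975)        # significance criterion, z ~ 1.96
mu = crit + nd.inv_cdf(0.40)    # mean z-value that gives 40% power: ~1.70

z = [abs(random.gauss(mu, 1)) for _ in range(5000)]  # 5,000 homogeneous studies
sig = [x for x in z if x > crit]                     # only these get published

print(len(sig))                      # roughly 2,000 significant results
print(round(len(sig) / len(z), 2))   # true discovery rate ~ .40; ODR among published is 100%
```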

It is now time to introduce another population of studies. This is the population of studies with significant results. We do not have to assume that all of these studies were published. We just assume that the published studies were not selected based on their p-values. This is a common assumption in selection models. We will see later how changing this assumption can change results.

It is well known that selection introduces bias in averages. Selection for significance retains studies whose positive sampling error pushed their z-scores above 2, even though the expected z-score without sampling error is only 1.7, which is not significant on its own. Thus, a simple power calculation for the significant results would overestimate power. Z-curve corrects for this bias and produces an unbiased estimate of the average power of the population of studies with significant results. This estimate of power after selection for significance is called the expected replication rate (ERR). The reason is that the average power of the significant results predicts the percentage of significant results if the studies with significant results were replicated exactly, including the same sample sizes. The outcome of this hypothetical replication project would be 40% significant results. The decrease from 100% to 40% is explained by the effects of selection and regression to the mean. A study that had an expected value of 1.7, but was pushed to 2.1 by sampling error, is unlikely to benefit from the same sampling error again and produce another significant result.
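The inflation caused by selection can be demonstrated directly (a Python sketch under the assumption of homogeneous 40% power, i.e., an expected z-score of 1.70):

```python
import random
from statistics import NormalDist

random.seed(7)
nd = NormalDist()
crit = nd.inv_cdf(0.975)   # ~1.96
mu = 1.70                  # expected z-score for roughly 40% power

zs = [random.gauss(mu, 1) for _ in range(100_000)]
sig = [z for z in zs if z > crit]   # selection for significance

# Naive "observed power": plug each selected z back into the power formula.
naive = sum(1 - nd.cdf(crit - z) for z in sig) / len(sig)
true_power = 1 - nd.cdf(crit - mu)

print(round(true_power, 2))   # ~0.40
print(round(naive, 2))        # much higher: selection plus regression to the mean
```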

At the bottom of z-curve 3.0 plots, you see estimates of local power. These are average power estimates for ranges of z-values. The default is to use steps of z = 0.5. You see that the strength of the observed z-values does not matter: z-values between 0 and 0.5 are estimated to have 40% power, as are z-values between 5.5 and 6. This happens when all studies have the same power. When studies differ in power, local power increases with the z-values because studies with higher power are more likely to produce larger z-values.

When all studies have the same power, power is said to be homogenous. When studies have different levels of power, power is heterogeneous. Homogeneity or small heterogeneity in power imply that it is easy to infer the power of studies with non-significant results from studies with significant results. The reason is that power is more or less the same. Some selection models like p-curve assume homogeneity. For this reason, it is not necessary to distinguish populations of studies with or without significant results. It is assumed that the true power is the same for all studies, and if the true power is the same for all studies, it is also the same for all subsets of studies. This is different for z-curve. Z-curve allows for heterogeneity in power, and z-curve 3.0 provides a test of heterogeneity. If there is heterogeneity in power, the ERR will be higher than the EDR because studies with higher power are more likely to produce a significant result (Brunner & Schimmack, 2020).
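Why heterogeneity pushes the ERR above the EDR can be shown with a toy mixture (a Python sketch; the two power levels are invented for illustration):

```python
import random
from statistics import NormalDist

random.seed(3)
nd = NormalDist()
crit = nd.inv_cdf(0.975)

# Heterogeneous power: half the studies have 20% power, half have 80% power.
mus = [crit + nd.inv_cdf(p) for p in (0.20, 0.80)]
studies = [(mu, random.gauss(mu, 1)) for mu in mus for _ in range(50_000)]

power_all = [1 - nd.cdf(crit - mu) for mu, _ in studies]
power_sig = [1 - nd.cdf(crit - mu) for mu, z in studies if z > crit]

edr = sum(power_all) / len(power_all)   # average power of all studies: 0.50
err = sum(power_sig) / len(power_sig)   # average power of significant studies: ~0.68
print(round(edr, 2), round(err, 2))
```

High-power studies are overrepresented among the significant results, so the average power after selection is higher than the average power of all studies.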

To conclude, this introduction covered the basic statistical concepts that are needed to conduct z-curve analyses and to interpret the results correctly. The key constructs are:

Meta-Analysis: the statistical analysis of results from multiple studies
Null Hypothesis Significance Testing
Statistical Significance: p < .05 (alpha)
(Statistical) Power: the probability of obtaining a significant result
Conditional Power: the probability of obtaining a significant result with a true effect
Populations of Studies: A set of studies with a common characteristic
Set of all studies: studies with non-significant and significant results
Selection Bias: An overrepresentation of significant results in a set of studies
(Sub)Set of studies with significant results: Subset of studies with p < .05
Observed Discovery Rate (ODR): the percentage of significant results in a set of studies
Expected Discovery Rate (EDR): the z-curve estimate of the discovery rate based on z-values
Expected Replication Rate (ERR): the z-curve estimate of average power for the subset of significant results.

Beyond Carter et al. (2019): Simulation #1

In this blog post, I continue and expand my exploration of Carter et al.’s (2019) extensive simulation studies of meta-analytic methods. I focus on Carter et al. because it was a seminal study and remains the most thorough study so far. It also remains the key article that is being cited in applied articles to interpret conflicting results.

My previous posts reproduced Carter et al.’s (2019) simulations exactly. In my “Beyond Carter” series, I am starting to investigate new scenarios that were not considered in their design. This is the first blog post in this series. It focuses on the influence of sample sizes in the original studies on the performance of various methods. Carter et al. (2019) used a distribution of sample sizes that included many studies with large samples. This is not immediately obvious because their figure was truncated at N = 400.

However, the actual sample sizes were not truncated at N = 400, although studies with more than 800 participants are rare in experimental psychology, even in the days of online studies. The use of a single distribution of sample sizes is problematic because results based on these simulations may not generalize to meta-analysis with smaller sample sizes. It is also problematic to make recommendations about the interpretation of conflicting results based on these simulations. A more sensible approach would be to ask researchers to perform a simulation study with the sample sizes that match the sample sizes of their actual dataset. After all, sample sizes are known knowns, and not untestable assumptions.

In this blog post, I used the sample sizes of Carter et al.’s (2019) ego-depletion meta-analysis to see how robust their recommendations are to changes in sample sizes. To keep all other aspects of the simulation the same, I used Carter et al.’s simulation function and created more studies than needed. I then sampled from this set of studies to match the percentage of studies with very small (39%, N < 40), small (52%, N = 40 – 100), medium (8%, N = 100 – 200), and large (1%, N = 200 – 400) samples in the ego-depletion dataset.
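The stratified resampling of sample sizes can be sketched as follows (a Python illustration; the lower bound of 10 for the smallest bin is an assumption, since the text only says N < 40):

```python
import random

random.seed(0)

# Target mix reported for the ego-depletion dataset: (N range, share of studies).
bins = [((10, 39), 0.39), ((40, 100), 0.52), ((100, 200), 0.08), ((200, 400), 0.01)]

def draw_sample_size():
    """Draw one per-study N that matches the reported size distribution."""
    (lo, hi), = random.choices([b for b, _ in bins], weights=[w for _, w in bins])
    return random.randint(lo, hi)

sizes = [draw_sample_size() for _ in range(100)]
print(min(sizes) >= 10, max(sizes) <= 400)  # True True
```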

The first blog post uses Carter et al.’s Simulation #328 with k = 100 studies, an average population effect size of d = .2, a standard deviation of normally distributed population effect sizes (tau) of .4, no selection bias, and a high amount of p-hacking. The results for the simulation with Carter et al.’s sample sizes showed that p-hacking introduces a small (d = .1 to .2) downward bias in the effect size estimates for all methods. The question is whether smaller sample sizes increase that downward bias and whether some methods are more affected by p-hacking when sample sizes are small.

Random Effects Meta-Analysis

I added the standard random effects meta-analysis model to my simulation studies. The main reason is that its results serve as a benchmark for evaluating bias-correction methods. If bias is small, the costs of correcting for bias may outweigh the benefits of doing so. Carter et al.’s (2019) study already showed that p-hacking and selection bias have different effects on random effects meta-analysis. Namely, p-hacking – as simulated by Carter et al. (2019) – introduced only a small bias in effect size estimates.

This observation was confirmed here. The average effect size estimate was d = .38, average 95%CI = .28 to .49, and 70% of the confidence intervals included the true value of d = .30. The model also slightly overestimated the true heterogeneity, average tau = .46, 95%CI = .35 to .54. While there is some bias in the estimates, they do not radically change the conclusions. Moreover, it is not clear that bias-correction models can do better in this scenario.
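The random effects benchmark can be sketched via the classic DerSimonian-Laird moment estimator (a minimal Python illustration; the simulations themselves would use an R package such as metafor):

```python
def dersimonian_laird(d, v):
    """Textbook DerSimonian-Laird random-effects estimate from effect sizes d
    and their sampling variances v (a minimal moment estimator, not metafor's code)."""
    w = [1 / vi for vi in v]
    sw = sum(w)
    d_fixed = sum(wi * di for wi, di in zip(w, d)) / sw
    q = sum(wi * (di - d_fixed) ** 2 for wi, di in zip(w, d))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(d) - 1)) / c)      # between-study variance estimate
    w_re = [1 / (vi + tau2) for vi in v]         # re-weight with tau2 added
    est = sum(wi * di for wi, di in zip(w_re, d)) / sum(w_re)
    return est, tau2

est, tau2 = dersimonian_laird([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(round(est, 2), round(tau2, 2))  # 0.5 0.05
```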

It is also instructive to compare this result with Carter et al.’s (2019) ego-depletion RMA result that produced an effect size estimate of d = .43, 95%CI = .34 to .52. Carter et al. (2019) dismiss this finding because the model has a high false positive risk when the null hypothesis is true, but this does not mean that it would produce estimates of d = .4 when the null hypothesis is true. The present simulation suggests that the result could be obtained with a small average effect size, high heterogeneity, and p-hacking.

Selection Model

An improved selection model that accounts for p-hacking did well in the simulations with Carter et al.’s (2019) sample sizes, but it struggled in this simulation with smaller sample sizes. It often failed to estimate all of the parameters, and the effect size estimate failed to reject the false null-hypothesis.

The advantage of the selection model is that it can be modified if it fails to work well. One problem for the selection model was that Carter et al.’s (2019) simulation contained several negative values, including some statistically significant ones. To reduce the number of steps, it is possible to remove all negative values and to specify this selection with a step at p = .5 (one-sided). The model then uses a fixed weight of .01 for this interval and does not need to estimate the weight parameter, making it easier to estimate the remaining free parameters.

However, even this modification was not sufficient to identify all parameters reliably. The problem was that there were not enough just-significant p-values to estimate the parameter for p-hacked p-values between .005 and .025. A solution to this problem was to widen the interval to .005 to .05. This is even more reasonable with real data because p-hacking also often leads to marginally significant results (.025 to .05) that are presented as evidence against the null-hypothesis. A model with a single parameter for just significant (.005 to .025) and marginally significant results (.025 to .050) can be used when the set of studies is relatively small. The modified selection models with steps = c(.005, .050, .500, 1) produced estimates in 99% of the simulations.
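The step structure of this modified model can be illustrated with a toy weight function (a conceptual Python sketch only; the parameter names and example weights are invented, and real selection models estimate these weights by maximum likelihood):

```python
# Hypothetical step weight function with steps c(.005, .050, .500, 1):
# p-values in different intervals get different relative publication weights.
def weight(p, w_just_sig, w_nonsig_pos, steps=(0.005, 0.050, 0.500, 1.0)):
    """Relative publication probability by one-sided p-value interval.
    Weights are relative to the first interval, which is fixed at 1."""
    if p <= steps[0]:
        return 1.0           # highly significant: reference interval
    if p <= steps[1]:
        return w_just_sig    # just or marginally significant (merged interval)
    if p <= steps[2]:
        return w_nonsig_pos  # non-significant but in the predicted direction
    return 0.01              # wrong-sign results: fixed near-zero weight

print(weight(0.03, w_just_sig=2.3, w_nonsig_pos=0.4))  # 2.3
```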

The true average effect size was d = .30. The true average is higher than d = .20 because the selection of different sample sizes changed the population of studies, and the new population had a higher average. The average effect size estimate was d = .32, average 95%CI = .22 to .43, and 67% of confidence intervals included the true average. Thus, the model showed a small upward bias, but it was able to do a bit better than the random effects model that ignored bias.

To compare the performance of the selection model to models that focus on positive results or positive and significant results, I used the estimated average and standard deviation of population effect sizes to estimate the average effect size for these populations of studies. The true average for only positive results was d = .38. The average selection model estimate was d = .46, average 95%CI = .21 to .65, and 65% of confidence intervals included the true value. Thus, upward bias is also present for this subset of studies.

The true average for positive and significant results was d = .50. The average estimate was d = .65, average 95%CI = .25 to .88, and 94% of confidence intervals included the true average. The bias is larger, but the wider confidence intervals produce better coverage than the estimates for the other sets of data.

The point estimate of heterogeneity was good, average tau = .39 (true tau = .4), but the average 95% confidence interval was wide, .04 to .57. These wide confidence intervals included the true value 91% of the time.

The model correctly identified selection of significant results in all simulations. The highest estimate of the selection weight for non-significant positive results was w = .65, and the highest upper limit of the confidence intervals was w = .89. Thus, the model can be used to detect selection bias and warn against the use of models that do not take biases into account.

The model noticed p-hacking. That is, the average selection weight for just significant results was well above 1, average w = 2.32, but the average 95%CI was wide, 0.64 to 3.84, and only 13% of confidence intervals rejected the null hypothesis of no p-hacking. Thus, it is difficult to diagnose p-hacking in this simulated condition.

Overall, the performance of the selection model was negatively affected by the smaller sample sizes of the original studies (not a smaller set of studies). The full model that produced good estimates before had identification problems. A modified model failed to fully correct for p-hacking, and the confidence intervals often did not include the true value. While the results were not terrible, there is room for other models to do better. At the same time, the model correctly identified heterogeneity and selection bias. Thus, the model should always be used to examine assumptions. A fixed effect size model without bias correction should not be used when the selection model shows evidence of heterogeneity and bias.

PET-PEESE

The PET-PEESE model was strongly influenced by the change in sample sizes. PET regresses effect size estimates on sampling error. With fewer studies that have large samples and small standard errors, the prediction of the intercept becomes more uncertain, as reflected in wider confidence intervals. Moreover, without unbiased estimates close to the intercept, the correction for p-hacking is more difficult. This led to a strong downward bias in effect size estimates, average d = -.57, average 95%CI = -1.84 to .71. Although 74% of the very wide confidence intervals included the true value of d = .27, none of the confidence intervals rejected the false null hypothesis.

It is instructive to compare these results to the PET results for the actual ego-depletion meta-analysis, d = -.27, 95%CI = -.52 to .00. The results also show a negative estimate and fail to reject the null hypothesis. Although the confidence interval is narrower, the pattern qualitatively matches the original results. Thus, p-hacking and small samples may have produced the negative estimate in the actual meta-analysis. This would undermine Carter et al.’s conclusion that PET is most likely to produce the correct results based on their simulations with larger sample sizes.

PEESE is a regression of effect size estimates on the sampling variance (i.e., squared standard errors). PEESE estimates are only used if PET produces a positive and significant result. Thus, they are irrelevant in this simulation, but even PEESE produced negative estimates, average d = -.14, average 95%CI = -.19 to -.09. The confidence intervals are narrower. These results are even more similar to the results for the actual ego-depletion data, d = .00, 95%CI = -.14 to .15.

In conclusion, the combination of small sample sizes and p-hacking leads to a strong downward bias for PET estimates. This also influences the overall result of PET-PEESE because non-significant or negative PET results are given priority in effect size estimation to avoid overestimation with PEESE. However, if PET has a strong downward bias, PEESE results are actually better, questioning the priority of PET in the interpretation of results.

I also added an additional analysis of the data to explore the reason for the downward bias in PET regression. For this purpose, I examined the correlation between sampling error and the POPULATION effect sizes. Without bias, the simulation assumes that the two are uncorrelated. That is, sample sizes are randomly paired with population effect sizes. However, p-hacking produces a correlation between sampling error and population effect sizes, average r = .39. PET assumes that this correlation is zero and that any correlation between sampling error and the effect size estimates reflects bias. This leads to a downward bias in the estimates when the assumption is false and studies with smaller sampling error have smaller effect sizes. This bias should vary as a function of the estimated correlation in a specific simulation. This is indeed the case. The bias in PET estimates was negatively correlated with the correlation between sampling error and population effect sizes, r = -.42. Future work needs to explore how p-hacking creates a correlation between sample sizes and the true population effect sizes. For now, these results explain why PET regression has a downward bias. This problem has been overlooked in previous tests of PET regression because simulations either did not simulate p-hacking (Stanley, 2017) or the dataset included many studies with large samples that minimize the downward bias (Carter et al., 2019).

The key finding is that negative PET estimates suggest that PET estimates have a downward bias and that PET-PEESE should not be used, when better alternatives are available. It is especially problematic to use negative PET results as evidence that the average effect size is zero, when other models produce positive effect sizes (Carter et al., 2019).

The poor performance of PET-PEESE in this simulation is not unexpected. The author of PET-PEESE pointed out that PET-PEESE can be unreliable when bias is present and (a) there is little variability in the sample sizes of the original studies or (b) the sample sizes are small and the observed data points are far from the intercept. The present analysis only adds that p-hacking makes things worse and produces a strong downward bias and negative estimates when the average effect size is small. The only reason to use PET-PEESE would be that it performs better in other conditions, but so far the selection model has always performed as well as or better than PET-PEESE. Thus, there is no reason to use PET-PEESE, especially because it is difficult to detect p-hacking. Maybe the main reason to use PET regression is to use negative estimates as a test of p-hacking.

P-Uniform

P-uniform uses only positive and significant results and estimates the average effect size of this set of results. Carter et al. (2019) claimed that p-uniform overestimates effect sizes, but this claim was based on the false assumption that p-uniform estimates the effect size of all studies. This is only possible if all studies have the same population effect size and all subsets of studies have the same effect size (i.e., tau = 0). I found that p-hacking actually produces a downward bias in p-uniform estimates, as it does for all models that assume the bias was produced by selection for significance when it was actually produced by p-hacking.

P-uniform also has a bias test, but only 5% of bias tests were significant, which is expected by chance alone. Thus, the selection model should be used to test for the presence of bias.

P-uniform underestimated the average effect size of studies with positive and significant results, average d = .33, average 95%CI = .05 to .53, and 64% of confidence intervals included the true value of d = .50. The performance is about the same as the selection model, but with opposite biases.

Carter et al.’s (2019) ego-depletion results showed an estimate of d = .55, 95%CI = .33 to .71. The present results suggest that this finding is consistent with a scenario with a small average effect size, high heterogeneity, and p-hacking. The higher estimate in the actual data may be due to less severe p-hacking than in this simulation.

P-Curve

Carter et al. (2019) used p-curve to estimate effect sizes. I did not include p-curve for this purpose in my previous posts because the results are similar to p-uniform and p-uniform has an R package. Here I include p-curve estimates of power. The reason is that p-curve is much more often used to examine evidential value and bias than to estimate effect sizes. Often p-curve is used in combination with a traditional meta-analysis to claim that estimates without bias correction are acceptable because p-curve showed evidential value.

The inclusion of p-curve also makes it possible to compare its performance with z-curve, another method that focuses on power rather than effect sizes. Z-curve has the advantage that it models heterogeneity in power and produces better estimates when heterogeneity is high (Brunner & Schimmack, 2020). The p-curve authors have claimed that the superior performance of z-curve occurs only in rare cases with extreme outliers (Datacolada, 2018). Thus, it is interesting to test this claim with standard meta-analytic scenarios.

P-curve estimates the average power of positive and significant results. The average estimate was 32%, average 95%CI = 18% to 47%, and 70% of the confidence intervals included the true value of 38%. The results again show a downward bias because p-curve corrects for inflation assuming selection for significance, whereas p-hacking introduces a smaller bias. This leads to an underestimation of the true value. However, the bias is relatively small.

P-curve sometimes uses the arbitrary criterion of 33% power to examine whether a set of significant values has acceptable average power. Only 14% of simulations rejected the hypothesis that power is at least 33%. This finding is not very interesting when the true power is 38%. There is no meaningful difference between 28% and 38% power.

The most interesting result is that p-curve produced a relatively good estimate of average power despite high heterogeneity in effect sizes. The reasons for this good performance become clearer in the analysis of the data with z-curve.

Z-Curve

One advantage of z-curve is that a z-curve plot provides some information about bias. To illustrate this, I am presenting the results of a z-curve plot with k = 5,000 studies. The actual results are based on the same simulation data as for the other models.

Figure 1 shows a plot of the 735 results with a positive z-value (z-values are obtained by converting the t-values implied by the effect size estimates and sampling errors into z-scores). Visual inspection of the pattern shows missing non-significant values and too many just significant results. Just significance is arbitrarily defined by p-values between .05 and .01, which corresponds to z-scores between 2 and 2.6. The test of excessive just significant results is highly significant with k = 735 tests. The pattern suggests that the excess of just significant results was obtained by p-hacking some of the non-significant results. However, other processes that produce this pattern of results are possible. This makes it difficult to correct for p-hacking.
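For readers who want to check the z-score boundaries of the just-significant band, the p-to-z conversion can be sketched with the standard normal quantile function. This is a generic sketch, not the z-curve package code.

```python
from statistics import NormalDist

def p_to_z(p_two_sided):
    """Convert a two-sided p-value into an absolute z-score: z = Phi^{-1}(1 - p/2)."""
    return NormalDist().inv_cdf(1 - p_two_sided / 2)

# The "just significant" band .01 < p < .05 corresponds to
# roughly 1.96 < z < 2.58, the values rounded to 2 and 2.6 in the text.
```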

In the simulation, I tried to test for p-hacking by fitting the z-curve model to the “really” significant results (z > 2.6) and checking whether there are too many just significant results. However, this test has low power and cannot be performed when there are only a few “really” significant results, p < .01. With k = 100, the test did not work. Thus, I fitted the normal selection model that assumes excessive just significant results are due to not publishing non-significant results.

Figure 2 shows the result. The expected replication rate (ERR) is a different name for the average power of the positive and significant results and can be compared with the p-curve estimate. In this single simulation, the estimates are similar, 29% versus 32%, but the real comparison has to use the same data. The main point here is that p-curve does better than z-curve in this scenario. The reason is that p-curve results have two biases. Estimates have a downward bias when selection is assumed but p-hacking was used. This is true for all models. In addition, p-curve has an upward bias when the data are heterogeneous. When p-hacking and heterogeneity are both present, the two biases cancel each other out. However, in real data it is not known how much p-hacking was used, and the upward bias may lead to inflated estimates. This bias can be severe. It is therefore problematic to rely on p-curve estimates alone. In combination with z-curve and a plot like the one in Figure 1, p-curve results may be interpreted as a better correction for p-hacking. An alternative approach is to use the biased z-curve estimates as conservative estimates rather than to correct estimates upward. This would be a p-hacking penalty. Given the uncertainty about the practices that were used to p-hack data, this might be considered an appropriate correction.

One advantage of z-curve over p-curve is that it provides tests of bias. The test of excessive just significant results found on average 60% just significant results when the model only predicted 40%, and 96% of the significance tests were significant. The performance of the selection model was better, but it sometimes fails to provide estimates. Using both tests should produce convergent evidence that bias is present.

The average estimate of average power for positive and significant results (average ERR) was 28%, average 95%CI = 12% to 47%, and 81% of the confidence intervals included the true value. It is noteworthy that the coverage of the confidence interval is better than the coverage of p-curve confidence intervals, although the downward bias was more pronounced. This shows that p-curve confidence intervals are too narrow, which has been observed before (Shane et al., 2019). However, p-curve estimates were on average less biased than z-curve estimates (average bias -6 vs -10 percentage points). This difference is not practically significant, and both methods lead to the same conclusion in this simulation.

Z-curve extrapolates from the distribution of significant results to estimate the average power of all positive results, significant and non-significant ones. P-curve does not do this because it assumes homogeneity. If all studies have the same power, power is the same before and after selection for significance (Brunner & Schimmack, 2020). Thus, we can also use the p-curve estimate of 32% as an estimate of the average power for all positive results.

In this specific simulation, p-curve is lucky and hits the nail on the head. The estimate of 32% matches the true value exactly. In contrast, z-curve produces a lower estimate of 15%, 95%CI = 5% to 36%. Thus, the downward bias introduced by p-hacking has more severe effects on this estimate. Better ways to detect and correct for p-hacking are therefore needed to keep the p-hacking penalty at a reasonable level.

One approach is to create a p-curve version of z-curve. That is, the model uses only a single component to fit the data. In addition, the model allows the standard deviation of the normal distribution to be freely estimated rather than fixing it at 1. In this model, the peak of the normal distribution was estimated to be 1.68 (see Figure 3) and the standard deviation was less than 1, SD = .86.

The SD below 1 is the result of p-hacking that leads to a bunching of z-values in the just-significant range. The model cannot fit the distribution of the really significant results, but there are very few. As a result, this model provides an even better estimate of average power than p-curve, ERR = 39%, true value = 38%. This may just be a lucky result, but allowing for a free component with a free mean and free SD may be a way to capture p-hacking with z-curve.

A new extension of z-curve converts power estimates into effect size estimates. The simplifying assumption that makes this possible is that population effect sizes are independent of sample sizes. Assuming positive or negative correlations would produce different estimates. In the present simulation, it was already shown that the independence assumption is violated and biased the PET regression results.
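Under the independence assumption just described, the conversion between power and effect size can be sketched with a normal approximation to a two-sample test. This is a hypothetical illustration with made-up function names; the actual z-curve implementation may differ.

```python
from statistics import NormalDist

Z = NormalDist()

def d_to_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sample test via the normal approximation.
    Noncentrality: delta = d * sqrt(n/2); wrong-sign significance is ignored."""
    z_crit = Z.inv_cdf(1 - alpha / 2)            # 1.96 for alpha = .05
    delta = d * (n_per_group / 2) ** 0.5
    return 1 - Z.cdf(z_crit - delta)

def power_to_d(power, n_per_group, alpha=0.05):
    """Invert the power function above: recover d from an average power
    estimate, assuming effect sizes are independent of sample sizes."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    delta = z_crit + Z.inv_cdf(power)            # implied noncentrality
    return delta / (n_per_group / 2) ** 0.5
```

Because the two functions are exact inverses of each other under this approximation, converting a power estimate to d and back recovers the original power.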

The average effect size estimate for positive and significant results based on z-curve estimates of power, however, was only slightly biased, estimated average d = .44, average 95%CI = .22 to .60, and 90% of confidence intervals included the true value of d = .50. However, the confidence intervals were wide, and only 61% of them excluded d = .2, the value of a small effect size. This performance is better than the performance of p-uniform and the selection model.

The average estimate for the effect size of all positive results, based on the power estimate for this set of studies, was d = .28, average 95%CI = .02 to .51, and 76% of confidence intervals included the true value of d = .42. Only 5% of confidence intervals excluded a value of .2. Thus, the downward bias made it difficult to show that the effect size is not small. The selection model performed better.

Conclusion

The main finding in this simulation was that all models except PET-PEESE provided positive estimates of the average effect size for all studies or subsets of studies. P-curve and z-curve also showed that the studies with positive and significant results are not all false positive results. That is, the null hypothesis is clearly rejected at least for a subset of studies.

Why is this important? This finding is important because it mirrors Carter et al.’s (2019) results for actual studies of ego-depletion (see Figure 4, a copy of their Table 2).

However, Carter et al. (2019) used their simulation studies to suggest that PET-PEESE results are the most trustworthy results because PET-PEESE showed the best performance in simulations without a real effect. Thus, the finding that PET-PEESE did not show a positive effect, while the other models showed one, was interpreted as evidence that the average effect size is probably zero, and implicitly as evidence that estimates of the other models are false rejections of a true null hypothesis.

The present results undermine this conclusion with a simulation that is identical to their simulations of p-hacking, but with sample sizes that match the ego-depletion literature. Because the present simulation uses the sample sizes of the actual ego-depletion studies, the results are more relevant than simulation results with other sample sizes that make PET-PEESE perform better. It is therefore necessary to revise Carter et al.’s (2019) conclusions about PET-PEESE and the interpretation of the ego-depletion meta-analysis. First, it is extremely unlikely that all significant results in the ego-depletion meta-analysis were false positive results. This does not mean that it is easy to replicate any particular study, because hidden moderators may make it difficult to replicate these studies, but it does mean that ego-depletion manipulations sometimes had at least a small effect on some dependent variable. More importantly, the results mean that PET-PEESE is not as robust as Carter et al.’s (2019) table and online app suggest. The table suggests that PET-PEESE performs well with small effect sizes. I showed that this is not the case and that downward bias can turn a small positive average effect size into an estimated moderate negative effect. The difference between the true and estimated values is large (d > .5). Thus, it may be necessary to reexamine meta-analyses that relied on PET-PEESE to suggest that a set of studies lacks evidential value.

The results also imply that it is necessary to revisit other scenarios in Carter et al.’s simulation study with other compositions of sample sizes, especially sets of studies with small to medium sample sizes that lack large sample sizes that produce unbiased estimates. In other words, Carter et al.’s (2019) article was a good start, but not the final word on the performance of meta-analytic methods under realistic conditions.

Beyond Carter et al. (2019): P-Hacking in Small Samples

To start on a positive note, Carter et al. (2019) conducted an extensive simulation study that examined the performance of several meta-analytic methods under various scenarios. Most important, the study simulated p-hacking; that is, the use of various statistical tricks to produce significant results in an underpowered study. To my knowledge, this remains the only thorough examination of p-hacking.

The main result in Carter et al.’s simulation study was that p-hacking inflates estimates of traditional methods that assume no selection bias; that is, studies are published independent of the significance of a result. It is well known that this assumption is unrealistic in psychology. To address the problem of selection bias, a number of methods try to correct for selection bias. Carter et al. (2019) observed that these methods tend to produce downward biased estimates when p-hacking is used (they misinterpret the results of p-curve and p-uniform, but that is not relevant here).

One of the methods studied by Carter et al. (2019) is called PET-PEESE. PEESE is only used when PET obtains a positive and significant estimate of the average effect size. Here only PET is relevant because PET produces non-significant or significant and negative results in the reported simulations. PET regresses effect sizes on sampling error. It corrects for bias because studies with larger sampling error require larger effect size estimates to produce a significant result. Thus, p-hacking will inflate the effect size estimates more in these studies and produce a correlation between sampling error and effect size estimates. The regression of effect sizes on sampling error is used to detect bias, and the intercept of the regression model is used as the bias-corrected estimate of the true average population effect size.
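The two regressions can be sketched in a few lines of weighted least squares. This is a generic illustration with unbiased simulated data (no p-hacking or selection), not Carter et al.'s code, so the intercept should roughly recover the true average effect; the point is only to show what PET (d on SE) and PEESE (d on SE squared) actually fit.

```python
import random

def wls_intercept_slope(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

# Simulate unbiased studies: no selection, no p-hacking.
random.seed(1)
TRUE_D = 0.20
ses, ds = [], []
for _ in range(5000):
    n = random.choice([20, 50, 100, 200])    # per-group sample size
    se = (2 / n) ** 0.5                      # approximate SE of d
    ds.append(random.gauss(TRUE_D, se))      # unbiased estimate
    ses.append(se)

weights = [1 / se ** 2 for se in ses]
pet_intercept, _ = wls_intercept_slope(ses, ds, weights)                  # PET: d on SE
peese_intercept, _ = wls_intercept_slope([s ** 2 for s in ses], ds, weights)  # PEESE: d on SE^2
```

With biased data (p-hacking inflating small-n estimates), the slope absorbs the bias and the intercept is pulled away from the true value, which is the mechanism discussed in the text.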

Carter et al. (2019) observed that “When PET-PEESE estimates were unbiased or biased downward, as in the case of nonzero true effects, QRPs led to greater downward bias. This downward bias was sometimes quite strong when the null was true. The PET-PEESE method yielded statistically significant effects of opposite sign in many analyses” (p. 134).

So far, so good, but then Carter et al. (2019) make some questionable claims about the performance of PET-PEESE. “A statistically significant PET-PEESE estimate in the unexpected direction probably is incorrect, but researchers should be aware that when they obtain such an estimate, there is likely to be some combination of QRPs and publication bias and, perhaps, a null effect” (p. 134).

The problem in this statement is the suggestion that a negative intercept is likely to occur with a null effect; that is, when the population effect size is zero. This suggestion implies that p-hacking would not produce negative effect size estimates when the average population effect size is not zero. This interpretation of PET-PEESE is also implied in other statements, like “the QRPs [questionable research practices = p-hacking] nudged PET-PEESE estimates downward” (p. 124). Nudging suggests a small bias. However, I am going to show that PET regression can turn a true average effect size of d = .2 into an estimated effect size of d = -.4. This is a large difference that leads to substantially wrong conclusions, even if the d = -.4 estimate is interpreted as evidence that d = 0.

I will show that Carter et al.’s (2019) results hold for the specific scenario that they simulated, but that their simulations made unrealistic assumptions about the distribution of sample sizes in psychological research, including the ego-depletion meta-analysis that motivated their simulation study.

Replication of Carter et al. (2019)

I have replicated some important scenarios from Carter et al.’s (2019) extensive design (Schimmack, 2025). In these simulations I focused on a realistic scenario that also applies to Carter et al.’s (2019) meta-analysis of ego-depletion. The known parameters are that the set of studies is about 100 (k = 116) and that there is large heterogeneity in population effect sizes. This assumption was confirmed with the 3PSM selection model that estimated the standard deviation of the population effect sizes to be about .4.

What is not known is the average true effect size and the source of bias: selection bias or p-hacking. Regarding the effect size, the most interesting scenarios are an average effect size of 0 (no effect) or d = .2 (a small effect). Large effects should be detected easily with all methods.

The most important finding of these simulations was that all methods performed well and produced relatively consistent results. The most interesting condition is the simulation of high p-hacking and no selection bias (Carter Simulation #328). A modified selection model produced an estimate of d = .13, 95%CI = -.13 to .39. While a focus on hypothesis testing would suggest a high false negative rate (i.e., the confidence interval includes 0), the results are clearly consistent with a small effect size, and 88% of confidence intervals included the true value of d = .20.

The PET estimate was d = .04, average 95%CI = -.12 to .21 and only 45% of the confidence intervals included the true value of d = .20. This finding is notable for two reasons. First, the selection model performs better than the PET regression. Second, PET does not produce a negative estimate in this scenario.

This is important because Carter et al.’s (2019) meta-analysis of the actual ego-depletion data produced different results. Here the selection model showed evidence of a moderate effect, d = .3, and PET regression produced a negative estimate of d = -.3, a difference of .6 standard deviations. This finding was used by Carter et al. (2019) to argue that (a) the PET results are more credible, (b) the average population effect size in the ego-depletion literature is likely to be zero, and (c) different meta-analytic methods can produce dramatically different results. The problem is that this pattern only occurred in the analysis of the actual ego-depletion data and not in any of the simulation scenarios.

To conclude, Carter et al.’s (2019) extensive simulation study did not produce results that match the results of the ego-depletion literature. We still have to explain how PET can produce significant negative estimates, when the other methods show evidence of a positive average effect.

Problems with PET Regression

While Carter et al. (2019) were the first to examine the influence of p-hacking, other simulation studies had examined the performance of PET regression in previous articles. The author of PET-PEESE published an article with the humble title “Limitations of PET-PEESE,” a rare title in academic journals, especially when it is about an author’s pet theory or model (pun intended) (Stanley, 2017). However, despite the humble title, the article concludes that PET-PEESE works well when the set of studies is reasonably large (k > 50) and heterogeneity is modest. Even with high heterogeneity, the risk is overestimation. The problem that PET may produce negative estimates even with a small positive effect size is not discussed. The problem may have gone unnoticed because Stanley (2017) did not examine p-hacking, and Carter et al.’s (2019) simulation study remains the only simulation study that examined the effects of p-hacking on PET regression estimates.

After reproducing Carter et al.’s (2019) simulation and ego-depletion results, I noticed a difference in the sample sizes. The simulation studies included many more studies with large samples than the ego-depletion literature. This led me to examine the influence of sample sizes on p-hacking simulations. Large sample sizes help regression-based models because the meta-analysis will include more studies with unbiased effect size estimates close to the intercept. These estimates will dampen any downward bias that is introduced by p-hacking. However, when these values are rare and the regression line is mostly determined by bias in small studies, the regression model may have a stronger downward bias.

Figure 1 shows the difference in sample sizes between the simulation studies and the actual ego-depletion literature.

Using the simulation sample sizes, I generated 20,000 studies and analyzed them in 4 meta-analyses of 5,000 studies each to examine large-sample bias. PET regression produced effect size estimates of d = .00, .02, .06, and .08. Thus, PET underestimates the true average population effect size of d = .20, but it does not produce negative estimates with Carter et al.’s (2019) simulation sample sizes.

I then created a simulation of the ego-depletion effect sizes by sampling from the 20,000 simulated studies to get a sample of k = 5,000 studies with 39% of sample sizes between 20 and 40 participants, 52% between 40 and 100, 8% between 100 and 200, and 1% between 200 and 400. The PET regression estimate was d = -.46.
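The resampling of sample sizes to match the ego-depletion composition can be sketched as weighted draws from the four bins. The bin boundaries and proportions are from the text; the sampling scheme itself is a hypothetical illustration, since the actual analysis resampled existing simulated studies rather than drawing fresh sample sizes.

```python
import random

# Bins and proportions matching the reported ego-depletion composition:
# 39% N = 20-40, 52% N = 40-100, 8% N = 100-200, 1% N = 200-400.
BINS = [((20, 40), 0.39), ((40, 100), 0.52), ((100, 200), 0.08), ((200, 400), 0.01)]

random.seed(1)

def draw_sample_size():
    """Draw one total sample size: pick a bin by weight, then a size within it."""
    (lo, hi), = random.choices([b for b, _ in BINS],
                               weights=[w for _, w in BINS])
    return random.randint(lo, hi)

sizes = [draw_sample_size() for _ in range(5000)]
share_small = sum(20 <= n <= 40 for n in sizes) / len(sizes)  # should be near .39
```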

To my knowledge this is the first simulation study that produces a negative average effect size estimate when the true average population effect size is positive. The conditions that produce these estimates are (a) large heterogeneity in population effect sizes, (b) few studies with large sample sizes that produce unbiased estimates of population effect sizes, and (c) p-hacking.

Importantly, Carter et al. (2019) recommend PET-PEESE for this situation that is typical for meta-analyses in psychology. I showed that the use of PET-PEESE in this scenario is only justified when there are sufficient studies with large samples (N > 400) that provide unbiased effect size estimates. However, this is not a realistic assumption for psychology, at least for studies that are not online studies. When sample sizes are modest, PET-regression can produce severely biased estimates that diverge from the results of other studies. Contrary to Carter et al.’s (2019) recommendation, the PET-regression results cannot be trusted in this scenario, and other methods are more likely to produce less biased estimates.

The present results also show a limitation that has been overlooked by Stanley (2017). Stanley noted correctly that “if all studies are small, PET-PEESE has little power to identify a genuine empirical effect. However, this weakness is easily addressed by raising the nominal level to 20% when all studies are highly underpowered” (p. 590). Here I showed that even if studies are relatively large (N = 200) and vary in size, PET regression can produce severely biased estimates. The problem is not the false positive risk, but rather the false negative risk and even sign reversals. Carter et al. (2019) suggested that negative estimates are probably caused by the lack of a true effect, but I showed that negative estimates can also mask a true positive average effect.

The present results do not imply that ego-depletion is a solid finding. They also do not resolve all of the inconsistencies in the meta-analysis of Carter et al.’s ego-depletion data. For example, the selection model also underestimates the average effect size, d = -.10, in this simulation. Thus, more research is needed to understand how p-hacking influences effect size estimates with different models and how the presence or absence of studies with large sample sizes affects these results. Trying to estimate the true average effect size from a set of p-hacked, underpowered studies may be impossible.

The main implication of this simulation study is that Carter et al.’s (2019) meta-analysis needs to be expanded to examine how p-hacking biases meta-analytic results and how the composition of sample sizes influences these results.

Carter et al. (2019): Simulation #40

In this series of blog posts, I am reexamining Carter et al.'s (2019) simulation studies that tested various statistical tools to conduct a meta-analysis. The reexamination has three purposes. First, I examine whether methods that claim to detect bias actually do so. Second, I examine the ability of methods that estimate heterogeneity to provide good estimates of heterogeneity. Third, I examine the ability of different methods to estimate the average effect size for three different sets of effect sizes, namely (a) the effect size of all studies, including effects in the opposite direction than expected, (b) all positive effect sizes, and (c) all positive and significant effect sizes.

The reason to estimate effect sizes for different sets of studies is that meta-analyses can have different purposes. If all studies are direct replications of each other, we would not want to exclude studies that show negative effects. For example, we want to know whether a drug has negative effects in a specific population. However, many meta-analyses in psychology rely on conceptual replications with different interventions and dependent variables. Moreover, many studies are selected to show significant results. In these meta-analyses, the purpose is to find subsets of studies that show the predicted effect with notable effect sizes. In this context, it is not relevant whether researchers tried many other manipulations that did not work. The aim is to find the ones that did work, and studies with significant results are the most likely ones to have real effects. Selection for significance, however, can lead to overestimation of effect sizes and underestimation of heterogeneity. Thus, methods that correct for bias are likely to outperform methods that do not in this setting.

I focus on the selection model implemented in the weightr package in R because it is the only tool that tests for the presence of bias and estimates heterogeneity. I examine the ability of these tests to detect bias when it is present and to provide good estimates of heterogeneity when heterogeneity is present. I also use the model result to compute estimates of the average effect size for all three sets of studies (all, only positive, only positive & significant).

The second method is PET-PEESE because it is widely used. PET-PEESE does not provide conclusive evidence of bias, but a positive relationship between effect sizes and sampling error suggests that bias is present. It does not provide an estimate of heterogeneity. It is also not clear which set of studies are the target of this method. Presumably, it is the average effect size of all studies, but if very few negative results are available, the estimate overestimates this average. I examine whether it produces a more reasonable estimate of the average of the positive results.
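The two regressions at the core of PET-PEESE can be sketched as follows. This is a Python illustration (the tutorial's own code is in R), and the one-sided cut-off for switching from PET to PEESE is a simplification of the published conditional procedure:

```python
import numpy as np

def pet_peese(d, se):
    """Sketch of PET-PEESE: inverse-variance weighted regressions of
    effect sizes on sampling errors (PET) or variances (PEESE).
    The intercept is the bias-corrected effect size estimate."""
    w = 1.0 / se**2  # inverse-variance weights

    def wls_intercept(x):
        # weighted least squares of d on x; returns intercept and its z-test
        X = np.column_stack([np.ones_like(x), x])
        W = np.diag(w)
        xtwx = X.T @ W @ X
        beta = np.linalg.solve(xtwx, X.T @ W @ d)
        cov = np.linalg.inv(xtwx)  # sampling variances treated as known
        return beta[0], beta[0] / np.sqrt(cov[0, 0])

    pet_est, pet_z = wls_intercept(se)       # PET: regress on standard errors
    peese_est, _ = wls_intercept(se**2)      # PEESE: regress on variances
    # conditional rule (simplified): use PEESE only if PET rejects H0
    return peese_est if pet_z > 1.645 else pet_est
```

With unbiased, homogeneous input (all true effects equal), both intercepts recover the true effect size; the interesting behavior arises when selection or p-hacking induces a correlation between effect sizes and sampling errors.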

P-uniform is included because it is similar to the widely used p-curve method, but has an R package and produces slightly superior estimates. I am using the LN1MINP method that is less prone to inflated estimates when heterogeneity is present. P-uniform assumes selection bias and provides a test of bias. It does not test heterogeneity. Importantly, p-uniform uses only positive and significant results. Its performance has to be evaluated against the true average effect size for this subset of studies rather than all studies that are available. It does not provide estimates for the set of only positive results (including non-significant ones) or the set of all studies, including negative results.

Finally, I am using these simulations to examine the performance of z-curve.2.0 (Bartos & Schimmack, 2022). Using exactly the simulations that Carter et al. (2019) used prevents me from simulation hacking; that is, testing situations that show favorable results for my own method. I am testing the performance of z-curve to estimate average power of studies with positive and significant results and the average power of all positive studies. I then use these power estimates to estimate the average effect size for these two sets of studies. I also examine the presence of publication bias and p-hacking with z-curve.

The present blog post examines a scenario with a small effect size, d = .2, in all studies. That is, there is no heterogeneity, tau = 0. This condition is of interest because the selection model did very well with a small effect size and high heterogeneity (tau = .4), but struggled with no effect, d = 0, and homogeneity, tau = 0. One reason for problems with homogeneity is that the default specification of the selection model (3PSM) used by Carter et al. (2019) assumes heterogeneity and sometimes fails to converge when the data are homogenous. This problem can be solved by specifying a fixed-effects model. Here, I examine whether the fixed-effects model produces good estimates of the population effect size, d = .2, in the condition with high p-hacking and no selection bias (simulation #40).

Simulation

I focus on a sample size of k = 100 studies to examine the properties of confidence intervals with a large, but not unreasonable, set of studies. The effect size is d = .2 and there is no heterogeneity, tau = 0. This simulation examines Carter et al.'s (2019) scenario of high p-hacking without selection bias.
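The simulated scenario can be sketched as follows. The optional-stopping routine below is a generic, hypothetical p-hacking strategy chosen for illustration; it is not necessarily the exact algorithm that Carter et al. (2019) simulated, and all parameter names are my own:

```python
import numpy as np

rng = np.random.default_rng(1)

def phacked_study(d=0.2, n=100, peeks=3):
    """One two-group study with optional stopping: test after the planned
    n per group, then add participants and re-test up to `peeks` times.
    A generic p-hacking sketch, not Carter et al.'s (2019) exact algorithm."""
    g1, g2 = rng.normal(d, 1, n), rng.normal(0, 1, n)
    for peek in range(peeks + 1):
        est = g1.mean() - g2.mean()
        se = np.sqrt(g1.var(ddof=1) / len(g1) + g2.var(ddof=1) / len(g2))
        if abs(est / se) > 1.96 or peek == peeks:
            return est, se
        # not significant yet: collect more data and try again
        g1 = np.append(g1, rng.normal(d, 1, n // 2))
        g2 = np.append(g2, rng.normal(0, 1, n // 2))

def simulate(k=100):
    results = np.array([phacked_study() for _ in range(k)])
    est, se = results[:, 0], results[:, 1]
    odr = np.mean(np.abs(est / se) > 1.96)  # observed discovery rate
    return est, se, odr
```

Because studies stop as soon as the test crosses the threshold, the significant subset overrepresents inflated estimates, which is the pattern the bias-correction methods below have to undo.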

Figure 1 shows the distribution of the effect size estimates (extreme values below d = -1.2 and above d = 2 are excluded). P-hacking …


Selection Model (weightr)

The most important question is whether the selection model underestimates a small effect size and might fail to reject the false null hypothesis in this scenario. P-hacking did reduce the effect size estimate, average d = .12, average 95%CI = .06 to .17, and only 20% of confidence intervals included the true value d = .20. However, 90% of confidence intervals did not include a value of 0 and rejected the null hypothesis that the effect size is 0.

All simulations had sufficient non-significant results to estimate selection bias. The average selection weight for non-significant results was .16, average 95%CI = .01 to .31, and all confidence intervals excluded a value of 1. Thus, the model correctly noticed that there were too few non-significant results.

The model also tested p-hacking by examining the selection weight for just significant results with p-values between .05 and .01 (two-sided). The average weight was 1.56, average 95%CI = 1.40 to 1.71 and none of the confidence intervals included 1. Thus, the model detected p-hacking, but was not able to fully correct the effect size estimate accordingly.


In short, the selection model performed relatively well, but it underestimated the true effect size by d = .10, and failed to reject the false null hypothesis in 10% of the simulations. Thus, other models have a chance to outperform the selection model in this scenario.

PET-PEESE

PET regresses effect sizes on the corresponding sampling errors. The average estimate was d = .11, average 95%CI = .04 to .19, and only 37% of confidence intervals included the true value of d = .20. Moreover, only 75% of confidence intervals excluded 0. Thus, 25% of the confidence intervals failed to reject the null hypothesis.

When PET rejected the null hypothesis (75% of the time), underestimation of the effect size is less of a problem because the PET-PEESE approach then uses the estimate from the PEESE regression of effect sizes on the sampling variance. The average PEESE estimate was d = .17, average 95%CI = .13 to .22, and 80% of confidence intervals included the correct value.

In sum, PET-PEESE outperforms the selection model in terms of effect size estimation because PEESE estimates are better, but the selection model outperforms the PET-PEESE approach because it rejects the false null hypothesis more often.

Despite these differences, it is also noteworthy that the two models show very similar results. Thus, p-hacking with small effect sizes cannot explain why Carter et al.’s (2019) meta-analysis of the ego-depletion literature showed very different results for the selection model (d = .3) and PET-PEESE (d = -.3).

P-Uniform

P-uniform detected bias in only 87% of the simulations. Thus, it was not as sensitive as the selection model, and it cannot distinguish between p-hacking and selection bias.

Like the other models, p-uniform underestimated the true effect size, average d = .06, average 95%CI = -.20 to .21. Only 64% of the confidence intervals included the true value of d = .2. More problematic was the finding that only 13% of the confidence intervals did not include zero. Thus, p-hacking and small effect sizes lead to many false negative results. Clearly, p-uniform is not a useful method in this scenario.

Z-Curve

The estimated average power of studies with positive and significant results was 14%, average 95%CI = 4% to 30%, and 84% of confidence intervals included the true value of 22%. The results show that p-hacking leads to an underestimation of the true power.

The estimated average power of studies with positive results was 8%, average 95%CI = 5% to 20%, and only 49% of confidence intervals included the true value of 20%. Thus, p-hacking has an even stronger downward bias for this estimate.

Despite the bias in the power estimate, the average effect size estimate for studies with positive and significant results was unbiased, d = .19, average 95%CI = .02 to .34, and 99% of the confidence intervals included the true value of d = .20.

In contrast, the estimate for the set of studies with positive results had a downward bias, average d = .10, average 95%CI = .00 to .26, and only 77% of confidence intervals included the true value of d = .2.
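Converting a power estimate back into an effect size requires an assumption about the design and sample sizes. A minimal sketch for a two-sample z-test with n participants per group (my own illustrative conversion, not necessarily the one used in the z-curve package) inverts the power function:

```python
from statistics import NormalDist

SN = NormalDist()  # standard normal

def power_to_d(power, n_per_group, alpha=0.05):
    """Invert the power function of a two-sample z-test: recover the
    standardized effect size implied by an average-power estimate.
    The lower rejection region is ignored, which is negligible for
    clearly positive effects."""
    z_crit = SN.inv_cdf(1 - alpha / 2)        # about 1.96
    ncp = z_crit + SN.inv_cdf(power)          # implied noncentrality d*sqrt(n/2)
    return ncp / (n_per_group / 2) ** 0.5     # solve ncp = d * sqrt(n/2) for d
```

For example, 50% power with n = 100 per group implies a noncentrality of 1.96 and hence d of about .28. With varying sample sizes, the conversion would be applied per study and averaged.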

Like the selection model and p-uniform, z-curve also detected bias in all simulations. It was not able to test p-hacking because there were too few z-scores greater than 2.6.

Conclusion

The main conclusion from this specific simulation study is that the selection model performed well. Although the PET-PEESE model had better effect size estimates when the PET model rejected the null-hypothesis and PEESE estimates were used, it had a higher risk of false negative results when PET failed to reject the null-hypothesis. This is an important finding because Carter et al. (2019) favored the PET result over the selection model to conclude that the true ego-depletion effect is zero. The present results suggest that a non-significant result with PET and a significant positive estimate with the selection model might be the result of p-hacking with small effect sizes and underestimation of the average effect with PET.

Carter et al.'s (2019) simulation of "high" p-hacking is also not very extreme. As Carter et al. (2019) noted, more extreme p-hacking may lead to different results. The present simulation suggests that more extreme p-hacking would produce a more severe downward bias and a high rate of false negative results. They recommended further research, but to my knowledge, simulations of more extreme p-hacking are lacking.

Carter et al. (2019): Simulation #136


The present blog post examines a scenario with a small effect size, d = .2, in all studies. That is, there is no heterogeneity, tau = 0. This condition is of interest because the selection model did very well with a small effect size and high heterogeneity (tau = .4), but struggled with no effect, d = 0, and homogeneity, tau = 0. One reason for problems with homogeneity is that the default specification of the selection model (3PSM) used by Carter et al. (2019) assumes heterogeneity and sometimes fails to converge when the data are homogenous. This problem can be solved by specifying a fixed-effects model. Here, I examine whether the fixed-effects model produces good estimates of the population effect size, d = .2, in the condition with high selection bias and high p-hacking (simulation #136).

Simulation

I focus on a sample size of k = 100 studies to examine the properties of confidence intervals with a large, but not unreasonable set of studies. The effect size is d = .2 and there is no heterogeneity, tau = 0.

Figure 1 shows the distribution of the effect size estimates (extreme values below d = -1.2 and above d = 2 are excluded). Selection bias and p-hacking move the distribution of effect size estimates to the right. A meta-analysis that ignores bias would overestimate the average population effect size.


Selection Model (weightr)

The most important question is whether the selection model underestimates a small effect size and might fail to reject the false null hypothesis in this scenario. That was not the case. The average effect size estimate was d = .20, average 95%CI = .18 to .23, and 93% of the confidence intervals included the true value of d = .20. None of the confidence intervals included zero. Thus, the false null hypothesis was rejected in all simulations.

When non-significant results were available to estimate the selection weight for non-significant results, the model always showed evidence for selection bias; that is, the 95% confidence interval did not include a value of 1.

The model also correctly identified p-hacking. The average selection weight for just significant results with p-values between .05 and .01 (two-sided) was 3.38 with an average 95% confidence interval ranging from 2.15 to 4.62.

In short, the selection model performed well and was able to detect a small effect size in this simulation without bias.

PET-PEESE

PET regresses effect sizes on the corresponding sampling errors. The average estimate was d = .09, average 95%CI = .07 to .12. Thus, PET underestimated the true value, and none of the confidence intervals included the true parameter, d = .2. However, all confidence intervals rejected the null hypothesis.

As all confidence intervals rejected the null hypothesis, the PET-PEESE approach recommends the use of the PEESE regression of effect sizes on sample variances. The average effect size estimate was d = .22, average 95%CI = .20 to .24, and 44% of the confidence intervals included the true value of d = .2.

In short, PET-PEESE works well in this simulation, but it did not perform better than the selection model. The simulation also does not reproduce the notable difference between the two methods observed in Carter et al.'s (2019) meta-analysis of ego-depletion, where the average effect size estimates were -.27 for PET-PEESE and .33 for the selection model.

P-Uniform

P-uniform detected bias in all of the simulations, just like the selection model. However, p-uniform was not as good at estimating the true effect size. The average estimate was d = .07, average 95%CI = -.11 to .19, and only 44% of confidence intervals included the true value. This poor performance can be explained by the influence of p-hacking.

Z-Curve

The estimated average power of studies with positive and significant results was 15%, average 95%CI = 4% to 28%, and 76% of confidence intervals included the true value of 23%. The results show that p-hacking leads to an underestimation of the true power.

The estimated average power of studies with positive results was 7%, average 95%CI = 5% to 18%, and only 19% of confidence intervals included the true value of 23%. Thus, p-hacking has an even stronger downward bias for this estimate.

Despite the bias in the power estimate, the average effect size estimate for studies with positive and significant results was unbiased, d = .21, average 95%CI = .03 to .32, and 99% of the confidence intervals included the true value of d = .20.

In contrast, the estimate for the set of studies with positive results had a downward bias, average d = .09, average 95%CI = .00 to .25.

Like the selection model and p-uniform, z-curve also detected bias in all simulations. It also detected p-hacking in 47% of all simulations, but this performance is not as good as the performance of the selection model that identified p-hacking in all simulations.

Conclusion

The main conclusion from this specific simulation study is that the selection model performed well and outperformed all other methods. While PET-PEESE produced a good average estimate, the 95% confidence interval included the true value only 44% of the time. Thus, this simulation offers no justification for the use of PET-PEESE over the selection model. The selection model also has the advantage that it provides an estimate of heterogeneity and correctly detected selection and p-hacking. PET (but not PET-PEESE), p-uniform, and z-curve also showed a downward bias that is explained by the assumption of these models that excessive significant results are due to selection bias. When excessive significant results are produced by p-hacking, just significant results are overrepresented and lead to an underestimation of the average effect size. The effect is mild in these simulations, but Carter et al.’s (2019) high p-hacking condition still assumes rather mild p-hacking. It is therefore important to examine the performance of these models with more intense p-hacking.

Carter et al. (2019): Simulation #132


The present blog post examines a scenario where the null hypothesis is true in all studies. In this scenario, all significant results are false positives. If all results were published, it would be easy to see that the few significant results are statistical flukes. However, selection for significance can suppress non-significant results, and p-hacking can turn non-significant results into significant ones. Simulation #132 examines Carter et al.'s (2019) scenario of high selection bias and high p-hacking. The analysis of simulation #36 showed that p-hacking alone produced only a small bias that was easily detected and corrected by all methods. Simulation #100 showed that selection bias led to a small bias for the selection model, and other methods outperformed the selection model. The simulation of p-hacking also showed a tendency of selection models to overcorrect. Carter et al. (2019) already observed that selection bias and p-hacking often cancel each other out. Thus, it is likely that all models perform well in this condition.

Simulation

I focus on a sample size of k = 100 studies to examine the properties of confidence intervals with a large, but not unreasonable set of studies. The effect size is d = 0 and there is no heterogeneity, tau = 0.

Figure 1 shows the distribution of the effect size estimates (extreme values below d = -1.2 and above d = 2 are excluded). Selection bias and p-hacking move the distribution of effect size estimates to the right. A meta-analysis that ignores bias would overestimate the average population effect size.

Histograms of effect sizes do not show the proportion of significant and non-significant results. This can be examined by computing the ratio of effect sizes over sampling errors and treating these ratios as approximate z-scores. Alternatively, the sample sizes can be used to compute t-values, and the corresponding p-values can be used to convert the t-values into z-values.
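The exact t-to-z conversion via p-values can be sketched as follows. This is a Python illustration of the idea described above (the tutorial's own code is in R); it maps a t-value to the z-value with the same two-sided p-value:

```python
from scipy.stats import norm, t as t_dist

def t_to_z(t_value, df):
    """Convert a t-value into the z-value with the same two-sided
    p-value, preserving the sign of the original statistic."""
    p = 2 * t_dist.sf(abs(t_value), df)  # two-sided p-value of the t-test
    z = norm.isf(p / 2)                  # z-score with the same p-value
    return z if t_value >= 0 else -z
```

With large degrees of freedom the conversion barely matters (t and z nearly coincide), but with small samples the heavier tails of the t-distribution pull the converted z-value noticeably below the raw t-value.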

Figure 2 shows the plot of the z-values and the analysis of the z-values with z-curve, using the standard selection model that fits the model to all statistically significant z-values. Six results with negative effects are excluded, leaving 4,994 studies with positive effects. Selection bias and p-hacking produced an observed discovery rate (ODR; i.e., the percentage of significant positive results) of 89%. All of these effects are false positives.

Z-curve correctly estimates that the power is 5%. That is, with alpha = .05, 5% of significant results are expected due to chance. The power for the subset of statistically significant results is also 5%. The z-curve estimate for the expected replication rate is 2.5% because it also takes the sign of the effect into account. The power estimates can also be converted into effect size estimates. In this case, this is simple because 5% power implies that all studies have an effect size of zero.
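The relationship between power and the expected replication rate under the null can be verified with a few lines of normal-distribution arithmetic; this sketch treats the test statistic as N(ncp, 1), where ncp is the noncentrality:

```python
from statistics import NormalDist

SN = NormalDist()
Z_CRIT = 1.959964  # critical z-value for alpha = .05, two-sided

def power(ncp):
    """Probability of a significant result in either direction for a
    test statistic distributed N(ncp, 1)."""
    return (1 - SN.cdf(Z_CRIT - ncp)) + SN.cdf(-Z_CRIT - ncp)

def expected_replication_rate(ncp):
    """Probability of a significant result with the same (positive) sign.
    Under the null (ncp = 0) only half of the alpha = .05 rejection
    region lies on the original side, giving alpha/2 = .025."""
    return 1 - SN.cdf(Z_CRIT - ncp)
```

At ncp = 0 the power is exactly 5% while the expected replication rate is 2.5%, matching the z-curve estimates reported above.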

To distinguish selection bias and p-hacking, z-curve can be fitted to only “really” significant results with z-values greater than 2.6. P-hacking would produce more just significant results (2 to 2.6, p = .05 to .01 two-sided) than the model predicts. The results of this test are presented in Figure 3.

With this large set of studies, z-curve can detect that p-hacking produced too many just significant results. However, in smaller sets of studies, there may not be enough “really” significant results. The simulation examines the power of this p-hacking test.
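The logic of the p-hacking test can be illustrated with the expected share of just-significant results. The actual z-curve test fits the model to z-values above 2.6 and compares predicted and observed counts in the just-significant window; the simpler sketch below only computes the expected share for a single noncentrality, to show that p-hacking (which piles up z-values just above 1.96) must push the observed share above the model-implied one:

```python
from statistics import NormalDist

SN = NormalDist()

def just_sig_share(ncp):
    """Expected share of just-significant z-values (1.96 to 2.6, i.e.,
    p between .05 and .01 two-sided) among all significant positive
    z-values, for a test statistic distributed N(ncp, 1)."""
    p_sig = 1 - SN.cdf(1.96 - ncp)                    # P(z > 1.96)
    p_just = SN.cdf(2.6 - ncp) - SN.cdf(1.96 - ncp)   # P(1.96 < z < 2.6)
    return p_just / p_sig
```

Even under the null, about 81% of significant z-values fall in the just-significant window, and this share drops as power increases; an observed share well above the model-implied value signals p-hacking.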

Selection Model (weightr)

Carter et al. (2019) observed convergence issues with the selection model in some conditions. I also encountered this problem, but was able to diagnose and fix it. The biggest problem is that the default 3PSM model used by Carter et al. (2019) tries to estimate heterogeneity. When the data are homogenous and all population effect sizes are the same, the model sometimes converged and showed a heterogeneity parameter of zero, but in about half of the cases it did not converge. In these cases, it was possible to get effect size estimates by specifying a fixed-effects model that assumes homogeneity. Here, I fitted the fixed-effects model from the start.

However, this did not solve all problems. In 75% of the simulations, the model provided an effect size estimate, but was unable to estimate sampling error and compute confidence intervals.

The average effect size estimate was close to the true value of zero, but showed a small positive bias, d = .06, like the simulation with high selection bias. Estimates ranged from .02 to .10. However, without confidence intervals, researchers would not know how precise the effect size estimate for their specific dataset is.

The model also provided no test of bias and the selection weight for non-significant results was set to .01, even when non-significant results were available.

Overall, this is the worst performance of the selection model so far. It is also worse than the performance in simulations with only p-hacking or only selection bias. A plausible explanation that would require further study is that p-hacking shrinks the estimate of the sampling error below the variance that is expected from an unbiased model. If this is the problem, it might be possible to fix this parameter at a theoretical minimum.

In sum, the selection model is not very useful in this scenario and other models can outperform it simply by providing confidence intervals around the average effect size estimate.

PET-PEESE

PET regresses effect sizes on the corresponding sampling errors. The average estimate was d = -.04, average 95%CI = -.08 to .00, a slight negative bias, and only 63% of the confidence intervals included the true parameter, d = 0. None of the confidence intervals rejected the null hypothesis in favor of a positive effect. Thus, the model correctly showed that there was no positive effect in these simulations.

When PET does not show a positive and significant result, the PET estimate is used. Thus, the PEESE regression results are not theoretically relevant. PEESE regresses the effect sizes on the sampling variance (i.e., the squared sampling error). PEESE overestimates the true average effect size even more, average d = .18, average 95%CI = .15 to .21, and all confidence intervals rejected the null hypothesis in favor of a positive effect. This confirms that PEESE results have a positive bias and should not be used when PET does not reject the null hypothesis.
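The two regressions and the conditional rule can be sketched with a small weighted-least-squares routine. This is a minimal illustration of the logic described above, not the implementation used in the simulations; the one-sided critical value of 1.645 is an illustrative choice.

```python
from math import sqrt

def wls_intercept(x, y, w):
    """Weighted least squares y = b0 + b1*x; returns the intercept and its SE."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s2 = sum(wi * ri ** 2 for wi, ri in zip(w, resid)) / (len(y) - 2)
    se_b0 = sqrt(s2 * (1 / sw + xbar ** 2 / sxx))
    return b0, se_b0

def pet_peese(d, se, z_crit=1.645):
    """PET-PEESE sketch with weights 1/SE^2: PET regresses d on SE,
    PEESE regresses d on SE^2; the bias-corrected estimate is the
    intercept. PEESE is used only if the PET intercept is positive
    and significant (one-sided test at the illustrative z_crit)."""
    w = [1 / s ** 2 for s in se]
    b_pet, se_pet = wls_intercept(se, d, w)
    if b_pet > 0 and b_pet / se_pet > z_crit:
        b_peese, _ = wls_intercept([s ** 2 for s in se], d, w)
        return b_peese
    return b_pet
```

With null data whose estimates are unrelated to their sampling errors, the PET intercept is (close to) zero and the function returns the PET estimate, as the conditional rule requires.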

In short, PET-PEESE outperforms the selection model when the data have no evidential value and there is high selection bias and p-hacking.

P-Uniform

P-uniform detected bias in 100% of the simulations. This makes p-uniform an attractive model for testing bias in this simulation. PET-PEESE uses regression to test for bias, but it is agnostic about the source of a relationship between sample size and effect sizes. P-uniform clearly shows that bias is present and caused by selection, p-hacking, or both.

P-uniform was not as good at estimating the true effect size. The average estimate was d = -.14, average 95%CI = -.43 to .06; 85% of CIs included the true value and none included a small positive effect size, d = .2. Thus, all models agree that there is no positive effect, but the PET-PEESE estimates are closest to the true value of zero.

Z-Curve

The estimated average power of studies with positive and significant results was 2.8%, average 95% CI = 2.5% to 7.7%. All confidence intervals included the true value of 2.5%. The most important finding here is the tight confidence interval. Scientifically speaking, the z-curve analysis clearly shows that the data are crap. While most of the results are significant, they contain no evidence to reject the null hypothesis.

The estimated average power of all studies with positive results is 5%, average 95%CI = 5% to 10%. This finding just confirms the previous results. Moreover, similar power estimates show that selection for significance does not increase power, which implies homogeneity.

Converting the power estimate into effect size estimates gives an average effect size estimate of d = .00, 95%CI = .00 to .12, and all confidence intervals include the true value of zero. The model does not show a negative bias because power is bounded below by alpha; thus, p-hacking cannot produce lower estimates. 78% of the confidence intervals did not include a value of .2. Thus, z-curve is less able to rule out a small positive effect, but the other methods were only able to do so because p-hacking gave their point estimates a negative bias.
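The conversion from a power estimate to an effect size estimate can be sketched by inverting the normal power approximation. The per-group sample size and the one-tail simplification are illustrative assumptions, not necessarily how the z-curve package performs the conversion.

```python
from statistics import NormalDist

N = NormalDist()

def power_to_d(power, n_per_group, alpha=0.05):
    """Invert the one-tail normal power approximation: recover the
    noncentrality parameter implied by a power estimate, then rescale
    it into Cohen's d for a two-group design. Ignoring the opposite
    tail makes this a rough sketch, usable for power well above alpha."""
    z_crit = N.inv_cdf(1 - alpha / 2)        # 1.96 for alpha = .05
    ncp = z_crit + N.inv_cdf(power)          # implied mean of the z distribution
    return ncp * (2 / n_per_group) ** 0.5    # d = ncp / sqrt(n/2)
```

For example, an average power of 50% with 50 participants per group implies d of about .39 under these assumptions.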

Just like p-uniform, z-curve correctly detected bias in all simulations. It was not possible to test for p-hacking in this simulation because there were too few studies with z-values greater than 2.6 to fit the model.

Conclusion

The main conclusion from this specific simulation study is that the selection model struggled in this condition. While it provided only slightly positively biased effect size estimates, it was not able to estimate sampling error to provide information about uncertainty in these estimates.

P-uniform and PET-PEESE produced negative estimates due to p-hacking, but the bias was small and confidence intervals included zero. Z-curve was not affected by p-hacking in this condition without a real effect because estimates of power are limited to alpha at the lower end.

P-uniform and z-curve correctly detected bias in the data.

Overall, this simulation shows the advantage of using multiple methods. However, the results do not support Carter et al.’s (2019) main conclusion that models can produce different results in different conditions. All models suggested that the data lack evidential value, despite the fact that most of the studies produced positive and significant results. P-uniform and z-curve also explain the reason for this discrepancy. Significance was produced with selection or p-hacking (or both) rather than with real effects.

Carter et al. (2019): Simulation #36

In this series of blog posts, I am reexamining Carter et al.’s (2019) simulation studies that tested various statistical tools for meta-analysis. The reexamination has three purposes. First, it examines whether methods that claim to detect bias actually do so. Second, I examine the ability of methods that estimate heterogeneity to provide good estimates of heterogeneity. Third, I examine the ability of different methods to estimate the average effect size for three different sets of effect sizes, namely (a) the effect size of all studies, including effects in the opposite direction than expected, (b) all positive effect sizes, and (c) all positive and significant effect sizes.

The reason to estimate effect sizes for different sets of studies is that meta-analyses can have different purposes. If all studies are direct replications of each other, we would not want to exclude studies that show negative effects. For example, we want to know whether a drug has negative effects in a specific population. However, many meta-analyses in psychology rely on conceptual replications with different interventions and dependent variables. Moreover, many studies are selected to show significant results. In these meta-analyses, the purpose is to find subsets of studies that show the predicted effect with notable effect sizes. In this context, it is not relevant whether researchers tried many other manipulations that did not work. The aim is to find the ones that did work, and studies with significant results are the most likely ones to have real effects. Selection for significance, however, can lead to overestimation of effect sizes and underestimation of heterogeneity. Thus, methods that correct for bias are likely to outperform methods that do not in this setting.

I focus on the selection model implemented in the weightr package in R because it is the only tool that tests for the presence of bias and estimates heterogeneity. I examine the ability of these tests to detect bias when it is present and to provide good estimates of heterogeneity when heterogeneity is present. I also use the model result to compute estimates of the average effect size for all three sets of studies (all, only positive, only positive & significant).

The second method is PET-PEESE because it is widely used. PET-PEESE does not provide conclusive evidence of bias, but a positive relationship between effect sizes and sampling errors suggests that bias is present. It does not provide an estimate of heterogeneity. It is also not clear which set of studies is the target of this method. Presumably, it is the average effect size of all studies, but if very few negative results are available, the estimate overestimates this average. I examine whether it produces a more reasonable estimate of the average of the positive results.

P-uniform is included because it is similar to the widely used p-curve method, but has an R package and produces slightly superior estimates. I am using the LN1MINP method that is less prone to inflated estimates when heterogeneity is present. P-uniform assumes selection bias and provides a test of bias. It does not test heterogeneity. Importantly, p-uniform uses only positive and significant results. Its performance has to be evaluated against the true average effect size for this subset of studies rather than all studies that are available. It does not provide estimates for the set of only positive results (including non-significant ones) or the set of all studies, including negative results.

Finally, I am using these simulations to examine the performance of z-curve.2.0 (Bartos & Schimmack, 2022). Using exactly the simulations that Carter et al. (2019) used prevents me from simulation hacking; that is, testing situations that show favorable results for my own method. I am testing the performance of z-curve to estimate average power of studies with positive and significant results and the average power of all positive studies. I then use these power estimates to estimate the average effect size for these two sets of studies. I also examine the presence of publication bias and p-hacking with z-curve.

This blog post examines a rather simple, but important scenario. It assumes that all studies tested a true null hypothesis (i.e., there is no real effect), but p-hacking produces a high percentage of significant results. This simulation assumes that there is only p-hacking and no selection bias. Thus, all studies that were conducted are “reported” and available for the meta-analysis, but p-hacking inflates the effect sizes of some studies.

Simulation

I focus on a sample size of k = 100 studies to examine the properties of confidence intervals with a large, but not unreasonable set of studies. Smaller sets of studies may benefit the selection model because wider confidence intervals are less likely to produce false positive results. In this scenario, the true population effect size is zero for any set of studies because it is zero in every study.
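How p-hacking inflates the significance rate under a true null can be sketched with a simple "best of m attempts" mechanism; the three-attempt rule is an illustrative assumption, not the exact protocol Carter et al. (2019) simulated.

```python
import random

def phacked_z(n_attempts=3, rng=None):
    """One p-hacked study under a true null: run n_attempts independent
    tests and report the most favorable z-value."""
    rng = rng or random
    return max(rng.gauss(0.0, 1.0) for _ in range(n_attempts))

rng = random.Random(123)
zs = [phacked_z(rng=rng) for _ in range(20_000)]
sig_rate = sum(z > 1.96 for z in zs) / len(zs)
# under the null, P(max of 3 > 1.96) = 1 - .975^3, about .073 instead of .025
```

Even this mild form of p-hacking roughly triples the rate of significant results without changing the true effect size of zero.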

Figure 1 shows the distribution of the effect size estimates (extreme values below d = -1.2 and above d = 2 are excluded). The influence of p-hacking is visible.

Histograms of effect sizes do not show the proportion of significant and non-significant results. This can be examined by computing the ratio of effect sizes over their sampling errors and treating these ratios as approximate z-scores. Alternatively, the sample sizes can be used to compute t-values, which are then converted into z-values via their p-values.
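The first conversion can be sketched as follows; the standard-error formula for Cohen's d in a two-group design is standard, but the normal approximation skips the exact t-to-p-to-z route mentioned above.

```python
from statistics import NormalDist

N = NormalDist()

def d_to_z(d, n1, n2):
    """Approximate z-score: divide the effect size estimate by its
    standard error (two independent groups, Cohen's d)."""
    se = ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5
    return d / se

z = d_to_z(0.5, 50, 50)      # roughly 2.46
p = 2 * (1 - N.cdf(z))       # two-sided p, roughly .014, significant
```

Counting how many of these z-values exceed 1.96 gives the proportion of significant results that the histogram hides.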

Figure 2 shows the plot of the z-values and the analysis of the z-values with z-curve, using the standard selection model that fits the model to all statistically significant z-values. The plot shows clear evidence of selection bias. Z-curve’s estimate of average power for the significant results is 5% (compared to the true value of 2.5%, because the ERR takes the sign of replication results into account). The confidence interval includes the true value of 2.5%. Thus, the model does not reject the null hypothesis that all significant results are false positives. The average power for all positive results (i.e., the expected discovery rate, EDR) is estimated at 6%, while the true value is 5%. The 95%CI includes the true value. Incidentally, z-curve uses this result to estimate that 89% of the significant results are false positives, but the confidence interval includes the true value of 100%.
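The step from the EDR to an estimated false positive rate follows the logic of Sorić's upper bound on the false discovery rate; z-curve's exact computation may differ, so this is only a sketch of the logic.

```python
def soric_fdr_bound(edr, alpha=0.05):
    """Soric-style upper bound on the false discovery rate implied by
    an expected discovery rate (EDR): the maximum share of significant
    results that could be false positives, given that true nulls become
    significant at rate alpha."""
    return (1 / edr - 1) * alpha / (1 - alpha)

# an EDR of 6% bounds the FDR at about .82; an EDR at the alpha
# level of 5% implies that all significant results may be false positives
```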

The simulation shows the performance of z-curve with just 100 studies and an even smaller set of significant results used for model fitting. Smaller sets of studies will only widen the CIs and are unlikely to change the result that z-curve performs well in this scenario.

Model Specifications

The specification of the z-curve model was explained and justified above.

Carter et al. (2019) observed convergence issues with the default (3PSM) selection model in some conditions. This is one of these conditions. However, the problem is that the 3PSM model assumes heterogeneity and has problems when all population effect sizes are the same. This does not mean that selection models cannot be used with these data. A solution is to fit the random-effects model first and, when it fails, to run the fixed-effects model. I used this approach and obtained usable results in all simulations.

The specification of the selection model remains the same as in the previous simulations and differed from Carter et al.’s default 3PSM model. First, I added steps to model different selection for positive and negative results. Second, I added a step to allow for different selection of non-significant and significant negative results. Third, I added a step to identify p-hacking that produces too many just significant results. The range of just significant results was defined as p-values between .05 and .01, two-sided. The one-sided p-value steps are c(.005,.025,.050,.500,.975).
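The step specification partitions one-sided p-values into intervals that each receive their own relative selection weight; this sketch only shows the bookkeeping, not the selection model's actual likelihood.

```python
import bisect

STEPS = [.005, .025, .050, .500, .975]   # one-sided cutpoints from the text

def interval_index(p_one_sided):
    """Index of the interval a one-sided p-value falls into; each
    interval gets its own relative selection weight in the fitted model."""
    return bisect.bisect_left(STEPS, p_one_sided)

# p = .01 (just significant, positive) falls in interval 1,
# p = .20 (non-significant, positive) falls in interval 3
```

The extra cutpoints at .500 and .975 are what allow the model to weight negative results, significant or not, separately from positive ones.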

P-uniform and PET-PEESE do not require additional specifications and were used as usual.

Selection Model (weightr)

The random-effects model failed to converge in 43% of the simulations. This is actually the better result because a convergence failure implies zero heterogeneity, which is the true value. In the other 57% of cases, the random-effects model estimated non-zero heterogeneity, average tau = .01, average 95%CI = .00 to .08; 98% of confidence intervals included the true value of 0. Therefore, the selection model provides valuable information about heterogeneity and shows that there is very little or none. This is important because it implies that all sets of studies have the same average effect size, and the estimate for all studies is the same as for sets selected for significance.

The fixed and the random effects model had difficulties estimating the selection weights for a model with several steps. However, the model with a single step that distinguishes significant positive and all other effect size estimates worked well. Therefore, I reran the simulation with a fixed effect model and a single step at p = .025 (one-tailed).

The average effect size estimate for all studies was d = -.03, average 95%CI = -.06 to .00, and 67% of the confidence intervals included the true value of zero. The other confidence intervals implied a small negative effect. The reason is that p-hacking introduces a negative bias in selection models: p-hacking produces more just-significant results than clearly significant results, but the selection model assumes that both are equally likely to be selected when the null hypothesis is true. However, the bias is very small. Thus, the selection model provides the correct information about the effect size for all studies, and because it shows that there is no heterogeneity, the results also imply that even the subset of studies with positive and significant results contains only false positives.

The selection model also detected bias, although it was not able to distinguish between selection and p-hacking. The average selection weight for non-significant results was .10, average 95%CI = .03 to .18, and none of the confidence intervals included the value of 1 that would indicate no bias.

To summarize, the main problem for the selection model was that it had to be reduced to a two-parameter model (2PSM): a fixed-effect model without a random heterogeneity parameter (tau fixed at 0) and with a single step to distinguish significant and non-significant results. This model fit well and correctly revealed that the 100 studies lack evidential value.

PET-PEESE

PET regresses effect sizes on the corresponding sampling errors. The average estimate was d = -.05, 95%CI = -.13 to .03, and 78% of the confidence intervals included the true value of zero. Thus, PET also shows a negative bias due to p-hacking, but the bias is also small.

PEESE regresses the effect sizes on the sampling variance (i.e., the squared sampling error). It is only relevant if PET shows a positive and significant result, which was not the case in this simulation. PEESE results also show no effect, d = -.03, 95%CI = -.07 to .02, and 83% of confidence intervals included the true value.

In short, PET-PEESE performs well in this simulation, but it does not perform better than the selection model.

P-Uniform

P-uniform also has a bias test, but it detected bias in only 8% of the simulations. Thus, the selection model has a better bias test.

The average effect size estimate for the set of studies with positive and significant results was d = -.15. Thus, the model is more strongly biased than the other models. The reason is that it does not use the non-significant results and p-hacking has a stronger influence on the small subset of studies with significant results. The small sample size of significant results also implies that the confidence interval is very wide, average CI = -1.07 to .25. While 95% of confidence intervals included the true value of 0, 65% of confidence intervals also included a value of d = .2, which is considered a small effect size. Thus, the model makes it more difficult to rule out that a set of studies has evidential value. In short, p-uniform does not add useful information in this simulation.

Z-Curve

One way to test for selection bias with z-curve is to fit the model to all positive results (non-significant and significant) and to test for an excess of just-significant results (z = 2 to 2.6, ~ p = .05 to .01). This test was significant in only 67% of all simulations. Thus, the selection model has more power to detect bias.
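The logic of this bias test, comparing the observed share of just-significant z-values to a model-implied share, can be sketched with a simple one-sided proportion test; `expected_prop` is a placeholder for the fitted model's prediction, and the actual z-curve test is more involved.

```python
from statistics import NormalDist

N = NormalDist()

def excess_just_significant(z_values, expected_prop):
    """One-sided z-test: is the share of just-significant results
    (2 <= z < 2.6, i.e., p = .05 to .01 two-sided) larger than the
    model predicts? Returns the test statistic and a one-sided p-value."""
    n = len(z_values)
    observed = sum(1 for z in z_values if 2.0 <= z < 2.6) / n
    se = (expected_prop * (1 - expected_prop) / n) ** 0.5
    stat = (observed - expected_prop) / se
    return stat, 1 - N.cdf(stat)
```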

Z-curve did show that the studies lacked evidential value. That is, confidence intervals always included power of 5%, but this information was also provided by the selection model.

Conclusion

The main conclusion from this specific simulation study is that it was necessary to specify a fixed-effect selection model with a single step. This model always converged, always showed evidence of bias, and slightly underestimated the true effect of zero. PET-PEESE performed as well. P-uniform and z-curve did not do well because they rely on significant results and on average only 30% of the results were significant, limiting the sample size to about 30 studies, while the other models could use all 100 studies.

The high percentage of non-significant results also has other implications. First, this simulation of p-hacking is mild. Stronger p-hacking would turn more non-significant results into significant results. The high percentage of non-significant results is also not representative of many research areas in psychology because success rates in psychology journals are 90% or higher. Z-curve and p-uniform were developed to deal with situations of strong bias.

Finally, the simulation of mild p-hacking and no selection bias implies that a meta-analysis that does not assume bias would also produce reasonable estimates. The weighted mean of the observed effect size estimates is d = .03. Thus, p-hacking has a weak influence on the results. A simple correction for bias is to use only the non-significant results. The estimated effect size is d = -.01. This correction has a negative bias because positive and significant estimates are excluded, but the effect is very small. The comparison of the two estimates shows that the true average is close to zero. Of course, using the selection model is better, but the point here is that the simulated condition is not particularly interesting. There was never a lot of bias to correct here. This is very different from real data, where naive meta-analyses show effect sizes of d = .6 and bias-corrected estimates range from 0 to .3 (Carter et al., 2019). Future studies need to examine more severe p-hacking scenarios.

In short, this condition is not particularly interesting, but it also does not pose a challenge for the selection model. Thus, the selection model remains the model to beat, and Simulation 100 (d = 0, tau = 0, & high selection bias) remains the most problematic scenario for the selection model.

Carter et al. (2019): Simulation #100

In this series of blog posts, I am reexamining Carter et al.’s (2019) simulation studies that tested various statistical tools for meta-analysis. The reexamination has three purposes. First, it examines whether methods that claim to detect bias actually do so. Second, I examine the ability of methods that estimate heterogeneity to provide good estimates of heterogeneity. Third, I examine the ability of different methods to estimate the average effect size for three different sets of effect sizes, namely (a) the effect size of all studies, including effects in the opposite direction than expected, (b) all positive effect sizes, and (c) all positive and significant effect sizes.

The reason to estimate effect sizes for different sets of studies is that meta-analyses can have different purposes. If all studies are direct replications of each other, we would not want to exclude studies that show negative effects. For example, we want to know whether a drug has negative effects in a specific population. However, many meta-analyses in psychology rely on conceptual replications with different interventions and dependent variables. Moreover, many studies are selected to show significant results. In these meta-analyses, the purpose is to find subsets of studies that show the predicted effect with notable effect sizes. In this context, it is not relevant whether researchers tried many other manipulations that did not work. The aim is to find the ones that did work, and studies with significant results are the most likely ones to have real effects. Selection for significance, however, can lead to overestimation of effect sizes and underestimation of heterogeneity. Thus, methods that correct for bias are likely to outperform methods that do not in this setting.

I focus on the selection model implemented in the weightr package in R because it is the only tool that tests for the presence of bias and estimates heterogeneity. I examine the ability of these tests to detect bias when it is present and to provide good estimates of heterogeneity when heterogeneity is present. I also use the model result to compute estimates of the average effect size for all three sets of studies (all, only positive, only positive & significant).

The second method is PET-PEESE because it is widely used. PET-PEESE does not provide conclusive evidence of bias, but a positive relationship between effect sizes and sampling errors suggests that bias is present. It does not provide an estimate of heterogeneity. It is also not clear which set of studies is the target of this method. Presumably, it is the average effect size of all studies, but if very few negative results are available, the estimate overestimates this average. I examine whether it produces a more reasonable estimate of the average of the positive results.

P-uniform is included because it is similar to the widely used p-curve method, but has an R package and produces slightly superior estimates. I am using the LN1MINP method that is less prone to inflated estimates when heterogeneity is present. P-uniform assumes selection bias and provides a test of bias. It does not test heterogeneity. Importantly, p-uniform uses only positive and significant results. Its performance has to be evaluated against the true average effect size for this subset of studies rather than all studies that are available. It does not provide estimates for the set of only positive results (including non-significant ones) or the set of all studies, including negative results.

Finally, I am using these simulations to examine the performance of z-curve.2.0 (Bartos & Schimmack, 2022). Using exactly the simulations that Carter et al. (2019) used prevents me from simulation hacking; that is, testing situations that show favorable results for my own method. I am testing the performance of z-curve to estimate average power of studies with positive and significant results and the average power of all positive studies. I then use these power estimates to estimate the average effect size for these two sets of studies. I also examine the presence of publication bias and p-hacking with z-curve.

This blog post examines a rather simple, but important scenario. It assumes that all studies tested a true null hypothesis (i.e., there is no real effect), but high selection bias produces a lot of significant results. Carter et al. (2019) found that this scenario produced problems for the default selection model (3PSM) and that it often falsely rejected the null hypothesis; that is, the selection model had a high type-I error rate. On the other hand, this scenario should be easy for p-uniform and z-curve because they rely only on the significant results and the distribution of p-values or z-scores generated by true null-hypothesis is easy to detect.

Simulation

I focus on a sample size of k = 100 studies to examine the properties of confidence intervals with a large, but not unreasonable set of studies. Smaller sets of studies may benefit the selection model because wider confidence intervals are less likely to produce false positive results. In this scenario, the true population effect size is zero for any set of studies because it is zero in every study.

Figure 1 shows the distribution of the effect size estimates (extreme values below d = -1.2 and above d = 2 are excluded). Selection bias leaves mostly positive results. The presence of some negative results can be explained by the fact that the selection simulation kept negative and statistically significant results.
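The selection mechanism just described, significant results always reported and non-significant ones only sometimes, can be sketched as a filter; the 5% survival rate for non-significant results is an illustrative assumption, not Carter et al.'s exact parameter.

```python
import random

def select_for_significance(z_values, p_keep_nonsig=0.05, rng=None):
    """Keep every significant result (|z| > 1.96, including negative
    ones, as in the simulation) and each non-significant result with
    probability p_keep_nonsig."""
    rng = rng or random
    return [z for z in z_values if abs(z) > 1.96 or rng.random() < p_keep_nonsig]

rng = random.Random(7)
null_zs = [rng.gauss(0.0, 1.0) for _ in range(10_000)]   # true null studies
reported = select_for_significance(null_zs, rng=rng)
sig_share = sum(abs(z) > 1.96 for z in reported) / len(reported)
# most reported results are now significant even though every true effect is zero
```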

Histograms of effect sizes do not show the proportion of significant and non-significant results. This can be examined by computing the ratio of effect sizes over their sampling errors and treating these ratios as approximate z-scores. Alternatively, the sample sizes can be used to compute t-values, which are then converted into z-values via their p-values.

Figure 2 shows the plot of the z-values and the analysis of the z-values with z-curve, using the standard selection model that fits the model to all statistically significant z-values. The plot shows clear evidence of selection bias. Z-curve’s estimate of average power for the significant results is 5% (compared to the true value of 2.5%, because the ERR takes the sign of replication results into account). The confidence interval includes the true value of 2.5%. Thus, the model does not reject the null hypothesis that all significant results are false positives. The average power for all positive results (i.e., the expected discovery rate, EDR) is estimated at 6%, while the true value is 5%. The 95%CI includes the true value. Incidentally, z-curve uses this result to estimate that 89% of the significant results are false positives, but the confidence interval includes the true value of 100%.

The simulation shows the performance of z-curve with just 100 studies and an even smaller set of significant results used for model fitting. Smaller sets of studies will only widen the CIs and are unlikely to change the result that z-curve performs well in this scenario.

Model Specifications

The specification of the z-curve model was explained and justified above.

Carter et al. (2019) observed convergence issues with the default (3PSM) selection model in some conditions. This is one of these conditions. However, the problem is that the 3PSM model assumes heterogeneity and has problems when all population effect sizes are the same. This does not mean that selection models cannot be used with these data. A solution is to fit the random-effects model first and, when it fails, to run the fixed-effects model. I used this approach and obtained usable results in all simulations.

The specification of the selection model remains the same as in the previous simulations and differed from Carter et al.’s default 3PSM model. First, I added steps to model different selection for positive and negative results. Second, I added a step to allow for different selection of non-significant and significant negative results. Third, I added a step to identify p-hacking that produces too many just significant results. The range of just significant results was defined as p-values between .05 and .01, two-sided. The one-sided p-value steps are c(.005,.025,.050,.500,.975).

P-uniform and PET-PEESE do not require additional specifications and were used as usual.

Selection Model (weightr)

The random-effects model failed to converge in 42% of the simulations. This is actually the better result because a convergence failure implies zero heterogeneity, which is the true value. In the other 58% of cases, the random-effects model estimated non-zero heterogeneity.

Selection bias should produce fewer non-significant than significant positive results. The average selection weight for non-significant positive results was .17, average 95%CI = .12 to .22, and 99% of confidence intervals did not include a value of 1. Thus, the model was able to detect selection bias in a set of 100 studies.

The average selection weight for p-values between .005 and .025 (one-sided) was 2.02, average 95%CI = 1.08 to 2.93, and 44% of CIs did not include a value of 1. This means that the model falsely detected p-hacking in 44% of the simulations. Thus, the p-hacking test is not reliable in this condition. The reason will become clear in the next analysis of the effect size estimates.

The average estimate of the mean population effect size was d = .07, average 95%CI = .04 to .10, and only 11% of confidence intervals included the true value of 0. This finding replicates Carter et al.’s (2019) finding that the selection model has a high false positive risk in this scenario. At the same time, the bias in the effect size estimates is small and the model correctly estimates that the true effect size is less than small (d < .20). Only 4% of confidence intervals included a value of .2.

When the random-effects model converged, it produced an average estimate of heterogeneity close to zero, tau = .01, 95%CI = .00 to .07, and 100% of confidence intervals included a value of 0. Based on this result, the model could be respecified as a fixed-effects model, but the small amount of heterogeneity did not affect estimates for subsets of studies with positive or positive and significant results. That is, when there is no heterogeneity, all subsets of population effect sizes have the same value. The problem for the selection model was that it slightly overestimated this value with overly narrow confidence intervals and therefore rejected the true null hypothesis in many simulations. This small bias gives other models an opportunity to outperform the selection model.

PET-PEESE

PET regresses effect sizes on the corresponding sampling errors. The average estimate was d = .00, average 95%CI = -.06 to .06, and 91% of the confidence intervals included the true value of zero. PEESE regresses the effect sizes on the sampling variance (i.e., the squared sampling error). PEESE overestimates the true average effect size even more, average estimated d = .13, average 95%CI = .08 to .17, and none of the confidence intervals included the true value. The overestimation of PEESE is not a problem when PET produces a non-significant result because the PET-PEESE rules state that the PET estimate should be used in that case. However, PET produced 9% false positives, and in these cases the PEESE result would be used and lead to overestimation. This is still better performance than the selection model, but other models might avoid the false positive problem of PET-PEESE.

P-Uniform

P-uniform also has a bias test, but it detected bias in only 73% of the simulations. Thus, the selection model has a better bias test. The average effect size estimate was d = .03, average 95%CI = -.30 to .21, and 96% of the confidence intervals included the true value of 0. Thus, p-uniform beats the selection model and PET-PEESE in this scenario.

Z-Curve

One way to test for selection bias with z-curve is to fit the model to all positive results (non-significant and significant) and to test for an excess of just-significant results (z = 2 to 2.6, ~ p = .05 to .01). This test was significant in 99% of all simulations. This is as good as the bias test of the selection model and better than the bias test of p-uniform.

The p-hacking test fits the model to the “really” significant results (z > 2.6) and compares the frequency of just significant results to the predicted frequency. In this scenario, the test could not be used because there were too few “really” significant results (see Figure 2).

The average estimate of the average power of the positive and significant results was 7%, average 95%CI = 2.5% to 17%, and all confidence intervals included the true value of 2.5%. The estimate for the average power of all positive results was 6%, average 95%CI = 5% to 17%, and all confidence intervals included the true value of 5%. Thus, z-curve correctly showed that the simulated studies lacked evidential value; that is, we cannot reject the hypothesis that the null hypothesis was true in all studies.

Converting the power estimates into effect size estimates, z-curve estimated an average effect size for positive and significant results of d = .10, average 95%CI = .00 to .27, and all confidence intervals included the true value of zero. The same result was obtained for the set of studies with positive results.

Conclusion

The main conclusion from this specific simulation study is that the selection model had an unacceptably high false positive rate. This finding replicates Carter et al.’s (2019) results. I also replicated the convergence problem when the model was specified as a random-effects model. However, I was able to fix this problem by using a fixed effects model. Moreover, the effect size estimate of the selection model had only a small positive bias. The model also correctly detected selection bias, but overestimated p-hacking. Thus, the selection model did not perform badly, but it showed some problems in this scenario.

All other methods did better in correcting for selection bias and showing that the 100 studies in a simulation were produced without a real effect. The best performance was obtained with z-curve, which also showed clear evidence of selection bias.

These results do confirm Carter et al.’s (2019) conclusion that no single method performs best in all situations. However, this does not mean that methods can produce wildly different results. The selection model only showed a small positive bias, and other methods showed that the data lacked evidential value.

This is very different from Carter et al.’s (2019) meta-analysis of the ego-depletion literature, where only PET-PEESE suggested lack of evidential value, but the selection model and p-curve suggested that at least some studies had moderate positive effect sizes. Given the present simulation results, it would be questionable to accept the PET-PEESE results and dismiss the results of the other methods, especially the selection model. The selection model performs well when studies have evidential value and shows only a small bias in this simulation. Thus, an average effect size estimate of d = .3 is unlikely to be entirely due to bias, and it is more likely that the PET result is an underestimation of the true average effect.

Although I have carefully avoided promoting z-curve so far, I do believe that this simulation shows the value of complementing an effect size meta-analysis with the selection model with a z-curve analysis of power. Taking both methods together, researchers would correctly infer that the data do not have evidential value and that all studies could have been obtained without a true effect in any one of the studies. As I am one of the parents of z-curve, this advice should be taken with a grain of salt, but there is little harm in adding one more method to a multiverse meta-analysis and showing that all methods produce robust evidence that the data under investigation are far from robust.

A Deep Dive Into Carter et al.’s (2019) Evaluation of Meta-Analytic Methods

This is mainly a landing page for individual blog posts that examine specific conditions in Carter et al.’s extensive and influential simulation study that varied (1) the size of study sets (k), (2) the average effect size (d), (3) heterogeneity of effect sizes (tau), (4) the presence of selection bias, and (5) p-hacking.

In my reexamination of this simulation study, I focus on the selection model implemented in the weightr package in R. Carter et al. (2019) used the default implementation of this model known as the 3 parameter selection model (3PSM). I examine whether modifications of the model increase its performance. Most important is the modeling of p-hacking with a bin for just significant results between p = .005 and .025 (one-tailed), which corresponds to two-tailed p-values between .01 and .05. P-hacking produces more of these just significant results than we would expect based on the true power of studies and selection bias.

The first four simulations simulated a small average effect size (d = .2) with large heterogeneity (tau = .4) in a set of k = 100 “published” studies. With a normal distribution of population effect sizes, this implies that 95% of the effect sizes are in the range from d = -.6 to d = 1. I used either no or high p-hacking and selection bias as implemented in Carter et al.’s (2019) simulation code. This produced a 2 x 2 design with no or high p-hacking crossed with no or high selection bias. Aside from the selection model, I evaluated PET-PEESE, p-uniform, and z-curve.

Studies 1-4:

The main finding from this limited set of simulations was that the improved selection model performed well, doing as well as or better than the other methods.

Simulation #296: No Selection Bias, No P-Hacking

Simulation #328: No Selection Bias / P-Hacking

Simulation #392: High Selection Bias / No P-Hacking

Simulation #424: High Selection Bias / High P-Hacking

Study 5

Here I examined simulation #100. This simulation assumes that the null hypothesis is true in all studies. It simulates high selection bias and no p-hacking. Replicating Carter et al.’s (2019) results, I find a high false positive rate for the selection model. However, the bias is small. Nevertheless, z-curve outperforms the selection model and the other methods in this simulation. Thus, it is useful to complement effect size meta-analysis with the selection model with z-curve analysis, especially when the average effect size estimate and heterogeneity are small.

Simulation #100: High Selection Bias / No P-Hacking

Carter et al. (2019): Simulation #392

In this series of blog posts, I am reexamining Carter et al.’s (2019) simulation studies that tested various statistical tools to conduct a meta-analysis. The reexamination has three purposes. First, it examines whether methods that claim to detect biases actually do so. Second, I examine the ability of methods that estimate heterogeneity to provide good estimates of heterogeneity. Third, I examine the ability of different methods to estimate the average effect size for three different sets of effect sizes, namely (a) the effect size of all studies, including effects in the opposite direction than expected, (b) all positive effect sizes, and (c) all positive and significant effect sizes.

The reason to estimate effect sizes for different sets of studies is that meta-analyses can have different purposes. If all studies are direct replications of each other, we would not want to exclude studies that show negative effects. For example, we want to know whether a drug has negative effects in a specific population. However, many meta-analyses in psychology rely on conceptual replications with different interventions and dependent variables. Moreover, many studies are selected to show significant results. In these meta-analyses, the purpose is to find subsets of studies that show the predicted effect with notable effect sizes. In this context, it is not relevant whether researchers tried many other manipulations that did not work. The aim is to find the ones that did work, and studies with significant results are the most likely ones to have real effects. Selection for significance, however, can lead to overestimation of effect sizes and underestimation of heterogeneity. Thus, methods that correct for bias are likely to outperform methods that do not in this setting.

I focus on the selection model implemented in the weightr package in R because it is the only tool that tests for the presence of bias and estimates heterogeneity. I examine the ability of these tests to detect bias when it is present and to provide good estimates of heterogeneity when heterogeneity is present. I also use the model result to compute estimates of the average effect size for all three sets of studies (all, only positive, only positive & significant).

The second method is PET-PEESE because it is widely used. PET-PEESE does not provide conclusive evidence of bias, but a positive relationship between effect sizes and sampling errors suggests that bias is present. It does not provide an estimate of heterogeneity. It is also not clear which set of studies is the target of this method. Presumably, it is the average effect size of all studies, but if very few negative results are available, the estimate overestimates this average. I examine whether it produces a more reasonable estimate of the average of the positive results.

P-uniform is included because it is similar to the widely used p-curve method, but it has an R package and produces slightly superior estimates. I am using the LN1MINP method that is less prone to inflated estimates when heterogeneity is present. P-uniform assumes selection bias and provides a test of bias. It does not test heterogeneity. Importantly, p-uniform uses only positive and significant results. Its performance has to be evaluated against the true average effect size for this subset of studies rather than all studies that are available. It does not provide estimates for the set of only positive results (including non-significant ones) or the set of all studies, including negative results.

Finally, I am using these simulations to examine the performance of z-curve.2.0 (Bartos & Schimmack, 2022). Using exactly the simulations that Carter et al. (2019) used prevents me from simulation hacking; that is, testing situations that show favorable results for my own method. I am testing the performance of z-curve to estimate average power of studies with positive and significant results and the average power of all positive studies. I then use these power estimates to estimate the average effect size for these two sets of studies. I also examine the presence of publication bias and p-hacking with z-curve.

The present blog post examines another cell of a 2 x 2 design that varies publication bias and p-hacking: the condition with high publication bias and no p-hacking (Simulation #392). Like the other simulations (Simulation #296, Simulation #424, Simulation #328), I simulated 100 studies with a small true average population effect size, d = .2, and large heterogeneity, tau = .4. In the other simulations, a properly specified selection model performed well, and no other method performed better. The simulation of selection bias alone without p-hacking should favor models that assume selection bias and correct for it, like the selection model, p-uniform, and z-curve. The main challenge for the selection model is that this simulation leaves mostly positive and significant results to be analyzed. Methods like p-uniform and z-curve were designed for scenarios like this. In contrast, the selection model uses information from non-significant results. Thus, it might not perform as well in this simulation.

Simulation

I focus on a sample size of k = 100 studies to examine the properties of confidence intervals with a large, but not unreasonable set of studies. Simulation #392 has the same mean and standard deviation of the true population effect sizes as the other simulations in the 2 x 2 design that explores the effects of publication bias and p-hacking: a small mean effect size of d = .2 and large heterogeneity, tau = .4. The curve in Figure 1 shows the distribution of true population effect sizes. 95% of these effect sizes are in the range from -.6 to 1.

The histogram in Figure 1 shows the distribution of population effect sizes in the studies that are “published,” that is, studies that are available for the meta-analysis. It shows that selection bias changes the population of studies. Selection for significance implies selection for larger effect sizes because studies with larger effect sizes have a higher probability to produce significant results. This means that it is important to distinguish between sets of studies. Should a meta-analysis estimate the original average for all studies, d = .2, only the average of studies with positive results, or the average of studies with positive or significant results?
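The selection mechanism described above can be sketched in a few lines. This is an illustrative Python sketch (the tutorial's own code is in R, and Carter et al.'s actual simulation code differs in its details); the function name, the per-group sample size of 50, and the 90% file-drawer rate for non-significant results are assumptions:

```python
import random
from statistics import mean

random.seed(392)

def simulate_published(k=100, mu=0.2, tau=0.4, n_per_group=50, file_drawer=0.9):
    """Simulate k "published" studies under selection for significance.
    Each study draws a true effect from N(mu, tau); the observed d adds
    sampling error. A non-significant result survives only with
    probability 1 - file_drawer."""
    se = (2 / n_per_group) ** 0.5          # approximate SE of d for two equal groups
    published = []
    while len(published) < k:
        d_true = random.gauss(mu, tau)
        d_obs = random.gauss(d_true, se)
        if abs(d_obs / se) > 1.96 or random.random() > file_drawer:
            published.append(d_true)
    return published

studies = simulate_published()
# Selection shifts the published population of true effects above d = .2.
print(round(mean(studies), 2))
```

Running this shows the key point of Figure 1: the true effects behind the published studies have a mean well above the population mean of d = .2.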

Figure 2 shows the distribution of the effect size estimates (extreme values below d = -1.2 and above d = 2 are excluded). Selection bias leaves mostly positive results. The presence of some negative results can be explained by the fact that the selection simulation kept negative and statistically significant results.

Histograms of effect sizes do not show the proportion of significant and non-significant results. This can be examined by computing the ratio of effect sizes over sampling errors and treating these ratios as approximate z-scores. Alternatively, the sample sizes can be used to compute t-values, and the corresponding p-values can be used to convert the t-values into z-values.
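Both conversions are simple enough to show in a few lines. This is a minimal Python sketch (the tutorial itself uses R); `approx_z` and `p_to_z` are hypothetical helper names:

```python
from statistics import NormalDist

norm = NormalDist()

def approx_z(d, se):
    """Treat the ratio of an effect size to its sampling error
    as an approximate z-score."""
    return d / se

def p_to_z(p_two_sided):
    """Convert a two-sided p-value into the absolute z-value it implies."""
    return norm.inv_cdf(1 - p_two_sided / 2)

# d = .5 with SE = .2 gives z = 2.5; p = .05 maps back to the familiar 1.96.
print(round(approx_z(0.5, 0.2), 2), round(p_to_z(0.05), 2))  # 2.5 1.96
```

The second route (test statistic to p-value to z-value) is exact for any test statistic whose p-value is known; the first is only an approximation when the test statistic is a t-value with few degrees of freedom.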

Figure 3 shows the plot of the z-values and the analysis of the z-values with z-curve, using the standard z-curve model that is fitted to all statistically significant z-values. 103 results with significant negative effects are excluded, leaving 4,897 studies with positive effects. Selection bias leads to an observed discovery rate (ODR, i.e., the percentage of significant positive results) of 91%, while the average true power is estimated at just 42% (it is really 40% based on the true population effect sizes and sample sizes). With 5,000 cases, z-curve can easily show the presence of selection bias because the expected discovery rate predicts at most 54% significant results, whereas 91% are significant. The figure shows the predicted missing non-significant studies (i.e., the proverbial file-drawer) with the dotted red line.

Z-curve also correctly estimates the average power of only the significant results, which is 66% based on the true population effect sizes of the studies with significant results. In short, a z-curve plot can reveal that the percentage of significant results in a dataset is too high. However, this excess of significant results could be caused by selection bias or p-hacking.

To distinguish selection bias and p-hacking, z-curve can be fitted to only “really” significant results with z-values greater than 2.6. P-hacking would produce more just significant results (2 to 2.6, p = .05 to .01 two-sided) than the model predicts. The results of this test are presented in Figure 4.

In this particular simulation, there are actually fewer just significant results than the model predicts. The p-value for the binomial test is not significant, p = .9883. As no p-hacking was simulated, this is the correct result (i.e., a true negative result). The simulation study examines the false positive rate of this test when 100 studies are simulated.
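The logic of this binomial test can be sketched with a one-sided exact binomial computation. This is a Python illustration of the test's logic, not the zcurve3.0 implementation; the counts and the predicted proportion below are made up for demonstration:

```python
from math import comb

def excess_significance_p(k_obs, n, p_pred):
    """One-sided binomial p-value for observing k_obs or more just
    significant results out of n, given the proportion p_pred that
    the model fitted to z > 2.6 predicts for the 2-to-2.6 range."""
    return sum(comb(n, k) * p_pred**k * (1 - p_pred)**(n - k)
               for k in range(k_obs, n + 1))

# Fewer just significant results than predicted -> a large p-value,
# i.e., no evidence of p-hacking (a true negative in this scenario).
p = excess_significance_p(k_obs=20, n=100, p_pred=0.30)
print(round(p, 4))
```

With 20 observed just significant results where the model predicts 30, the p-value is close to 1, mirroring the non-significant result (p = .9883) reported above.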

Simulation #328 demonstrated that it is difficult for z-curve to detect and correct for p-hacking when p-hacking is used. The choice made there was to accept that p-hacking leads to an underestimation of power (a p-hacking penalty) and to use the results of the default model that selects for significance. Thus, even if the p-hacking test in this simulation had produced a significant result, it would not change the z-curve model, and the default model from Figure 3 was used.

Model Specifications

The specification of the z-curve model was explained and justified above.

The specification of the selection model is easier because it can take selection bias and p-hacking into account in a single model. I used the same approach that I used for the other simulations. I modeled selection bias with the standard approach; that is, a step at .025 (one-tailed) to distinguish positive and significant results from all other results. In addition, I added a step at .005 (one-tailed). This models p-hacking by allowing an excess of p-values between .05 and .01 (two-tailed). The selection weight for this range of p-values is used as a test of p-hacking. In addition, the model had steps at .5 to distinguish positive and negative effects, and .975 to distinguish negative non-significant and negative significant results. The selection weight for positive non-significant results is used as a test of selection bias, but we already saw that p-hacking alone can reduce the number of non-significant results because p-hacking moves these p-values into the range of just significant results. Thus, it is difficult to detect selection bias when evidence of p-hacking is present.

The key advantage of the selection model is that it does not rely on significance testing to specify models with or without bias. It will adjust the estimates of the average and standard deviation of the effect sizes based on the estimated amount of bias. The following simulation examines how well this adjustment works when only 100 studies are available to fit the model.

P-uniform and PET-PEESE do not require any model specification choices and were used as usual.

Selection Model (weightr)

Carter et al. (2019) observed convergence issues with the selection model in some conditions. In my simulation the model produced usable estimates 100% of the time. The test of selection bias showed an average weight for non-significant positive results of .08, average 95%CI = -.02 to .13, and all confidence intervals excluded a value of 1. Thus, the model correctly identified selection bias. The average weight for just significant results was close to 1, average weight = 1.11, average 95%CI = 0.59 to 1.62, and only 1% of the CIs did not include a value of 1. Thus, the model correctly identified that bias was produced by selection rather than p-hacking.

Although the model detected selection bias, it did not fully adjust for it. The average effect size estimate for all studies was d = .26, average 95%CI d = .17 to .34, and only 49% of the CIs included the true value of d = .20. However, the bias is small. The model also slightly underestimated heterogeneity, average tau = .38, 95%CI = .27 to .46, and only 77% of the confidence intervals included the true value of .40. Again, however, the bias is small.

The estimated effect size for positive results was d = .41, average 95%CI = .29 to .52 and 85% of confidence intervals included the true value of d = .38. The estimated effect size for positive and significant results was d = .57, average 95%CI = .43 to .69, and 93% of the confidence intervals included the true value of d = .58. The performance of the model improves because most studies were positive and significant, and it is easier to estimate the average of studies that are available than to predict averages for sets of studies that are missing.

In short, the performance of the selection model was again good. It correctly distinguished selection from p-hacking, and the biases in the effect size estimates and in the estimate of heterogeneity were small. The good performance of the model makes it difficult for other models to do better.

The present results are in line with Carter et al.’s (2019) results based on the default selection model that also performed well in this condition. The key differences are limited to conditions that simulate p-hacking. Thus, the present results merely show that the extension of the model to allow for p-hacking did not reduce the performance of the selection model.

PET-PEESE

PET regresses effect sizes on the corresponding sampling errors. The average estimate was d = .33, average 95%CI = .22 to .43, and only 43% of the confidence intervals included the true parameter, d = .20. Even if this is considered a small bias, the bias is larger than the bias of the selection model. More problematic is that PET is assumed to underestimate true effect sizes when the estimated effect sizes are positive and significant. In this case, researchers are advised to use PEESE.

PEESE regresses the effect sizes on the sampling variance (i.e., the squared sampling error). PEESE overestimates the true average effect size even more, average d = .42, average 95%CI = .36 to .50, and only 3% of the confidence intervals included the true value.
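Both regressions are ordinary weighted least squares with inverse-variance weights, and the quantity of interest is the intercept (the predicted effect at zero sampling error). This is a self-contained Python sketch of that core computation, not the published R implementation, using made-up toy data in which effect sizes rise with sampling error (the classic small-study pattern, with a true effect of zero at SE = 0):

```python
def wls_intercept(y, x, w):
    """Weighted least-squares intercept for y = b0 + b1*x with weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return my - b1 * mx

def pet_peese(d, se):
    """PET: regress d on SE; PEESE: regress d on SE^2.
    Both use inverse-variance weights 1/SE^2."""
    w = [1 / s**2 for s in se]
    pet = wls_intercept(d, se, w)
    peese = wls_intercept(d, [s**2 for s in se], w)
    return pet, peese

# Toy data with d exactly proportional to SE: PET recovers an intercept
# of zero, while PEESE's intercept is pulled above zero.
d_obs  = [0.1, 0.2, 0.4, 0.6]
se_obs = [0.05, 0.1, 0.2, 0.3]
pet, peese = pet_peese(d_obs, se_obs)
print(round(pet, 3), round(peese, 3))
```

The toy example illustrates why PEESE overestimates when the true effect is small: fitting the quadratic predictor to a linear small-study effect leaves a positive intercept.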

These findings also confirm Carter et al.’s (2019) findings: “PET-PEESE methods all demonstrated slight increases in upward bias under heterogeneity… In contrast, adding heterogeneity did not increase the bias of the 3PSM estimator” (p. 132). Thus, this is another condition in which the selection model beats PET-PEESE, and PET-PEESE would have to demonstrate superior performance in as-yet unexplored conditions to justify its use over the selection model.

P-Uniform

P-uniform detected bias in 81% of the simulations. This is lower than the 100% success rate of the selection model. Moreover, the selection model was able to distinguish selection and p-hacking. This makes the selection model a better tool to investigate bias in this simulated scenario.

P-uniform underestimated the average effect size of studies with positive and significant results, average d = .42, average 95%CI = .37 to .47. Only 2% of CI included the true value of d = .58. Thus, p-uniform performs worse than the selection model, even though it is designed for this scenario.

Z-Curve

Z-curve’s primary purpose is to estimate average power, but as shown above, it can be used to examine bias and p-hacking. Z-curve detected bias in 83% of simulations, which is not as good as the performance of the selection model. It also falsely detected p-hacking in 18% of simulations, although p-hacking was not simulated. Thus, the selection model wins.

Consistent with many validation simulations, z-curve’s estimate of average power for positive and significant results that is called the expected replication rate was very good, average ERR = 68%, average 95%CI = 53% to 82%, and the true value of 69% was included in 99% of the confidence intervals.

The average estimate of the average power for the set of positive results (i.e., the expected discovery rate) was 42%, average 95%CI = 13% to 75%, but the confidence interval with only 100 studies is wide. Moreover, only 82% of the confidence intervals included the true value of 40%. Thus, power estimation for the EDR is more uncertain than the nominal 95% confidence intervals suggest.

To compare these results with the selection model, the power estimates can be converted into effect size estimates given the sample sizes of studies with significant results. The average estimate for the set of positive and significant results is d = .52, average 95%CI = .44 to .63, and 81% of the confidence intervals include the true value of d = .58. This is not as good as the estimates from the selection model, average d = .57, average 95%CI = .43 to .69, and 93% of the confidence intervals included the true value of d = .58. Thus, there is no advantage of using z-curve for effect size estimation. The only additional information that it provides is the average power.
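The conversion from a power estimate to an effect size estimate can be illustrated with the usual normal-approximation power formula for a two-sample design. This is an assumption made for illustration, not necessarily the exact conversion z-curve uses; the per-group sample size of 50 is likewise hypothetical:

```python
from statistics import NormalDist

norm = NormalDist()

def power_to_d(power, n_per_group, alpha=0.05):
    """Invert the normal-approximation power formula for a two-sample test:
    power ~ Phi(d / SE - z_crit), with SE = sqrt(2 / n_per_group).
    Ignores the tiny probability of significance in the wrong direction."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5
    return (z_crit + norm.inv_cdf(power)) * se

# A power estimate of 69% with n = 50 per group implies d of roughly .5.
print(round(power_to_d(0.69, n_per_group=50), 2))
```

Because the conversion depends on the sample sizes, the same power estimate maps to a smaller d when studies are larger; in the simulation the actual per-study sample sizes are used.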

For the set of positive results, z-curve’s average effect size estimate is d = .38, average 95%CI = .16 to .57, and 94% of confidence intervals included the true value of d = .40. While this looks good, it is not better than the performance of the selection model.

Conclusion

The main conclusion from this specific simulation study is that the selection model does well in a simulation with high selection bias. It correctly detects selection bias in all simulations and shows that excessive significant results are produced by selection rather than p-hacking. It slightly overestimates the average effect size for all studies, but it does well in the estimation of the average effect size of all positive studies and all positive and significant studies.

This is the fourth and final simulation of a scenario with 100 studies in the meta-analysis, a small average effect size, d = .2, and large heterogeneity, tau = .4, and the selection model always performed as well as or better than the other methods. Thus, it remains the model to beat in other scenarios.

It seems unlikely that Carter et al.’s (2019) simulations include conditions that are problematic for the selection model. Namely, it is not clear why milder p-hacking or selection would create problems. It is also not clear why less heterogeneity would be a problem for the model. While I will conduct more replications of Carter et al.’s simulations, I think it is more interesting to examine scenarios that were not included in their design. In their simulations, the selection model benefitted from the simulation of heterogeneity with a normal distribution. This distribution matches the assumption of the selection model. Simulations that change this distribution assumption are needed to really test the robustness of the selection model.

The main preliminary conclusion is that meta-analyses should always include results with a properly specified selection model that allows for (a) different biases for positive and negative results, (b) p-hacking, and (c) overrepresentation of marginally significant results (p < .10, two-sided). Marginally significant results were not modeled here because the simulations used a strict alpha criterion of .05 to model p-hacking and selection bias. However, in real data, overrepresentation of these results is likely. The proper specification of steps for one-sided p-values is therefore c(.005, .025, .05, .5, .95, .975, 1). Compared to the default model with a single step at .025, this model has 6 more steps and can be called the 9PSM. Some steps can be omitted if a table of the p-value clusters shows zero frequencies, but the model often converges even with zero frequencies and fixes the weight to .01.
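To make the step specification concrete, the vector can be used to cluster one-sided p-values into the model's intervals. This is a small Python illustration, assuming the step vector c(.005, .025, .05, .5, .95, .975, 1) with inclusive upper bounds, i.e., reading the fourth step as .5 (the boundary between positive and negative results):

```python
from bisect import bisect_left
from collections import Counter

# One-sided p-value steps; each entry is the (inclusive) upper bound
# of one selection-model interval.
STEPS = [0.005, 0.025, 0.05, 0.5, 0.95, 0.975, 1.0]

def p_interval(p_one_sided):
    """Index of the interval containing a one-sided p-value; e.g.,
    interval 1 covers (.005, .025], the "just significant" bin."""
    return bisect_left(STEPS, p_one_sided)

# Tabulate interval frequencies for a handful of one-sided p-values;
# zero-frequency intervals are the ones whose steps could be omitted.
ps = [0.001, 0.010, 0.030, 0.200, 0.970, 0.999]
print(Counter(p_interval(p) for p in ps))
```

A frequency table like this is a quick way to check, before fitting, which intervals are empty and might need to be dropped or tolerated with a fixed weight.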