Wu, C., Zhang, C., Li, X., & Astikainen, P. (2025). The impact of emotional face distractors on working memory performance: a meta-analysis of behavioural studies. Cognition and Emotion, 1–22. https://doi.org/10.1080/02699931.2025.2568559
Introduction
In a recent critique of z-curve (under review), some authors (who shall remain anonymous at this point) claim that z-curve performs poorly, based on a few simulations with homogeneous data and small sets of studies. These simulations ignore the fact that z-curve has been validated in extensive simulation studies that are open, reproducible, and have passed peer review, including review by critics like Erik van Zwet (Bartoš & Schimmack, 2022; Brunner & Schimmack, 2020). Z-curve 3.0 also addresses the problem of small, homogeneous datasets by estimating heterogeneity and adapting the model accordingly — using fewer components when the data are homogeneous and the default components create estimation problems. There is nothing new in these simulation studies.
A second criticism is that z-curve does not work well for standard meta-analyses, which often have very few significant results to fit a model. This is also irrelevant because z-curve does not aim to estimate an average effect size from a set of close replication studies. Z-curve is explicitly designed to examine the credibility of heterogeneous sets of studies that test different hypotheses.
The only reason to mention this critique is that it highlights a broader problem. Many meta-analyses in psychology do not increase the precision of effect size estimates by combining direct replication studies. Instead, they combine studies that investigate a loosely defined research question with different specific paradigms. These meta-analyses often have considerable heterogeneity, and it is unclear what we learn from estimating the average effect size of such conceptual replications. With large heterogeneity, some studies may show positive effects and others negative effects. Unless we can identify which studies produce which effects, we have not really learned anything from these studies.
Mindless Meta-Analysis of Mindless Mini-Paradigms
The manuscript uses a meta-analysis of emotional faces to claim that z-curve produces unreliable and misleading results when it is used to analyze the data (Wu et al., 2025). Below, I show that z-curve provides exactly the same information as other state-of-the-art meta-analytic methods and that the main conclusion is that 51 effects from 37 studies provide no scientific evidence about the influence of emotion on attention and working memory.
The main claim examined in this meta-analysis is that “task-irrelevant emotional faces may affect working memory (WM) performance by involuntarily capturing attention” (Wu et al., 2025). Let me briefly state that I have actually conducted studies on the relationship between attention and emotional stimuli (Schimmack, 2005) and that emotional faces are unlikely to trigger emotional responses that can interfere with cognitive tasks. The key determinant is arousal level, and looking at a facial expression of an emotion does not produce a strong arousal response.
Wu et al. found 51 behavioral effects from 37 studies. This means the data are nested because some studies included several emotional expressions that could be compared to a control condition. This is not a problem for z-curve or other meta-analytic methods because all methods can be combined with a clustered bootstrap approach to produce confidence intervals that take the nested structure of the data into account.
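To make the logic of a clustered bootstrap concrete, here is a minimal sketch that resamples whole studies rather than individual effects. The inverse-variance average and the variable names (g, se, cluster) are illustrative assumptions on my part, not the exact estimator used in the analyses below, but the resampling logic is the same for any estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

def pooled_estimate(g, se):
    """Inverse-variance weighted average; a stand-in for any meta-analytic estimator."""
    w = 1.0 / se**2
    return np.sum(w * g) / np.sum(w)

def clustered_bootstrap(g, se, cluster, n_boot=5000):
    """Resample whole studies (clusters), keeping all effects within a drawn study."""
    labels = np.unique(cluster)
    est = np.empty(n_boot)
    for b in range(n_boot):
        drawn = rng.choice(labels, size=len(labels), replace=True)
        idx = np.concatenate([np.flatnonzero(cluster == c) for c in drawn])
        est[b] = pooled_estimate(g[idx], se[idx])
    return np.percentile(est, [2.5, 97.5])  # 95% CI that respects the nesting

# Toy example: two effects from study 1, one effect each from studies 2 and 3.
g = np.array([-0.30, -0.10, 0.05, 0.20])
se = np.array([0.15, 0.15, 0.20, 0.25])
cluster = np.array([1, 1, 2, 3])
print(clustered_bootstrap(g, se, cluster))
```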
Figure 1 shows the distribution of the 51 effect sizes.

Notably, most results are close to zero. Wu et al. estimated an average effect size of -.04 standard deviations with a 95% confidence interval ranging from -.13 to .04. One might consider this interval narrow enough to conclude that the meta-analysis provides evidence against any notable effect (minimum effect size of interest, d = .2).
However, the studies are not close replications of the same mini-paradigm. Thus, the more interesting question is whether there is heterogeneity, with some studies showing inhibition effects and others, like the study with the g = 1.60 effect size estimate, showing positive effects.
Wu et al.’s heterogeneity tests showed evidence of heterogeneity, p < .001, and they estimated that population effect sizes could range from g = -.47 to .39. Thus, it would be false to conclude that the null hypothesis is true. Instead, effects vary in unpredictable ways across studies.
I first used Vevea and Woods's (2005) selection model to estimate the prediction interval with a method that takes selection bias into account, again using a clustered bootstrap to account for the nested structure of the data. The prediction interval was a bit wider on the positive side, PI = -.49 to .84, presumably because the single large positive value gets more weight in an analysis based on 37 clusters rather than 51 individual effects. There was no evidence of selection bias.
Z-curve differs from Vevea and Woods's selection model in several ways. One important difference is that it does not use non-significant results, based on the assumption that their reporting is selective and that only the significant results provide unbiased information. The 51 effect sizes include 7 significant results with a negative sign and 4 with a positive sign. Ignoring the sign, there are only 11 significant results, which happen to come from 10 independent studies. This is the absolute minimum for a z-curve analysis, and users are warned that larger sets of studies are needed for meaningful z-curve analyses.
To run a z-curve analysis, a simple approach is to divide each effect size by its sampling error and treat the ratios as approximate z-values. A z-curve plot shows the distribution of the absolute z-scores (Figure 2).

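For readers who want to reproduce this step, the conversion is a one-liner. The numbers below are toy values for illustration, not data from Wu et al. (2025).

```python
import numpy as np
from scipy import stats

# Toy effect sizes (Hedges' g) and sampling errors, for illustration only.
g = np.array([-0.35, 0.02, -0.10, 0.55])
se = np.array([0.15, 0.20, 0.12, 0.18])

z = np.abs(g / se)                    # ratios treated as approximate z-scores
crit = stats.norm.ppf(1 - 0.05 / 2)   # two-sided criterion, z = 1.96
print(z, z > crit)                    # which effects count as significant
```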
Z-curve 3.0 first fits a simple one-component model with a free standard deviation to the data to examine heterogeneity. The results show that the non-centrality parameters (ncp; not the observed z-values) are heterogeneous (sd_ncp = 2.51, 95% CI = 0.057 to 5.50).
The Observed Discovery Rate (ODR) is the percentage of significant results in this set of studies. The Expected Discovery Rate (EDR) is the probability of obtaining a significant result across all studies that were conducted, whether reported or not. The EDR of 10% is less than the ODR of 22%, but with only 11 significant results, z-curve cannot produce a reliable estimate of the EDR. Therefore, the confidence interval ranges from the minimum of 5% to the maximum of 100%. This is not a problem of z-curve. This result merely shows that the 11 significant results provide no information about the true discovery rate. There may be bias or there may not be bias. Even Vevea and Woods's model, which could use the non-significant results, was unable to say whether bias is present. The point estimate of the EDR of 10% still allows for 49% false positive results, but it is also possible that all significant results are true positives with opposite signs.
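As far as I can tell from the published z-curve work (Bartoš & Schimmack, 2022), the translation from the EDR to a maximum rate of false positives uses Soric's (1989) upper bound. The sketch below shows the calculation; with the rounded EDR of 10% it gives about 47%, and the 49% reported above presumably reflects the unrounded point estimate.

```python
# Soric's (1989) upper bound on the false discovery rate implied by a discovery rate.
def max_false_discovery_rate(edr, alpha=0.05):
    return (1.0 / edr - 1.0) * alpha / (1.0 - alpha)

print(round(max_false_discovery_rate(0.10), 2))   # 0.47 with the rounded 10% EDR
```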
The Expected Replication Rate (ERR) is the average probability that an exact replication of a significant result with the same sample size produces a significant result again. The ERR of 52% seems encouraging, but given heterogeneity, this average is based on studies with low and high power. Studies with p-values just below .05 (z > 1.96) have a low probability of a successful replication. For z-values between 2 and 4, the probability of a successful replication is 48%, 95% CI = .13 to .83. Only three studies with z-values above 4 have a high probability of producing a significant result again, 98%, 95% CI = .80 to 1.00.
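The power calculation behind these replication probabilities is the standard one. The sketch below computes it for a single study with a known noncentrality parameter (the true mean of the z-statistic); the ERR itself averages this probability over the noncentrality distribution that z-curve estimates, so this is a conceptual illustration rather than the z-curve algorithm, and the ncp values are arbitrary examples.

```python
from scipy import stats

def replication_probability(ncp, alpha=0.05):
    """Probability that an exact replication is significant (two-sided test),
    given the true mean of the z-statistic (noncentrality parameter)."""
    crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(crit - ncp) + stats.norm.cdf(-crit - ncp)

for ncp in (2.0, 4.5):
    print(ncp, round(replication_probability(ncp), 2))   # 0.52 and 0.99
```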
In short, despite the small number of significant results, z-curve results are consistent with other meta-analytic results. The data provide no information about the influence of facial expressions of emotions on working memory because (a) most studies had insufficient power to detect small effects, (b) different paradigms produce different effects, and (c) heterogeneity in low-powered studies makes it impossible to identify conditions that produce real effects.
Inspection of the individual studies clearly shows that Gonzalez-Garrido's (2015) positive effect of g = 1.6 is an outlier. In this study, the emotional faces were target stimuli, not distractors. Thus, the study should not have been included in a meta-analysis that examines the influence of incidental facial expressions as distractors. This leaves only two significant negative effects by Stout (2015, 2017). Whether these are real effects requires direct replications. Neither z-curve nor other meta-analytic methods can answer this question.
How Can There Be 78% Non-Significant Results?
The most surprising aspect of Wu et al.'s (2025) meta-analysis is that only 22% of the effect sizes were significant. It is well known that psychology articles typically report significant results with success rates of 90%. Wu et al.'s (2025) meta-analysis sheds light on why non-significant results are more frequent in meta-analyses than in original articles.
First, meta-analyses often do not use the same estimates of sampling error as the original studies. All of the studies in this meta-analysis were within-subject designs with many repeated trials. In these designs, stable differences between participants can be estimated and removed from the error variance. This reduces sampling error and increases the power of studies to obtain significant results for true effects. In contrast, the meta-analysis treated these studies as if they were between-subject designs, which have low power with n = 10 to 20 per cell. Thus, many of the non-significant results in the meta-analysis may have been significant in the original articles. This also means that the meta-analytic estimates are more variable than they need to be. Meta-analysts should use the actual sampling errors of studies to avoid this problem.
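A back-of-the-envelope calculation shows how large this difference can be. For a paired design with correlation r between repeated measures, the sampling error of a standardized mean difference is roughly sqrt(2(1 - r)/n), compared to sqrt(2/n) when the same data are treated as independent groups. The r = .7 below is an assumed but plausible value for stable performance measures, not a value from the meta-analysis.

```python
import numpy as np

n, r = 20, 0.7                              # assumed sample size and retest correlation
se_between = np.sqrt(2.0 / n)               # treated as an independent-groups design
se_within = np.sqrt(2.0 * (1.0 - r) / n)    # paired design, between-person variance removed
print(round(se_between, 3), round(se_within, 3))   # 0.316 vs. 0.173
```

With r = .7, the within-subject sampling error is almost half the size, which nearly doubles the z-values and explains how results that were significant in the original articles can look non-significant in the meta-analysis.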
The second reason for the emergence of non-significant results is that studies have multiple dependent variables. Here, performance scores and reaction times were often used. Some studies reported significant reaction-time effects, while the performance scores did not differ significantly. Moreover, the main focus of these studies was on neuroimaging measures (e.g., EEG, fMRI), so even non-significant results for behavioral measures may have been publishable as long as some brain correlates were significant. Z-curve is designed to detect biases in the reporting of focal tests. If secondary results are often reported even when they are not significant, a standard meta-analysis that uses significant and non-significant results is less biased.
Conclusion
In conclusion, standard effect-size meta-analyses differ in important ways from z-curve analyses. Standard meta-analyses aim to obtain a precise estimate of an average effect size for a set of similar studies. In contrast, z-curve examines the credibility of individual claims across a set of studies that test different hypotheses. These literature-wide credibility assessments are needed to identify problematic literatures like the one examined in Wu et al.'s meta-analysis, where 37 published studies using different paradigms produced no scientific insights into the influence of emotion on working memory. Given the lack of robust findings in the behavioral measures, it is unclear what we can learn from neural correlates of null effects. My own work on attention and emotion suggests that highly arousing stimuli are necessary to attract attention and influence working memory (Schimmack, 2005). Standardized photographs of facial expressions, presented repeatedly across hundreds of trials, are unlikely to meet this threshold.