Dr. Ulrich Schimmack’s Blog about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the first study using the same sample size and significance criterion (Schimmack, 2017).
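Because an exact replication is independent of the original study, this definition implies that replicability equals the statistical power of the replication study. A small simulation sketch of the definition (illustrative parameters, not from any specific dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the definition: among studies with a significant result,
# how often does an exact replication (same n, same alpha) succeed?
n, d, z_crit = 20, 0.5, 1.96            # per-group n, true effect, critical z
se = np.sqrt(2 / n)                     # standard error of the d estimate
original = rng.normal(d, se, 100_000) / se     # observed z-scores
replication = rng.normal(d, se, 100_000) / se  # independent exact replications
sig = original > z_crit                        # "published" significant results
replicability = np.mean(replication[sig] > z_crit)
```

With these parameters, true power is about .35, and the estimated replicability among significant originals converges to the same value, illustrating that selection for significance does not raise replicability above power.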

BLOGS BY YEAR: 2019, 2018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2020): Z-Curve.2.0 (with R-package) 




  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. Also includes a detailed analysis of time trends in replicability from 2010 to 2018.

2.  Introduction to Z-Curve with R-Code

This post presented the first replicability ranking and explains the methodology used to estimate the typical power of a significant result published in a journal. The post explains the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores. The method has since been extended to a wider range of z-scores with a model that allows for heterogeneity in power across tests. A description of the new method will be published when extensive simulation studies are completed.
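The conversion step can be sketched in a few lines: any test statistic is first mapped onto its two-sided p-value, which is then mapped onto an absolute z-score. The function below is my own illustration for t-statistics, not the published R code:

```python
from scipy import stats

def t_to_abs_z(t_value, df):
    """Convert a t-statistic into an absolute z-score by mapping its
    two-sided p-value onto the standard normal distribution."""
    p = 2 * stats.t.sf(abs(t_value), df)  # two-sided p-value of the t-test
    return stats.norm.isf(p / 2)          # z-score with the same p-value

# With large df, t and z converge, so t_to_abs_z(2.0, 1000) is close to 2.0;
# with small df, the heavier t tails yield a somewhat smaller z.
```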


3. An Introduction to the R-Index


The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
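The formula can be sketched in a few lines of code (a simplified illustration based on the definition above, not the official implementation; observed power is approximated from absolute z-scores):

```python
import numpy as np
from scipy import stats

def r_index(z_scores, z_crit=1.96):
    """Sketch of the R-Index: median observed power minus inflation,
    where inflation = success rate - median observed power."""
    z = np.abs(np.asarray(z_scores, dtype=float))
    obs_power = stats.norm.sf(z_crit - z)       # observed power of each study
    mop = float(np.median(obs_power))           # median observed power
    success_rate = float(np.mean(z > z_crit))   # share of significant results
    inflation = success_rate - mop
    return mop - inflation                      # equivalently 2 * MOP - success rate

# A set of uniformly just-significant results yields a low R-Index,
# signaling that the 100% success rate is inflated by publication bias.
```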


4.  The Test of Insufficient Variance (TIVA)


The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After test results are converted into z-scores, the z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
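The logic of TIVA can be sketched as follows (my own Python illustration of the steps described above; variable names are mine):

```python
import numpy as np
from scipy import stats

def tiva(p_values):
    """Test of Insufficient Variance: convert two-sided p-values into
    absolute z-scores and test whether their variance falls below the
    expected value of 1 with a left-tailed chi-square test."""
    z = stats.norm.isf(np.asarray(p_values, dtype=float) / 2)  # p -> |z|
    k = len(z)
    var_obs = float(np.var(z, ddof=1))          # sample variance of z-scores
    chi2_stat = (k - 1) * var_obs / 1.0         # expected variance is 1
    p_left = stats.chi2.cdf(chi2_stat, df=k - 1)  # left tail: too little variance
    return var_obs, p_left

# Ten just-significant p-values (all between .01 and .05) produce a
# suspiciously small variance of z-scores, and TIVA flags the set as biased.
var_obs, p = tiva([.04, .03, .049, .02, .045, .01, .038, .025, .047, .015])
```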

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.” The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women, and this threat leads to lower performance. This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting. After correcting for these effects, the stereotype-threat effect was negligible. This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype threat. These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.  Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.
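The a-priori power calculation that GPower performs for a two-group comparison can be approximated with the normal distribution (a simplified sketch; GPower itself uses the noncentral t-distribution, so exact values differ slightly):

```python
import numpy as np
from scipy import stats

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sample test of a standardized mean
    difference d, using the normal distribution."""
    ncp = d * np.sqrt(n_per_group / 2)    # noncentrality parameter
    z_crit = stats.norm.isf(alpha / 2)    # critical z, 1.96 for alpha = .05
    # probability of a significant result under the alternative hypothesis
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

# The textbook case: a medium effect (d = .5) with 64 participants per
# group yields roughly 80% power; a small effect (d = .2) with the same
# sample size leaves the type-II error rate far above the conventional 20%.
```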


8.  The Problem with Bayesian Null-Hypothesis Testing


Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that the effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published. Selective publishing of confirmatory results undermines the credibility of science and of claims about the importance of stereotype threat in explaining gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

Why Are Red States “Immune” to Covid-19?

Joey loves crowds. He is boisterous, speaks with a loud and booming voice, and is always ready to high-five everybody. No, I am not describing a super-spreader of Covid-19. This is a textbook description, or caricature, of an extrovert, or, as personality psychologists say, an extravert.

Personality psychologists have studied extraversion and introversion for nearly one hundred years, although most of the research emerged in the past 40 years. We know that extraversion is a heritable trait that runs in families. We know that it remains fairly stable throughout adulthood, and we know that it influences behavior. There has also been research on regional variation in extraversion across the world and across US states (Elleman, Condon, Russin, & Revelle, 2018). I used their data to create the map of extraversion for US states. The map shows the highest level of extraversion in Illinois and the lowest level in Wyoming, followed by Idaho and Utah. While Illinois has fairly high rates of Covid-19, especially in Chicagoland, Wyoming and Idaho have relatively low levels of positive cases. They are also solid “red” states that voted for Trump in the 2016 election with 67% and 59% of the vote, respectively. It is therefore possible that extraversion partially explains why Covid-19 is more prevalent in “blue” (liberal) states. Residents in blue states may be more extraverted and may have a harder time following social distancing rules.

Of course, extraversion would only be one of several factors that play a role. Another obvious factor is that urban areas are more strongly affected by Covid-19 than rural areas, and rural voters are more likely to vote for Trump. There are many other possible differences among the US states that might play a role, but preliminary analysis suggests that they do not predict Covid-19 to a substantial degree. So, to keep things short, I will focus on the two factors that I found to be significant predictors of the spread of Covid-19: urbanization and extraversion.

To examine whether this relationship is stable over time, I used confirmed positive cases reported on the Covid-Tracking website and created indicators for three three-week periods: March 23 to April 12, April 13 to May 3, and May 4 to May 24. Predictor variables were (a) the percentage of votes for Trump in the 2016 election, (b) extraversion scores from the supplement to Elleman et al.’s article (Table 8), and (c) urbanization scores (Wikipedia).

The data were analyzed using structural equation modeling to examine the relationship among the six variables. [I also examined more complex models that included deaths. The effects of the predictor variables on death were mostly mediated by confirmed positives, with the exception of a unique, negative relationship between Trump support and deaths at time 1 only.] Model fit was excellent, CFI = 1.00, RMSEA = .000. This does not mean that the model reveals the truth, but it does show that the model is consistent with the data. Thus, the model tells one possible story about the negative relationship between Trump support and Covid-19 deaths across the states.

The numbers show that urbanization is a much stronger negative predictor of Trump support than extraversion. The effect of extraversion is small and not statistically significant by conventional standards, but there are only 49 states in the analysis (I excluded the island state Hawaii), making it hard to reach statistical significance. The effects of urbanization and extraversion on Covid-19 cases are more equal and explain a notable amount of variation across states. The numbers also show that the effect is not getting weaker over time; it may actually be getting stronger. This means that both urbanization and extraversion are stable predictors of the Covid-19 pandemic in the USA. Even in the past three weeks, after several states with Republican governors eased restrictions, there is no evidence that cases are increasing notably in red states.

It is not difficult to find newspaper articles that talk about a second wave and spikes in new cases in Texas or other red states. These stories are based on the idea that red states are ignoring the danger of Covid-19, but so far this idea lacks empirical support. For every story of a pastor who died of Covid-19 after defying stay-at-home orders, there are thousands of churches that are holding services online, and hundreds of people flouting social-distancing norms in Central Park, NY. Don’t get me wrong. Trump’s disregard of science and ramblings about light therapy are a disgrace, but this does not mean that 40% of the US population follows the covidiot in the White House and drinks bleach. At least forty percent of US voters are likely to vote for him again. Don’t ask me how anybody can vote for him again. That is not the point of this blog post. The blog post is about the empirical fact that so far Covid-19 has predominantly hit states that did not vote for Trump. I suggest that this seemingly paradoxical finding is not paradoxical at all. Joey, the extraverted bachelor who lives in an apartment in New York City and voted for Hillary, is much more likely to get infected than Joyce, who lives with her family on a farm in Wyoming. Painting all Trump voters as covidiots would be a similar mistake as Hillary Clinton calling Trump supporters a “basket of deplorables.” If all Trump supporters were covidiots, we should have seen a positive relationship between Trump support and Covid-19 cases, especially after controlling for confounding variables like urbanization and extraversion. The fact that this positive relationship stubbornly refuses to emerge may suggest that Republican governors and residents in red states are not as stupid as their leader.

The Covid-Statistic Wars are Heating Up

After a general consensus or willingness to accept social distancing measures imposed by politicians (often referred to as lock-downs), societies are polarizing. Some citizens want to open stores, bars, and restaurants (and get a hair cut). Others want to keep social distancing measures in place. Some people on both sides are not interested in scientific arguments for or against their position. Others like to find scientific evidence that seemingly supports their viewpoint. This abuse of science is becoming more common in a polarized world. As a scientist, I am concerned about the weaponizing of science because it undermines the ability of science to inform decisions and to correct false beliefs. Psychological research has shown how easily we assimilate information that matches our beliefs and treat disconfirming evidence like a virus. These motivated biases in human reasoning are very powerful and even scientists themselves are not immune to these biases.

Some economists appear to be afflicted by a bias to focus on the economic consequences of lock-downs and to downplay the effects of the virus itself on human lives and the economy. The idea is that lock-downs did not help to save lives and came at immense costs to the economy. I am not denying the severe consequences of unemployment (I actually co-authored an article on unemployment and well-being), but I am shocked by claims in a tweet, retweeted 3,500 times, that social distancing laws are ineffective, or by blog posts that make similar claims accompanied by scatterplots that give the claims the appearance of scientific credibility.


There is nothing wrong with these graphs. I have examined the relationship between policies and Covid-19 deaths across US states and across countries, and I have also not found a significant correlation. The question is what this finding means. Does it imply that lock-down measures were unnecessary and have produced huge economic costs without any benefits? As some responses on Twitter indicated, interpreting correlational data is not easy because many confounding factors influence the correlation between two variables.

Social distancing is unnecessary if nobody is infected

Let’s go back in time and impose social distancing policies across the world in May 2019 randomly in some countries and not in others. We observe that nobody is dying of Covid-19 in countries with and without ‘lock-down’. In addition, countries with lock-down suffer high rates of unemployment. Clearly, locking countries down without a deadly virus spreading is not a good idea. Even in 2020 some countries were able to contain relatively small outbreaks and are now mostly Covid-free. This is more or less true of countries like Taiwan, Australia, and New Zealand. However, these countries impose severe restrictions on travel to ensure that no new infections are brought into the country. When I tried to book a flight from Toronto to Sydney, I was not able to do so. So, the entire country is pretty much in lock-down to ensure that people in Australia cannot be infected by visitors from countries that have the virus. Would economists argue that these country-wide lock-downs are unnecessary and only hurt the tourist industry?


The fact that Covid-19 spread unevenly across countries also creates a problem for the correlation between social-distancing policies and Covid-19 deaths across countries. The more actively countries are trying to stem the spread of the virus, the more severe social-distancing measures will be, while countries without the virus are able to relax social distancing measures. Not surprisingly, some of the most severe restrictions were imposed at the peak of the epidemics in Italy and Spain. This produces a positive correlation between the severity of lock-downs and the spread of Covid-19, which could be falsely interpreted as evidence that lock-downs even increase the spread of Covid-19. A simple correlation between lock-down measures and Covid-19 deaths across countries is simply unable to tell us anything about the effects of lock-down measures on deaths within countries.

Social Distancing Effects are Invisible if there is no Variation in Social Distancing Across Countries

To examine the effectiveness of social-distancing measures, we need to consider timing. First, social distancing measures may be introduced in response to a pandemic. Later on, we might see that countries or US states that imposed more severe restrictions were able to slow down the spread of the virus more. However, now we encounter a new problem. Most countries and states responded to the WHO’s declaration of Covid-19 as a pandemic on March 11 with very similar policies (school closures). This makes it difficult to see the effects of social distancing measures because we have little variation in the predictor variable. We simply do not have a large group of countries with a Covid-19 epidemic that did nothing. This means we lack a proper control group to see whether the spread in these countries would be bigger than in countries with severe lock-downs. Even countries like the UK closed schools and bars in mid-March.

Sweden is often used as the example of a country that did not impose severe restrictions on citizens and kept schools open. It is difficult to evaluate the outcome of this political decision. Relative to its population, Sweden ranks number 6 in the world in terms of Covid-19 deaths, but what is the proper comparison standard? Italy and Spain had more severe restrictions and more deaths, but their epidemics started earlier than Sweden’s. Other Nordic countries like Norway, Denmark, and Finland have much lower fatality rates than Sweden. This suggests that social distancing is effective in reducing the spread, but we do not have enough data for a rigorous statistical analysis.

Social Distancing Policies Explain Trajectories of Covid-19 Spread in Hot-Spots

One advantage of epidemics is that it is possible to foresee the future, because exponential growth produces a very notable trajectory over time that is hard to miss in statistical analyses. If every individual infects two or three other people, the number of cases will grow exponentially until a fairly large proportion of the population is infected. This is not what happened in Covid-19 hot spots. Let’s examine New York as an example. In mid-March, the number of detected cases and deaths increased exponentially, with numbers doubling every three days.

The number of new cases peaked in the beginning of April and has been decreasing until now. One possible explanation for this pattern is that social-distancing policies that were mandated in mid-March were effective in slowing down the spread of the virus. Anybody who claims that lock-downs are ineffective needs to provide an alternative explanation for the trajectory of Covid-19 cases and deaths over time.

Once more, it is difficult to show empirically what would have happened without “lock-downs”. The reason is that even in countries that did not impose strict rules, people changed their behaviors. Once more, we can use Sweden as a country without ‘lock-down’ laws. As in New York, we see that rapid exponential growth was slowed down. This did not happen while people were living their lives as they did in January 2020. It happened because many Swedes changed their behaviors.

The main conclusion is that the time period from March to May makes it very difficult to examine scientifically which measures were effective in preventing the spread of the virus and which measures were unnecessary. How much does wearing masks help? How many lives are saved by school closures? The best answer is that we do not have clear answers to these important questions, because there was insufficient variation in the response to the pandemic across nations or across US states. Most of the variation in Covid-19 deaths is explained by the connectedness of countries or states to the world.

Easing Restrictions and Covid-19 Cases

The coming months provide a much better opportunity to examine the influence of social distancing policies on the pandemic. Unlike New Zealand and a few other countries, most countries do have community transmission of Covid-19. The United States provides a naturalistic experiment because (a) the country has a large population and thus many new cases each day and (b) social distancing policies are made at the level of the 50 states.

Currently, there are still 20,000 new confirmed (!) positive cases each day in the United States. There are also still over 1,000 deaths per day.

There is also some variation across states in the speed and extent to which states ease restrictions on public life (NYT.05.20). Importantly, there is no state where residents are simply going back to life as it was in January of 2020. Even states like Georgia that have been criticized for opening early are by no means back to business as usual.


So, the question remains whether there is sufficient variance in opening measures to see potential effects in case-numbers across states.

Another problem is that it is tricky to measure changes in case numbers or deaths when states have different starting levels. For example, in the past week New York still recorded 41 deaths per 1 million inhabitants, while Nebraska recorded only 13 deaths per 1 million inhabitants. However, in terms of percentages, cumulative deaths in New York increased by only 3%, whereas the increase in Nebraska was 23%. While a strong ‘first wave’ accounts for the high absolute number in New York, it also accounts for the low percentage value. A better outcome measure may be whether weekly numbers are increasing or decreasing.
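The arithmetic behind this comparison is easy to verify: a weekly count and the percentage increase it represents jointly imply the prior cumulative toll (numbers taken from the text; the function is my own illustration):

```python
def implied_cumulative(weekly_deaths_per_m, pct_increase):
    """Back out the cumulative death toll (per million) implied by a
    weekly count and the percentage increase it represents."""
    return weekly_deaths_per_m / (pct_increase / 100.0)

# New York: 41 weekly deaths/M at a 3% increase implies a prior toll
# near 1,367 deaths/M; Nebraska: 13 weekly deaths/M at a 23% increase
# implies a prior toll near 57 deaths/M. The percentage measure is
# dominated by the size of the first wave, not by the current trend.
```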

Figure 1 shows the increase in Covid-19 deaths in the past 7-days (May 14 – May 20) compared to the 7 days after some states officially eased restrictions (May 2 – May 8).

It is clearly visible that states that are still seeing high numbers of deaths are not easing restrictions (CT, NJ, MA, RI, PA, NY, DE, IL, MD, LA). It is more interesting to compare states that did not see a big first wave but vary in their social distancing policies. For this comparison, I limited the analysis to the remaining states.

States below the regression line are showing faster decreases than other states, whereas states above the regression line show slower decreases or increases. When the opening policies on May 1 (NYT) are used as predictors of deaths in the recent week, with deaths two weeks before as a covariate, a positive relationship emerges, but it is not statistically significant. It would be a statistical fallacy to infer from this finding that policies have no influence on the pandemic.

More important is the effect size, which is likely to be somewhere between -2 and +4 deaths per million. This may seem like a small difference, but we have to keep in mind that there is little variation in the predictor variable. Remember, even in Georgia, where restaurants are open, the number of diners is only 15% of the normal number. The hypothetical question is how much bigger the number of Covid-19 cases would be if restaurants were filled to capacity and all other activities were back to normal. It is unlikely that citizens of open states are willing to participate in this experiment. Thus, data alone simply cannot answer this question.


Empirical sciences rely on data and data analysis. However, data are necessary but not sufficient to turn a graph into science. Science also requires proper interpretation of the results and honest discussion of their limitations. It is true that New York has more Covid-19 deaths than South Dakota. It is also true that some states like South Dakota never imposed severe restrictions. This does not imply that stay-at-home orders in New York caused more Covid-19 deaths. Similarly, the lack of a correlation between Covid-19 policies and Covid-19 cases or deaths across US states does not imply that these policies have no effect. Another explanation is that there are no states that had many deaths and did not impose stay-at-home orders. For this reason, experts have relied on models of epidemics to simulate scenarios of what would have happened if New York City had not closed schools, bars, and night clubs. These simulations suggest that the death toll would have been even greater. The same simulations also suggest that many more lives could have been saved if New York City had been closed down just one week earlier (NPR). Models may sound less scientific than hard data, but data are useless and can be misleading when the necessary information is missing. The social-distancing measures that were imposed world-wide did reduce the death toll, but it is not clear which measures reduced it by how much. The coming months may provide some answers to these questions. S. Korea quickly closed bars after one super spreader infected 40 people in one night (businessinsider). What will happen in Oklahoma, where bars and nightclubs are reopening? Personally, I think the political conflict about lock-downs is unproductive. The energy may be better spent on learning from countries that have been successful in controlling Covid-19 and that are able to ease restrictions.

Reply to Vianello, Bar-Anan, Kurdi, Ratliff, and Cunningham

I published a critique of the Implicit Association Test. Using structural equation modeling of multi-method studies, I find low convergent validity among implicit measures of attitudes (prejudice, self-esteem, political orientation) and low discriminant validity between explicit and implicit measures. The latter finding is reflected in high correlations between factors that reflect the shared variance among explicit measures and the shared variance among implicit measures. Using factor loadings to quantify validity, I found that the controversial race IAT has at most 20% valid variance in capturing racial attitudes. Most if not all of this variance is shared with explicit measures. Thus, there is no evidence that IAT scores reflect a distinct form of implicit prejudice that may operate outside of conscious awareness.

This article elicited commentaries by Vianello and Bar-Anan (ref.) and by Kurdi, Ratliff, and Cunningham (pdf). Here is a draft of my response to their commentaries. As you will see, there is little common ground; even the term “validity” is not clearly defined, making any discussion about the validity of the IAT meaningless. To make progress as a science (or to become a science), psychologists need to have a common understanding of psychological measurement and methods that can be used to evaluate the validity of measures quantitatively.


Just like pre-publication peer-reviews, the two post-publication commentaries have remarkably little overlap. While Vianello and Bar-Anan (VBA) question my statistical analyses, Kurdi, Ratliff, and Cunningham (KRC) accept my statistical results but argue that these results do not challenge the validity of the IAT.

VBA’s critique is clearer and therefore easier to refute by means of objective model comparisons. The key difference between VBA’s model and my model is the modelling of method variance. VBA’s model assumes that all implicit measures of different constructs are influenced by a single method factor. In contrast, my model assumes that implicit measures of prejudice (e.g., the standard race IAT and the Brief Implicit Association Test with the same racial stimuli) share additional method variance. As these are nested models, it is possible to test the competing hypotheses directly against each other. The results show that a model with content-specific method variance fits the data better (Schimmack, 2020a). The standard inference from a model comparison test is that the model with the worse fit is not an adequate model of the data, but VBA ignored the poorer fit of their model and presented a revised model that does not model method variance properly and therefore produces misleading results. Thus, VBA’s commentary is just another demonstration of the power of motivated reasoning that undermines the idealistic notion of a self-correcting science.

KRC ask whether my results imply that the IAT cannot be a valid measure of automatic cognition. To provide a meaningful answer to this question, it is important to define the terms valid, measure, automatic, and cognition. The main problem with KRC’s comment is that these terms remain undefined. Without precise definitions, it is impossible to make scientific progress. This is even true for the concept of validity, which has no clear meaning in psychological measurement (Schimmack, 2020c). KRC ignore that I clearly define validity as the correlation between IAT scores and a latent variable that represents the actual variation in constructs such as attitudes towards race, political parties, and the self. My main finding was that IAT scores have only modest validity (i.e., low correlations with the latent variable or low factor loadings) as measures of racial preferences, no validity as a measure of self-esteem, and no proven validity as measures of implicit constructs that are distinct from the attitudes reflected in self-report measures. Instead, KRC consistently mischaracterize my findings when they write that “the reanalyses reported by Schimmack find high correlations between relatively indirect (automatic) measures of mental content, as indexed by the IAT, and relatively direct (controlled) measures of mental content.” This statement is simply false and confuses correlations of measures with correlations of latent variables. The high correlations between latent factors that represent shared variance among explicit measures and implicit measures provide evidence of low discriminant validity, not evidence of high validity. Moreover, the modest loadings of the race IAT on the implicit race factor show low validity of the IAT as a measure of racial attitudes.

After mischaracterizing my results, KRC go on to claim that my results do “not cast any doubt on the ability of IATs to index attitudes or to do so in an automatic fashion” (p. 5).  However, the low convergent validity among implicit measures remains a problem for any claims that the IAT and other implicit measures measure a common construct with good validity. KRC simply ignore this key finding even though factor loadings provide objective and quantitative information about the construct validity of IAT scores.

The IAT is not the only research instrument with questionable construct validity. However, the IAT is unique because it became a popular measure of individual differences without critical evaluation of its psychometric properties. This is particularly problematic when people are given feedback with IATs on the Project Implicit website, especially for IATs that have demonstrably no validity, like the self-esteem IAT. The developers of the IAT and KRC defend this practice by arguing that taking an IAT can be educational: “At this stage in its development, it is preferable to use the IAT mainly as an educational tool to develop awareness of implicit preferences and stereotypes.” However, it is not clear how a test with invalid results can be educational. How educational would it be to provide individuals with randomly generated feedback about their intelligence? If this sounds unethical, it is not clear why it is acceptable to provide individuals with misleading feedback about their racial attitudes or self-esteem. As a community, psychologists should take a closer look at the practice of providing online feedback with tests that have low validity, because this practice may undermine trust in psychological science.

KRC’s commentary also fails to address important questions about the sources of stability and change in IAT scores over time. KRC suggest that “the jury is still out on whether variation in responding on the IAT mostly reflects individual differences or mostly reflects the effects of the situation” (p. 4). The reason why two decades of research have failed to answer this important question is that social cognition researchers focus on brief laboratory experiments that have little ecological validity and that are unable to demonstrate stability of individual differences over time. However, two longitudinal studies suggest that IAT scores measure stable attitudes rather than context-dependent automatic cognitions. Wil Cunningham, one of the commentators, provided first evidence that variance in IAT scores mostly reflects random measurement error and stable trait variance, with no evidence of situation-specific state variance (Cunningham et al., 2001). Interestingly, KRC ignore the implications of this study. This year, an impressive study examined this question with repeated measurements over a six-year period (Onyeador et al., 2020; Schimmack, 2020). The results confirmed that even over this long time-period, variance in IAT scores mostly reflects measurement error and a stable trait without notable variance due to changes in situations.

Another important topic that I could only mention briefly in my original article is incremental predictive validity. KRC mention Kurdi et al.’s (2019) meta-analysis as evidence that the IAT and self-report measures tap different constructs. They fail to mention that the conclusions of this meta-analysis are undermined by the lack of credible, high-powered studies that could demonstrate incremental predictive validity. To quote Kurdi et al.’s abstract: “most studies were vastly underpowered” (p. 569). The authors conducted tests of publication bias but did not find evidence for it. The reason could be that they used tests that have low power to detect publication bias. Some studies included in the meta-analysis are likely to have reported inflated effect sizes due to selection for significance, especially costly fMRI studies with tiny sample sizes. For example, Phelps et al. (2000) report a correlation of r(12) = .58 between scores on the race IAT and differences in amygdala activation in response to Black and White faces. Even if we assume that 20% of the variance in the IAT is valid, the correlation corrected for invalid variance would be r = .58/√.20 = 1.30. In other words, this correlation is implausible given the low validity of race IAT scores. The correlation is also much stronger than the predictive validity of the IAT in Kurdi et al.’s meta-analysis. The most plausible explanation for this result is that researchers’ degrees of freedom in fMRI studies inflated this correlation (Vul et al., 2009). Consistent with this argument, effect sizes in studies with larger sample sizes are much smaller, and evidence of incremental predictive validity can be elusive, as in Greenwald et al.’s study of the 2018 election. At present, there is no pre-registered, high-powered study that provides clear evidence of incremental predictive validity. Thus, IAT proponents have failed to respond to Blanton et al.’s (2009) critique of the IAT.
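The arithmetic behind the implausible r = 1.30 is the classical correction for attenuation, which divides the observed correlation by the square root of the proportion of valid variance. A minimal sketch (the 20% valid-variance figure is the assumption stated above):

```python
import math

def disattenuate(r_observed: float, valid_variance: float) -> float:
    """Correct an observed correlation for invalid variance in one measure
    (Spearman's correction, applied here only on the IAT side)."""
    return r_observed / math.sqrt(valid_variance)

# Phelps et al.'s r(12) = .58, assuming 20% valid variance in race-IAT scores
r_corrected = disattenuate(0.58, 0.20)
print(round(r_corrected, 2))  # 1.3 -- impossible, since correlations cannot exceed 1
```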
Responses to my renewed criticism suggest that IAT researchers are unable or unwilling to respond to valid scientific criticism of the IAT with active coping. Instead, they prefer to engage in emotion-focused, repressive coping that makes IAT researchers feel better without addressing substantive measurement problems.

In conclusion, my critique of the IAT literature and the response by IAT researchers show a wider problem in psychology that I have called the validation crisis (Schimmack, 2020c). Although measurement is at the core of any empirical science, many psychologists lack formal training in psychological measurement. As a result, they create and use measures of unknown validity. This is particularly true for social psychologists, because the field in the 1970s and 1980s actively rejected the idea that characteristics within individuals are important for understanding human behavior (“the power of the situation”). However, when the cognitive revolution started, the focus shifted from observable situations and behaviors to mental states and processes. Studying phenomena that are not directly observable requires valid measures, just as telescopes need to be validated to observe planets in distant galaxies. The problem is that social cognition researchers developed methods like the IAT to make claims about cognitive processes that are not observable to outsiders or by means of introspection, without taking the time to validate these measures. To make progress, the next generation of social psychologists needs to distinguish clearly between constructs and measures and between random and systematic measurement error. As all measures are contaminated by both sources of measurement error, constructs need to be measured with multiple, independent methods that show convergent validity (Campbell & Fiske, 1959; Cronbach & Meehl, 1955). Psychology also needs to move from empty qualitative statements like “the IAT can be valid” to empirically based statements about the amount of validity of a specific IAT in specific populations in clearly defined situations. This requires a new program of research with larger samples, ecologically valid situations, and meaningful criterion variables.


Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94, 567–582. doi:10.1037/a0014665

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. doi:10.1037/h0040957

Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12(2), 163–170. https://doi.org/10.1111/1467-9280.00328

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., Tomezsko, D., Greenwald, A. G., & Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74(5), 569–586. doi:10.1037/amp0000364

Onyeador, I. N., Wittlin, N. M., Burke, S. E., Dovidio, J. F., Perry, S. P., Hardeman, R. R., … van Ryn, M. (2020). The Value of Interracial Contact for Reducing Anti-Black Bias Among Non-Black Physicians: A Cognitive Habits and Growth Evaluation (CHANGE) Study Report. Psychological Science, 31(1), 18–30. https://doi.org/10.1177/0956797619879139

Phelps, E. A., O’Connor, K. J., Cunningham, W. A., Funayama, E. S., Gatenby, J. C., Gore, J. C., & Banaji, M. R. (2000). Performance on indirect measures of race evaluation predicts amygdala activation. Journal of Cognitive Neuroscience, 12(5), 729–738. doi:10.1162/089892900562552

Schimmack, U. (2020a). Open Communication about the invalidity of the race IAT. https://replicationindex.com/2019/09/15/open-communication-about-the-invalidity-of-the-race-iat/

Schimmack, U. (2020b). Racial bias as a trait. https://replicationindex.com/2019/11/28/racial-bias-as-a-trait/ (retrieved 4/21/20)

Schimmack, U. (2020c). The validation crisis. Meta-Psychology (blog)

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274–290. https://doi.org/10.1111/j.1745-6924.2009.01125.x

Covid-19 behaves like tourists

Many people are wondering about variation in the Covid-19 pandemic across countries. Why (the north of) Italy and not Portugal? How was South Korea able to contain the virus while other countries, despite having more time to prepare, did not? The New York Times published a long article that examined this question, but nobody really knows.

Some of the speculations focus on biological factors that may protect individuals or may make them more vulnerable. However, so far these factors explain a small portion of the variation in death rates. The biggest predictor is the number of people who are infected by the virus. Australia and New Zealand have few deaths because Covid-19 did not spread widely among their populations.

One possible explanation could be how countries responded to the pandemic. Countries like the UK and Sweden may have more deaths because they did not impose lockdowns. The problem with such speculations is that many factors are likely to contribute to the variation, and it is difficult to spot these factors without statistical analyses.

The NYT article mentions that hundreds of studies are underway to look for predictors of variation across nations, but no results are being mentioned. Maybe researchers are cautious.

“Doctors who study infectious diseases around the world say they do not have enough data yet to get a full epidemiological picture, and that gaps in information in many countries make it dangerous to draw conclusions.”

Drawing conclusions is different from exploring data. There is nothing dangerous about exploring patterns in data. Clearly many people are curious and statistical analysis can provide more valuable information than armchair speculations about climate or culture.

As a cross-cultural psychologist, I am familiar with many variables that distinguish nations from each other. The most prominent dimension is individualism: Western cultures tend to be more individualistic than Asian cultures. This might suggest that culture plays a role, because Asian cultures have had fewer Covid-19 deaths. However, individualism, as measured by Hofstede’s dimension, is a weak predictor and did not survive statistical controls. Other, less plausible dimensions also did not predict variation in Covid-19 deaths.

However, one variable that was a predictor was the number of tourists that travel to a country (tourism data).

Tourism reflects how connected a country is with the rest of the world. Australia and New Zealand are not only islands; they are also geographically isolated, which explains why relatively few people visit these otherwise attractive locations. Covid-19 has also spared much of Eastern Europe, and many Eastern European countries rank low on the tourism index.

Additional analyses show that tourism is becoming a weaker predictor over time. The reason is the recent rise of cases and deaths in Latin America. Latin America was relatively unaffected in April, but lately Ecuador and Brazil have seen alarming increases in cases.

The graph also shows that tourism does not explain all of the differences between countries. For example, the UK has far more cases than predicted by the regression line, which may reflect the country’s slow response to the Covid-19 crisis. Sweden is also above the regression line, possibly due to its policy of keeping schools and businesses open. Switzerland is above the line as well; it is a direct neighbor of the north of Italy, where the epidemic in Europe started. Canada is above the regression line but was on it on April 15: Canada acted quickly in the beginning but is now seeing a late increase in deaths in care homes.

In conclusion, these results suggest that timing is a big factor in the current differences across countries. Countries with high death tolls were simply unlucky to be at the center of the pandemic or well connected to it. As the pandemic progresses, this factor will become less important. Some countries, like Austria and (the South of) Germany that were hit early have been able to contain the spread of Covid-19. In other countries, numbers are increasing, but no country is seeing increases as dramatic as in Italy (or New York) where Covid-19 spread before social distancing measures were in place. New factors may predict what will happen in the times of the “new normal” when countries are trying to come out of lock-downs.

I don’t think that publishing these results is dangerous. The results are what they are. It is just important to realize that they do not prove that tourism is the real causal factor; tourism may be correlated with other variables that reflect the real cause. To demonstrate this, we would need to measure these causal factors and show that they predict variation in national death tolls better than tourism does, statistically eliminating the relationship between tourism and Covid-19 deaths. So, this blog post should be seen as a piece of a puzzle rather than the ultimate answer to a riddle.
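The idea of statistically removing tourism’s relationship can be sketched with a first-order partial correlation: if a candidate causal variable z accounts for the tourism–deaths link, the correlation between tourism and deaths should drop toward zero once z is partialled out. The correlations below are hypothetical placeholders, not estimates from the actual country data:

```python
import math

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation between x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# x = tourism, y = Covid-19 deaths, z = a hypothetical 'real cause'
# (e.g., some other measure of global connectedness)
r = partial_corr(r_xy=0.60, r_xz=0.80, r_yz=0.70)
print(round(r, 2))  # 0.09 -- tourism's apparent effect largely disappears
```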

Politics vs. Science: What Drives Opening Decisions in the United States?

The New York Times published a map of the United States showing which states are opening up today, May 1.

I coded these political decisions on a scale from 1 = shut down or restricted to 3 = partial reopening and examined numerous predictor variables that might drive the decision to ease restrictions.

Some predictor variables reflect scientific recommendations such as the rate of testing or the number of deaths or urbanization. Others reflect political and economic factors such as the percentage of Trump supporters in the 2016 election.

The two significant predictors were the number of deaths adjusted for population (on a log-scale) and support for Trump in the 2016 election. The amount of testing that is being carried out in different states was not a predictor.

Another model showed that states that have not been affected by Covid-19 are more likely to open. These are states where the population is more religious, White, and rural.

It was not possible to determine which of these variables drive the effect, because the predictor variables are too highly correlated. This simply shows the big divide between “red,” rural, religious states and “blue,” agnostic, urban states.

A bigger problem than differences between states is probably the difference within states between urban centers and rural areas, where a single statewide policy is unlikely to fit the needs of both populations. A big concern remains that decisions about opening are unrelated to testing, suggesting that some states that are opening do not have sufficient testing to detect new cases that could start a new epidemic.

Covid-19 in Quebec versus Ontario: Beware of Statistical Models

I have been tracking the Covid-19 statistics of Canadian provinces for several weeks (since March 16, to be precise). Initially, Ontario and Quebec were doing relatively well and had similar statistics. Over time, however, case numbers increased, deaths (especially in care homes) rose, and the numbers diverged. The situation in Quebec was getting worse, and recently the number of deaths relative to the population was higher than in the United States. Like many others, I was surprised and concerned when the Premier of Quebec announced plans to open businesses and schools sooner rather than later.

I was even more surprised when I read an article on the CTV website that reported new research claiming that the situation in Quebec and Ontario is similar after taking differences in testing into account.

The researchers base this claim on a statistical model that aims to correct for testing bias and to estimate the true number of infections on the basis of positive test results. Doing so without a representative sample of tests will seem rather dubious to most scientists, so it would be helpful if the researchers could provide some evidence that validates their estimates. A simple validation criterion is the number of deaths: regions with more Covid-19 infections should also have more deaths, everything else being equal. Of course, differences in age structure or infections in care homes can create additional differences in deaths (i.e., case-fatality rates can differ), but as far as I know there are no big differences between Quebec and Ontario in this regard. So, is it plausible to assume that Quebec and Ontario have the same number of infections? I don’t think so.

To adjust for differences in population size, all Covid-19 statistics are expressed per capita. The table shows that Ontario has 1,234 confirmed positive cases per million inhabitants, while Quebec has 3,373 confirmed positive cases per million residents. This is not a trivial difference. There is also no evidence that the higher number in Quebec is due to more testing. While Ontario has increased testing lately, testing remains a problem in Quebec: Ontario has currently tested more (21,865 tests per million) than Quebec (19,471 tests per million). This also means that the positive rate (the percentage of positive tests; positives/tests × 100) is much higher in Quebec than in Ontario. Most important, there are 741 deaths per 10 million residents in Ontario and 2,157 per 10 million in Quebec. That means Quebec has 2.91 times more deaths than Ontario, which matches the difference in cases, where Quebec has 2.73 times more cases. It follows that Ontario and Quebec have similar case-fatality rates of 6.00% and 6.39%, respectively: out of 100 people who test positive, about 6 die of Covid-19.
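The ratios and case-fatality rates in this paragraph can be checked directly from the per-capita figures; a short sketch:

```python
# Per-capita figures reported above
on_cases_per_m, on_deaths_per_10m = 1234, 741    # Ontario
qc_cases_per_m, qc_deaths_per_10m = 3373, 2157   # Quebec

case_ratio = qc_cases_per_m / on_cases_per_m         # ~2.73x more cases in Quebec
death_ratio = qc_deaths_per_10m / on_deaths_per_10m  # ~2.91x more deaths in Quebec

# Case-fatality rate = deaths / cases (cases converted to per 10 million)
cfr_on = on_deaths_per_10m / (on_cases_per_m * 10) * 100  # ~6.00%
cfr_qc = qc_deaths_per_10m / (qc_cases_per_m * 10) * 100  # ~6.39%

print(round(case_ratio, 2), round(death_ratio, 2))  # 2.73 2.91
print(round(cfr_on, 2), round(cfr_qc, 2))           # 6.0 6.39
```

Because deaths and cases rise by nearly the same factor, the case-fatality rates of the two provinces end up almost identical, which is exactly what one would expect if the confirmed case counts are comparable measures of the underlying epidemics.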

In conclusion, there is absolutely no evidence for the claim that the Covid-19 pandemic has affected Ontario and Quebec to the same extent and that differences in testing produce misleading statistics. Rather, case numbers and deaths consistently show that Quebec is affected three times worse than Ontario. As the false claim is based on the Montreal authors’ statistical model, we can only conclude that their model makes unrealistic assumptions. It should not be used to make claims about the severity of Covid-19 in Ontario, Quebec, or anywhere else.