Replicability Index: A Blog by Dr. Ulrich Schimmack

Blogging about statistical power, replicability, and the credibility of statistical results in psychology journals since 2014. Home of z-curve, a method to examine the credibility of published statistical results.

Show your support for open, independent, and trustworthy examination of psychological science by getting a free subscription. Register here.

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the original study with the same sample size and significance criterion (Schimmack, 2017).
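
For readers who prefer code to prose, here is a minimal simulation sketch of this definition (the effect size, sample size, and alpha are illustrative assumptions, not values from any published analysis). It shows that, under the definition, the replicability of an exact replication equals the statistical power of the original design, and that selecting original studies for significance does not change this.

```python
# Minimal sketch of the definition above; d, n, and alpha are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n, alpha, n_sim = 0.4, 50, 0.05, 20_000

def significant() -> bool:
    treatment = rng.normal(d, 1, n)
    control = rng.normal(0, 1, n)
    return stats.ttest_ind(treatment, control).pvalue < alpha

# "Publish" only the significant original studies, then run one exact replication of each.
originals = np.array([significant() for _ in range(n_sim)])
replications = np.array([significant() for _ in range(int(originals.sum()))])

print("Power of a single study:      ", round(float(originals.mean()), 3))
print("Replicability after selection:", round(float(replications.mean()), 3))  # approximately equal to power
```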

See Reference List at the end for peer-reviewed publications.

Mission Statement

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.

To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools, such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0: Brunner & Schimmack, 2020; Version 2.0: Bartos & Schimmack, 2021).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science. 

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects by John Bargh also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I also started providing information about the replicability of individual researchers, along with guidelines on how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure what they are supposed to measure. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in how I ended up becoming a meta-critic of psychological science, you can read my story here (my journey).

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1–22. https://doi.org/10.15626/MP.2018.874

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. https://doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246


A Scientific Response to The Right-Wing War on Science

At its best, science is amazing. It produces discoveries that change our understanding of the world—and the world itself. Human lives have been transformed by scientific knowledge and technology, often for the better. It has certainly made my life better than that of my ancestors.

Yet science continues to be under attack. Historically, religious dogma sometimes clashed with scientific progress. It took the Catholic Church more than three centuries before Pope John Paul II formally acknowledged that Galileo was right to claim that the Earth moves around the Sun.

A more recent and devastating example is Nazi Germany, where science was subordinated to ideological pseudo-science in order to justify mass murder. The regime also drove out many Jewish scientists, some of whom later contributed to the Allied war effort. Later still, scientific progress in the Eastern Bloc was hampered by placing party loyalty above scientific excellence and evidence. These episodes illustrate a recurring lesson: science requires ethical guardrails, but it does not survive political domination.

Today, science is advancing rapidly in parts of the world, including China, for example through major investments in green energy. At the same time, the United States has increasingly undermined scientific consensus on issues such as vaccines and climate change and has placed growing pressure on scientific institutions. A number of observers warn that these developments threaten academic freedom and risk slowing scientific progress. One prominent justification for attacks on universities is the claim—advanced by some conservative academics, including Jonathan Haidt and Jordan Peterson—that universities are ideological “cesspools” in which naïve students are indoctrinated by hard-left professors.

This image of universities is both inaccurate and unscientific. For example, modern genetics has shown that humans are one species with a single, shared gene pool, not distinct biological races that can be ranked by skin color. This is not “woke ideology”; it is a straightforward empirical fact that only conflicts with racist belief systems.

Critics often argue that universities are repeating historical mistakes by ignoring science in order to impose liberal or radical-left values on campus. But what, concretely, are these alleged policies? Following the murder of George Floyd, many North American universities examined whether systemic racism contributes to a hostile climate for Black students or whether hiring practices unfairly favor applicants from privileged backgrounds. For example, universities may prefer a White applicant from Harvard whose parents also attended Harvard over a Black applicant from Michigan State University with comparable or superior qualifications.

Whether such policies reduce inequality or create new inequalities is an important and difficult empirical question. However, the underlying goal of diversity, equity, and inclusion programs—to promote fairness and equal protection—is grounded in the 14th Amendment of the U.S. Constitution. Efforts to bring social outcomes more in line with these principles are not radical; they are consistent with constitutional ideals and basic human rights. Opposition often aligns with existing power and status hierarchies rather than with empirical evidence.

It is understandable that politically conservative professors may feel out of place in departments where most colleagues are liberal. But the same is true for female police officers or Black lawyers in elite law firms. Ironically, DEI initiatives could also benefit politically conservative academics by ensuring that universities foster inclusive environments and avoid discrimination based on political orientation. In practice, this is rarely a major problem. Most professors interact with colleagues infrequently outside formal meetings, and promotions depend far more on student evaluations, publications, and grant funding than on political views.

Concerns about ideological repression are often fueled by highly visible but rare cases. Data from the Foundation for Individual Rights and Expression (FIRE, 2023) show that sanction campaigns against scholars originate from both the political left (about 52%) and the political right (about 41%), and that most cases do not result in formal discipline. When sanctions do occur, universities typically cite violations of institutional policies or professional standards. Since early 2025, however, campus politics have become more volatile. In the aftermath of the killing of conservative activist Charlie Kirk, several universities removed or suspended faculty and staff over controversial social media posts (Inside Higher Education, September 19, 2025). Similar controversies have been reported in Canada as well (RIndex, 2025).

Debates about universities and politics also ignore a crucial body of scientific evidence concerning political orientation itself. Research in behavioral genetics and personality psychology shows that political orientation is surprisingly trait-like—closer to being an extravert or introvert than to preferring Pepsi over Coke (Hatemi, 2010). Like personality traits, political orientation has a heritable component and shows substantial stability across adulthood. This stability helps explain why political campaigns spend billions of dollars targeting a small number of swing voters while most citizens vote consistently over time.

Another widespread misconception is that parents exert a strong and lasting influence on their adult children’s political views. Parents do influence political attitudes during childhood and adolescence, but this influence declines sharply in early adulthood (Hatemi, 2009). By adulthood, similarity between parents and their children is explained largely by genetic similarity rather than by parental socialization (Hatemi, 2010). This helps explain why political disagreements within families are common—and why Thanksgiving dinner conversations so often avoid politics.

The most important conclusion from this research is that adolescents are not blank hard drives waiting to be programmed by parents or professors. Adolescence and early adulthood are periods of exploration in which individuals actively gravitate toward ideas that fit their underlying dispositions. Students may encounter certain arguments or perspectives for the first time at universities, but they choose how to interpret and integrate them. Exposure is not indoctrination.

Longitudinal studies of university students support this conclusion. There is little evidence that conservative students enter university and reliably graduate as “flaming liberals” (Mariani & Hewitt, 2006). Where changes in political attitudes do occur, they are typically modest and better explained by self-selection, maturation, and peer sorting than by classroom instruction.

So why does the belief in widespread university indoctrination persist? One explanation lies in a common cognitive error: people often infer causation from temporal coincidence. When parents observe that their child goes to university and later adopts different political views, it is tempting to assume that university caused the change. Yet similar changes would often have occurred anyway, regardless of whether the student attended a secular university, a religious institution, or none at all.

In conclusion, universities create and transmit scientific knowledge. Societies that invest in science and higher education tend to produce citizens who are healthier and live longer lives. Scientific inquiry can challenge traditional beliefs that are not grounded in evidence, and this tension is unavoidable in knowledge-based societies. The solution is not to vilify universities, but to recognize that diversity of viewpoints is inevitable—and valuable. Creating learning environments that benefit all students while tolerating disagreement is central to the mission of universities. Anyone who genuinely cares about students’ learning and wellbeing should support efforts to promote diversity, equity, and inclusion. This includes tolerating different political viewpoints—but tolerance cannot extend to intolerance, racism, sexism, or ideologies that deny equal rights or basic human dignity.

Personality Science 2025: About the Author

Science is often described as objective. Given the same evidence, anyone should reach the same conclusion. In reality, things are more complicated. Even in the most rigorous sciences, researchers’ perspectives influence how they interpret evidence. This influence is even stronger in the social sciences. Psychologists, for example, cannot fully set aside their personal views when designing studies, interpreting findings, or writing textbooks. That is why it may help for you to know a little about the author of this book. 

This textbook explores fundamental questions about human nature:

  • How much are people alike, and how much do they differ?
  • To what extent is behavior influenced by situations (social norms, conformity) versus personality (values, dispositions)?
  • How much of personality is shaped by nature (genes) and how much by nurture (culture, socialization, parenting)?

Psychologists disagree about the answers to these questions. Biologically oriented psychologists emphasize evolution and genetics. Developmental psychologists highlight parenting. Social psychologists stress the power of situations. These perspectives are sometimes called paradigms. A paradigm is like a research culture with its own fundamental beliefs and research practices. Each perspective adds valuable insights, but paradigms also create blind spots and biases.

Behaviorism is a good example. Behaviorism denied the existence of personality traits; everybody was just the product of a different reinforcement schedule. It also ruled out the study of emotions and forbade self-reports. For this reason, research on personality and emotions with self-report measures only emerged in the 1980s, when the behavioristic paradigm lost its influence. I would not be a psychologist if behaviorism had lasted another couple of decades. Instead, I attended a conference in 1990 where Skinner gave his last speech to a large audience, and only a handful of psychologists clapped when he criticized cognitivism. The behavioristic paradigm was dead. At another conference, an older psychologist described himself as a prisoner of the behavioristic paradigm. That phrase stuck with me. I did not want to look back at my career and realize that I had been a prisoner. This does not mean I am without biases, but it does mean that I am not trying to sell you the personality paradigm, which has many limitations that card-carrying personality psychologists like to ignore.

The Origin of My Perspective

My journey began in 1966, in a small town in northern (West) Germany. Too young for the student revolutions of the late 1960s, I nevertheless grew up in their aftermath, surrounded by cultural shifts that reshaped much of the Western world. I was raised in a comfortable middle-class family, with a natural affinity for math and a growing interest in social issues. Once I discovered that psychology was a science—not just speculation about dreams—I knew it was the right field for me.

In 1988, I moved to West Berlin to study psychology, just one year before the fall of the Berlin Wall—an event that profoundly shaped my worldview and my appreciation of free societies. My early academic interests were in emotion research. I studied with Professor Rainer Reisenzein, who introduced me to theories of emotion, and with Professor Hubert Feger, who focused on measurement and group processes. At that stage, personality psychology did not appeal to me. The field was dominated by grand theories, such as Freud’s, that seemed disconnected from evidence. Other approaches emphasized genetics and biology in ways that, to me, echoed the dark history of Nazi eugenics. As a young student, I rejected this line of thought. 

In 1996, I began my dissertation research on how people recall their emotions: How do you know how happy you were last month, and how accurate is that judgment? That same year, I received a scholarship to study with Ed Diener at the University of Illinois, one of the leading figures in happiness research. Working with him and his students was an extraordinary experience. After defending my dissertation in 1997, I was fortunate to secure a two-year fellowship from the German Science Foundation (DFG), which allowed me to continue working with Ed Diener in Illinois. My focus shifted from emotions to personality and well-being: Why do some people consistently experience more positive and fewer negative emotions than others? Why are some people happier? Over time, my perspective expanded. Feeling good is important, but it is not the whole story. A full picture of well-being requires asking people how satisfied they are with their lives overall. Life satisfaction became the central theme of my research, and Chapter 14 of this book summarizes some key findings in this area. 

Since 2000, I have been a faculty member at the University of Toronto Mississauga, a unique campus that reflects the cultural diversity of Toronto. Most of my research has focused on happiness (subjective well-being), but since 2011 I have also been examining the research practices of psychologists. This work was motivated by increasing awareness that many results in psychology journals that end up in textbooks are not replicable. The scientific study of scientists’ behavior is called meta-science or meta-psychology. With Rickard Carlsson in Sweden, I co-founded a journal with the title “Meta-Psychology.” My awareness of the replication crisis helped me to select only credible results for this textbook. Another benefit for students is that this makes the book a lot shorter, because some research areas have few replicable findings. For example, we still know very little about the neurological differences between people that shape their personalities.

Writing this textbook as an active researcher comes with both strengths and weaknesses. On the one hand, I can bring you closer to the science itself—critiquing studies, highlighting controversies, and even sharing my own analyses. On the other hand, professional textbook writers are often more skilled at producing polished narratives. The problem with polished narratives, however, is that they often gloss over controversies and discourage critical thinking. They present findings as if they were unshakable facts. In reality, personality psychology is an emerging science, barely 50 years old, and many findings rest on shaky foundations. The aim of this book goes deeper. It introduces students to scientific thinking, critical evaluation of empirical findings, and quantitative reasoning about personality. That is why the word science appears in the title. I will make a clear distinction between empirical facts (e.g., monozygotic twins are more similar than dizygotic twins for most traits) and inferences or implications (e.g., genetic differences cause personality differences). Facts should not be denied. Inferences can and should be questioned.

As I said before, I did not want to believe in genetic differences, but the evidence became impossible to ignore. Rather than resisting it, I learned to see it differently. Genetic differences do not mean that some people are born with better genes. They mean people are different—and good societies allow everyone to be who they are. Genetic variation is a strength. This principle is true in human evolution and in human societies. Understanding differences, and understanding people who differ from us, is essential for modern life.

The scientific study of personality can also help people avoid chasing unrealistic goals rooted in social norms of perfection. Instead, we can learn to accept ourselves and become our best unique selves. This non-judgmental approach aligns with science’s aim to be objective. Whether there are truly bad, evil, or pathological personalities is a difficult question, but psychology’s history shows how dangerous it can be to label some variations as pathological. Only 50 years ago, homosexuality was considered a disorder. Today, it is accepted as a normal variation in human sexuality.

Finally, I must mention political orientation. Like sexual orientation, it has some genetic roots. Some people are drawn to familiar, traditional values; others to different cultures and new ways of living. Universities are often criticized as leftist and “woke,” accused of indoctrinating students. In reality, students’ political beliefs are largely established before they enter the classroom, and professors have little power to change them. Moreover, many conservative critiques ignore the fact that some conservative ideas are directly opposed to science. It took the Catholic Church more than three centuries to accept that Galileo, who once taught at the University of Padua, was right that the Earth revolves around the Sun.

The conflict between traditional values and science is especially sharp in psychology. Psychological science is still concentrated in a handful of mostly secular countries in Western Europe, North America, and East Asia. In the United States, science is currently under attack by right-wing conservatives. Learning about psychology as a science will expose students to progressive ideas that challenge traditional beliefs about human nature, sexuality, gender, and race. At the same time, most topics in psychology are not political, and personality psychology is less politically charged than social psychology. As you will see in Chapter 1, however, personality psychology does have its own dark history—one that is important to confront as we move forward.

A Multiverse Analysis of Regional Implicit Bias: Implicit 1 : 13 Explicit


Snyder, J. S., & Henry, P. J. (2023). Regional measures of sexual-orientation bias predict where same-gender couples live. Psychological Science, 34(7), 794–808. https://doi.org/10.1177/09567976231173903

Multiverse Analysis
OSF | A Multiverse Analysis of Snyder and Henry (2023) “Regional Measures of Sexual-Orientation Bias”

Summary

Snyder and Henry (2023) argue that county-level aggregation of IAT scores yields a reliable regional measure of anti-LGB bias that predicts where same-gender couples live, and they highlight a key adjusted regression (Table 3, Column 3) in which the implicit measure appears to outperform a single-item explicit measure. While aggregation can reduce random error, it does not by itself establish that IAT scores capture a distinct implicit construct; aggregation also stabilizes systematic method variance and sampling artifacts, and regional differences in self-presentation could affect explicit reports.

A reanalysis using a multiverse framework shows that the “implicit > explicit” contrast is highly model-dependent. In simple associations, implicit and explicit measures show similar relationships with the outcome. Across 42 reasonable specifications that vary outcome handling (raw, log-transformed, count model), weighting (with/without), and covariate inclusion (none, single covariates, full published set), only the published specification yields a statistically significant advantage for the IAT, while multiple alternatives yield either no difference or a statistically significant advantage for the explicit measure. The main conclusion is that the paper’s headline inference—implicit bias is a stronger predictor than explicit bias—is not robust to reasonable analytic choices and should be interpreted more cautiously.

Full Article


This article asks whether regional measures of sexual-orientation bias predict where same-gender couples live. The central claim is that county-level implicit bias provides predictive value beyond explicit measures, and that this pattern remains when adjusting for a set of county-level covariates and region indicators.

The key evidence is a regression framework in which the outcome is a county-level measure of same-gender couple prevalence (and/or counts), with implicit and explicit bias entered jointly, and then a “full” specification that adds a covariate set (policy environment, religion, education, income, political orientation, rurality, and census region). They interpret the implicit coefficient as the stronger (or uniquely informative) predictor in the adjusted model.

They motivate covariates as adjustments for structural and cultural factors that could correlate with both attitudes and residential patterns of same-gender couples. They treat the adjusted model as closer to the causal quantity of interest: the association between bias and couple locations net of these background county characteristics.

What do IATs Measure?

Researchers disagree about what IAT scores mean. The early interpretation was that IATs capture evaluative associations that are at least partly outside conscious awareness. Low correlations between IAT scores and self-reported attitudes were often taken as support for this view. This interpretation remains common, but a growing literature challenges it.

At the individual level, IAT scores correlate only modestly with other indirect measures and with behavior, suggesting that a substantial share of variance reflects random noise and systematic method variance rather than unique, construct-valid “implicit” content. One alternative view is that IATs are indirect, error-prone measures of largely the same evaluative attitudes people can report in questionnaires, with differences between methods driven in part by measurement artifacts rather than distinct underlying constructs.

Snyder and Henry (2023) adopt a related but distinct argument at the regional level. They propose that aggregation of individual IAT scores to the county level reduces random error and yields a more reliable measure of the local “implicit climate,” which can then predict county-level outcomes. This logic is reasonable as far as reliability is concerned. However, improved reliability is not the same as improved discriminant validity.

Aggregation reduces random noise, but it also stabilizes systematic components of the measure that may vary across counties (e.g., platform- and sampling-related artifacts, regional differences in who takes the test, and other method-specific influences). The same concern applies to self-reports. Social desirability and self-presentation may differ across regions, which could attenuate the implicit–explicit correlation even if the two methods track a single underlying construct.

In the present data, the county-level correlation between the IAT measure and a single-item explicit measure is reported as r = .60. This is substantial shared variance, but it still leaves considerable unique variance in each measure. With only two methods, it is difficult to draw strong conclusions about what that unique variance represents. In particular, it is possible for two imperfect measures of the same construct to show different “unique” predictive power in regression models when systematic measurement error and correlated predictors are present. Conversely, if one measure fails to predict the outcome across reasonable model specifications, that would cast doubt on claims that it contains unique valid information about the criterion.
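
The following short simulation is my own illustrative sketch (all coefficients and noise levels are assumptions, not estimates from the Snyder and Henry data). It shows how two noisy measures of the same regional attitude, one of which also carries systematic method variance, can receive very different “unique” regression coefficients even though neither contains uniquely valid information about the outcome.

```python
# Illustrative sketch only: all coefficients and noise levels are assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 5000
attitude = rng.normal(size=n)                    # one shared regional attitude
method = rng.normal(size=n)                      # method-specific (artifact) variance
implicit = 0.7 * attitude + 0.4 * method + rng.normal(scale=0.6, size=n)
explicit = 0.8 * attitude + rng.normal(scale=0.6, size=n)
outcome = -0.5 * attitude + rng.normal(size=n)   # the outcome depends only on the shared attitude

X = sm.add_constant(np.column_stack([implicit, explicit]))
print(sm.OLS(outcome, X).fit().params[1:])
# The two "unique" coefficients diverge, although neither measure carries uniquely valid content.
```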

A further question is why one measure would be a stronger predictor than the other. One possibility is validity: the more valid measure should predict relevant outcomes more strongly. Another possibility is model dependence: when two predictors are highly correlated and both contain measurement error, small analytic choices (model form, covariate set, weighting) can shift the apparent “winner” without reflecting a stable underlying difference.

It is sometimes argued that indirect measures should outperform self-report on socially sensitive topics because self-reports are suppressed by social desirability. Yet evidence for this claim is mixed; in many anonymous contexts, people report socially sensitive attitudes with little apparent inhibition, and self-reports often predict behavior at least as well as IAT scores at the individual level. The key point for the present analysis is that differential predictive power does not, by itself, establish that IAT variance is uniquely “implicit.”

The paper’s central result for the “implicit beats explicit” claim appears in Table 3, Column 3. In that specification, the implicit measure shows a stronger negative association with the outcome than the explicit measure when both are entered together and additional county covariates are included. The authors interpret this as evidence that the aggregated IAT captures something distinct and more predictive than the explicit measure.

However, the corresponding zero-order correlations in Table 1 are comparatively balanced: the implicit and explicit measures show similar correlations with the outcome. This suggests that the divergence in Table 3 is driven by the particular multivariable specification—especially the inclusion of several covariates that are themselves strongly related to both attitudes and the outcome (e.g., political conservatism, rurality, and religiosity).

One way to address sensitivity to analytic choices is to provide a strong theoretical rationale for a specific model and, ideally, preregister it. Another is to examine robustness across a transparent set of reasonable alternatives. To that end, I conducted a multiverse analysis (MVA) that focuses on the robustness of the “implicit vs explicit” contrast.

The analysis acknowledges that the outcome is highly skewed and that the authors used a weighting scheme based on the number of IAT observations per county. Accordingly, models were estimated using (a) the raw outcome (as in the paper), (b) a log-transformed outcome, and (c) a count-model approach. Each model was estimated with and without weights. Finally, models were estimated with no covariates, with individual covariates, and with the full covariate set used in Table 3. This produced 42 specifications. For each specification, I computed and tested the difference between the implicit and explicit coefficients.
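
For readers who want to see the structure of the multiverse, the sketch below shows schematically how the 42 specifications are enumerated (3 outcome codings × 2 weighting options × 7 covariate options). The labels are placeholders rather than the variable names in the OSF scripts, where the exact single-covariate models are documented.

```python
# Schematic sketch of the specification grid; labels are placeholders, not the OSF code.
from itertools import product

outcomes = ["raw", "log", "count_model"]          # 3 ways to handle the skewed outcome
weighting = ["weighted", "unweighted"]            # with / without IAT-sample-size weights
covariate_sets = (["none"]
                  + [f"single_covariate_{i}" for i in range(1, 6)]  # placeholder single-covariate models
                  + ["full_table3_set"])                            # 7 covariate options in total

specs = list(product(outcomes, weighting, covariate_sets))
print(len(specs))  # 42

for outcome, weight, covariates in specs:
    # 1. fit the regression with the implicit and explicit measures entered jointly
    # 2. compute and test the difference between the two standardized coefficients
    pass
```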

The results indicate substantial model dependence. Only one specification produced a statistically significant “implicit > explicit” contrast—namely the specification corresponding to Table 3, Column 3. In contrast, 13 specifications produced a statistically significant advantage for the explicit measure over the implicit measure, and the remaining specifications were non-significant. In other words, the published pattern is atypical in the multiverse: modest changes to modeling decisions (e.g., outcome transformation or omitting weights) eliminate the reported contrast, and in a nontrivial subset of specifications the sign of the contrast reverses.

These findings illustrate the value of robustness checks in complex observational analyses. The Open Data badge made it possible to evaluate the sensitivity of the headline claim to reasonable analytic choices. The key conclusion from the MVA is not that the focal association is absent, but that the specific inference that “implicit bias is a stronger predictor than explicit bias” is not robust to alternative, defensible specifications.

At minimum, the results warrant a narrower interpretation: the data show that both regional measures relate to the outcome, but the direction and significance of the implicit–explicit contrast depends strongly on modeling decisions. A cautious reading is therefore that the evidence does not uniquely support the claim that the IAT measures a distinct “implicit” construct that outperforms explicit self-report at the county level.


Willful Incompetence: Questionable Modeling Practices in Implicit Bias Research

Wittgenstein might have described science as a kind of language-game: a rule-governed social practice in which the meaning of terms like evidence, effect, replication, and error comes from how they are used and challenged within the activity. On this view, science is not “just talk,” but a distinctive game whose moves are constrained by disciplined observation, measurement, and public criticism, and whose outcomes depend on shared standards for what counts as a good inference.

Consistency is a core rule of the scientific game: theoretical claims should cohere with each other and remain answerable to empirical observations. A major aim of scientific work is to identify tensions—between theory and data, between different datasets, between predictions and replications, or between measurement assumptions and outcomes—and then resolve them by revising theories, improving measurement, or redesigning studies.

In many areas of psychology, however, inconsistency-resolution is weak. Kuhn’s language helps here: psychology is often multi-paradigmatic, with several partially insulated frameworks coexisting, each with its own standards for what counts as a good question, a clean test, and a “successful” result. Under these conditions, contradictions are often routed around rather than confronted—by redefining constructs, adding boundary conditions, switching measures, or shifting statistical conventions. This is reinforced by incentives: researchers are rewarded for producing publishable results within a paradigm, not for cross-paradigmatic work that forces head-to-head adjudication.

The clearest symptom of this weakness is measurement. To study psychological questions, researchers must link variation in unobserved processes (thoughts, feelings, attitudes) to variation in observable scores. But scores are only informative if the mapping from scores to the target construct has been validated. Reliability is not enough. A procedure can produce consistent scores and still fail to measure what it claims to measure. When measures are not validated, inconsistencies proliferate. Instead of treating divergence between measures as a problem to be resolved, psychology often treats each measure as defining its own construct. This operationalist move resolves inconsistency by definition rather than by evidence.

The Implicit Association Test (IAT) is a particularly clear case because it has been promoted as a measurement breakthrough: a “window into the unconscious” (Banaji & Greenwald, 2013). Many researchers trained in this tradition have built careers on the claim that the IAT can measure attitudes and cognitions that are not fully accessible to introspection. The most famous IAT is the race IAT, which is often presented as a measure of implicit racial bias. The test is still promoted by Project Implicit as an educational tool for raising awareness about implicit bias.

Project Implicit is a nonprofit organization with annual revenues in the range of several hundred thousand dollars (Project Implicit Inc – Nonprofit Explorer – ProPublica). Revenues include consulting and donations, while the public-facing service is free: visitors complete IATs and receive feedback about their “implicit biases.” In return, Project Implicit obtains a massive dataset that fuels research on implicit bias.

This is precisely where the scientific rule about measurement matters. If the IAT is a “window into the unconscious,” then IAT scores should function as valid measures of a distinct construct—bias that is not accessible to introspection and that therefore cannot be captured by self-report. But strong claims require strong validation. The issue is not whether IAT scores show reliable individual differences. The issue is what those differences mean: do they reflect an unconscious attitude, a general preference that is also accessible to self-report, or task-specific artifacts that are not attitudes at all?

A recent article claims to show that IAT scores are valid measures of racial bias that predicts behavior (e.g., trusting a partner with money) and is distinct from self-reported bias (Axt et al., 2025). The paper mentions a critical perspective, but it does not engage with the central objection raised repeatedly in earlier exchanges: that the same pattern can arise from shared task-specific overlap among similar measures, rather than from uniquely higher validity of the IAT (Bar-Anan & Vianello, 2018; Schimmack, 2020; Vianello & Bar-Anan, 2020).

I view this pattern of citation without uptake—acknowledging criticism while leaving the underlying inference unchanged—as willful incompetence. In this case, repeating the same mistake is not “insanity” (as in the saying attributed to Einstein), because the goal is not to obtain a better result. Rather, the goal is to reproduce the same reassuring conclusion and to shield core assumptions from potentially fatal criticism.

The mistake is simple: when two measures are very similar, their strong correlation can be mistaken for evidence of higher validity. But similarity can also create shared method variance that inflates correlations and produces spurious evidence of high validity. This is not an exotic technical objection. It is a basic measurement point: two thermometers left in direct sunlight will correlate more strongly with each other than with a thermometer kept in the shade. In that case, sunlight exposure is shared method variance—it makes the instruments agree with each other while distorting the measurement of the construct of interest (temperature).

I use the correlation matrix in the 2025 paper to show that shared method variance between the two IATs inflated the validity estimates in the published article. First, consider the four implicit tasks that are supposed to measure one common construct: racial bias measured with implicit tasks. Taken at face value, the model suggests that the two IAT measures—the standard race IAT (Black vs. White) and the single-category IAT (Black only)—are “better” measures than the other tasks because their links to the common factor are stronger, while the other tasks show little relationship.

Figure 1. The model shows the influence of a hypothetical common cause on the four implicit tasks. The numbers are validity coefficients. A value of .6 can be interpreted as evidence that there is an ~ 80% probability that somebody above average on the unobserved common cause is also above average on a specific task.
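
The percentages in this figure caption, and the ones given for Figures 3 and 4 below, appear to follow the binomial effect size display (BESD), which converts a correlation r into the share of people above average on one variable among those above average on the other, approximately .5 + r/2. A minimal sketch of that conversion:

```python
# BESD sketch: share above average on the task among those above average on the factor.
def besd_share(r: float) -> float:
    return 0.5 + r / 2

for r in (0.6, 0.3, 0.8):
    print(f"r = {r:.1f}: ~{besd_share(r):.0%}")
# r = 0.6 -> ~80%, r = 0.3 -> ~65%, r = 0.8 -> ~90%
```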

But this interpretation has an obvious alternative explanation: the two IATs are more similar tasks than the other measures, so they may share task-specific influences. From the correlations among the tasks alone, we cannot tell whether the IATs are better measures of racial bias, or whether they mainly share the same kind of measurement noise.

If the IATs are really better measures, they should also be better predictors of biases in the behavioral tasks. But they are not (Figure 2, an exact copy from the article, including the red frame around these correlations). For the race IAT, the correlations with the four behavioral measures are .03, .14, .05, and .11. For the AMP, they are .06, .19, .04, and .12.

The authors discuss the magnitude of these correlations, but they neglect what the pattern implies. The IATs are not better predictors of behavior than the AMP. That pattern is inconsistent with the claim that the IATs are more valid measures of the underlying bias than the AMP, and it suggests that the stronger links for the two IATs in Figure 1 are inflated by shared method variance.

Adding the behavioral tasks to the model makes it possible to separate task-specific overlap from the variance that is actually related to behavior (Figure 3). Consistent with Figure 2, this model shows that the IATs are not more valid than the AMP. A validity of about r = .3 also has a simple practical interpretation. If we treat the underlying bias as the target, an r of .3 implies only modest classification accuracy: roughly 65% of people who are above average in bias would also score above average on the IAT, but about 35% of people who are below average in bias would still score above average on the IAT and receive misleading feedback about their racial bias.

The model in Figure 3 does not tell us, however, whether the implicit tasks measure implicit bias; that is, biases that are outside of people’s awareness. To examine that, we can add self-report measures to the model. When we do, we see that the common factor behind the implicit tasks is strongly related to the self-report factor, r ~ .8 (Figure 4). This means that about 90% of people who are above average in self-reported bias are also above average on the “implicit factor.” Clearly, the implicit factor does not represent biases outside of awareness. As Greenwald and Banaji (2017) acknowledged, the IAT is an implicit (i.e., indirect) measure of bias, not a measure of implicit (i.e., unconscious) biases; this clarification by the creators of the IAT is also not cited in the article.

Conclusion

Axt et al.’s (2025) article is a prototypical example of methodological non-response. The authors cite Schimmack (2020), yet the central measurement critique—shared task similarity can inflate correlations and create the appearance of superior validity—is not incorporated into their interpretation. As a result, the paper reproduces reassuring conclusions that make the IAT appear to be a uniquely valid measure of “implicit attitudes” that predict behavior.

It is useful to distinguish two ways paradigms remain insulated from criticism. Willful ignorance is motivational: researchers avoid engaging with threatening evidence, in a way that resembles a Freudian defense mechanism. Willful incompetence, in contrast, can operate as a cognitive blind spot: once certain assumptions are built into a research tradition, they shape the analysis and interpretation of data. Like a visual illusion, the error can persist even when the ingredients for correcting it are present in the data, and outsiders can sometimes see the problem more readily than insiders. In that sense, the case provides an ironic example of “implicit” bias—not necessarily in participants, but in the inferential habits of the research program.

Freud’s specific theory of repression may not be the right model for scientific practice, but the general point stands: strong professional investments can distort how evidence is interpreted. These paradigm-protecting blind spots are reinforced when professional visibility, reputational capital, and institutional missions are tied to sustaining a central claim. What would be the point of Project Implicit if IAT scores were not presented as windows into people’s unconscious?

Is the Implicit Association Test Too Big To Fail?

The Implicit Association Test (IAT) was introduced in the late 1990s as a measure of implicit associations, cognitions, and attitudes that might not be fully captured by self-report measures. The prospect of assessing socially sensitive constructs such as prejudice using a brief reaction-time task made the method widely appealing. Early publications emphasized that implicit and explicit measures could diverge and that the IAT might detect evaluative processes that are difficult to access through introspection.

However, the IAT spread rapidly within social psychology before receiving the kind of psychometric validation typically applied to psychological tests. As a result, modest correlations between IAT scores and self-ratings were often interpreted as evidence that the IAT measures a distinct, implicit construct, rather than as possible indicators of measurement error. Theoretical discussions also suggested that implicit attitudes could reflect introspectively inaccessible processes.

An alternative interpretation of the low implicit–explicit correlations is that IAT scores contain substantial method-specific and error variance. This view is supported by three patterns in the literature.

First, different implicit measures—such as the evaluative priming task and the affect misattribution paradigm—tend to show low convergent validity with the IAT.

Second, meta-analyses consistently find that IAT scores predict behavior only weakly, often less strongly than explicit measures, and provide limited incremental validity over self-reports.

Third, latent variable analyses show that once measurement error and method variance are modeled explicitly, a single-factor model often fits as well as, or better than, two-factor models that assume distinct implicit and explicit constructs.

Given this background, an adversarial collaboration on the validity of implicit measures provided an important opportunity to evaluate competing optimistic and pessimistic interpretations of the evidence in a joint project (Axt et al., 2025). However, the project did not include a psychometrically trained critic of IAT research and did not respond to the challenges raised in Schimmack (2020). This omission may help explain why the published model rests on strong identification assumptions that were challenged by Schimmack (2020) and that strongly influence the results.

Figure 1 of the article specifies a model in which implicit and explicit attitudes are represented by separate latent variables with a correlation of r = .41. Implicit attitudes are treated as a factor identified by four tasks: the standard IAT, the single-category IAT, the evaluative priming task (EPT), and the affect misattribution paradigm (AMP). However, the factor loadings reveal that two of the measures contribute little to the factor. The loading of the EPT is approximately .08, implying that less than 1% of its variance reflects the common factor. The AMP loading of .25 similarly implies that most of its variance is unique. This leaves the two IAT variants as the primary indicators of the latent implicit factor, which makes it impossible to distinguish construct variance that may reflect implicit racial biases from shared method variance between two nearly identical tasks (one IAT with Black and White pictures, one with only Black pictures).

The published model constrains the residuals of the two IATs to be uncorrelated, thereby assuming no shared method variance. This assumption is questionable because both tasks use nearly identical procedures, differ only in stimuli, and are known to correlate strongly for method-related reasons. Previous psychometric work has shown that IATs commonly exhibit substantial method variance, and omitting such variance can lead to inflated estimates of discriminant validity (Schimmack, 2020).

To evaluate this empirically, alternative models can be fitted to the data. A one-factor model in which all 12 measures load on a single latent attitude factor does not achieve standard fit criteria (χ²(54) = 331.43, CFI = .914, RMSEA = .065). Inspection of modification indices identifies three large residual correlations: (a) between the two explicit self-report measures (SRS and ANES), (b) between the two IATs, and (c) a smaller correlation between two behavioral tasks. Adding these three theoretically plausible residual correlations yields excellent model fit (χ²(51) = 69.51, CFI = .994, RMSEA = .017). Under this model, the data no longer support a clear distinction between implicit and explicit latent factors.
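
For readers who want to reproduce this kind of check, the sketch below shows how such a one-factor model with three residual correlations can be specified in Python with the semopy package. The indicator names and the data file name are placeholders, not the labels used in the published data set.

```python
# One attitude factor for all twelve indicators, plus three residual correlations:
# the two self-reports, the two IATs, and two behavioral tasks.
# Indicator names and the file name are placeholders, not the original labels.
import pandas as pd
import semopy

model_desc = """
Attitude =~ iat + sc_iat + ept + amp + srs + anes + expl3 + expl4 + beh1 + beh2 + beh3 + beh4
srs ~~ anes
iat ~~ sc_iat
beh1 ~~ beh2
"""

data = pd.read_csv("axt2025_scores.csv")   # hypothetical file name for the open data
model = semopy.Model(model_desc)
model.fit(data)
print(semopy.calc_stats(model))            # inspect chi2, CFI, and RMSEA
```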

In contrast, the published model includes the residual correlation between the two self-report measures but not between the two IATs. Modification indices still suggest a substantial residual correlation between the two IATs (MI ≈ 43), but adding this parameter leads to identification problems under the authors’ original specification. This indicates that conclusions about discriminant validity depend heavily on the assumption that the two IATs do not share method variance—a strong assumption that may not be justified.

Alternative specifications are possible. For example, assuming equal validity for the IAT and AMP (given their similar correlations with the behavioral tasks) yields an estimated implicit–explicit latent correlation of approximately .74 (95% CI: .56–.92). Under this model, the evidence for distinct implicit and explicit constructs becomes substantially weaker, although a perfect correlation (i.e., a single shared construct) can still be rejected at the conventional 5% type-I error criterion.

Finally, it is important to distinguish between the predictive validity of latent variables and that of observed scores. The latent implicit factor in the authors’ model predicts the latent behavioral disposition with a standardized effect of about .16, which is small and consistent with prior meta-analytic estimates. However, because IAT scores include considerable error and method variance, the predictive validity of raw IAT scores is smaller still.
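
A back-of-the-envelope calculation makes this attenuation concrete; the loading used below is an illustrative assumption, roughly in line with the loadings discussed in the previous post.

```python
# Attenuation sketch (illustrative numbers): an observed score's validity is the
# latent path multiplied by the score's standardized loading on its factor.
latent_path = 0.16   # latent implicit factor -> latent behavioral disposition (reported above)
iat_loading = 0.60   # assumed standardized loading of a single IAT score (illustrative)

print(round(latent_path * iat_loading, 2))  # ~0.10 for raw IAT scores
```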

In summary, the conclusions of the adversarial collaboration depend largely on an identification constraint that prevents modeling residual covariance between two highly similar IAT tasks. When this assumption is relaxed, alternative models fit the data well and yield substantially higher correlations between implicit and explicit attitudes. This suggests that the data do not provide strong evidence for discriminant validity between implicit and explicit constructs once method variance is taken into account.


Optimism Is Not a Scientific Interpretation

A central limitation of the article is that it frames its findings partly in terms of “optimistic” versus “pessimistic” interpretations. Scientific evaluation should not depend on an emotional framing; it should rest on empirical evidence and rigorous measurement.

For example, the authors write that their results “could be viewed as encouraging for the predictive validity and utility of indirect measures,” noting that the study demonstrates that implicit attitudes “can reliably correlate with socially important behavioral outcomes” and explain variance beyond self-reports (Kang, 2024). However, the effect sizes reported in the study are consistent with 20 years of prior research in which implicit measures—whether IAT, evaluative priming, or AMP—showed small predictive validity and minimal incremental prediction beyond explicit measures. The present findings do not change this conclusion, especially when the shared method variance between IATs is taken into account.

In addition, the behavioral outcomes examined in the article are laboratory-style tasks that are only loosely connected to the real-world consequences of prejudice. To evaluate the societal relevance of implicit attitudes, research would need to examine behaviors that matter directly for marginalized groups—such as treatment in healthcare settings, hiring decisions, or the use of force in policing. Even then, decades of social psychology show that self-report attitudes (and attitude-related latent variables) have limited ability to predict specific behaviors, especially behaviors that are rare, highly constrained, or context-dependent. Thus, asking whether the IAT predicts such outcomes may be the wrong question.

A more informative scientific approach may be to study meaningful social behaviors directly, and to investigate the situational and structural conditions that shape them, rather than relying on reaction-time–based measures of attitudes. Put differently, the field may gain more by studying social behavior without attitudes than by continuing to study attitudes without social behavior.


Brian Nosek’s Blind Spot in Open Science Reforms: Validity Concerns about the IAT

Brian Nosek has been one of the most influential figures in the open science movement. As co-founder of the Center for Open Science and a leading voice in efforts to reform psychological research, he has played a central role in changing how scientists think about transparency, reproducibility, and methodological rigor.

1. Nosek’s leadership in the open science movement

Nosek was a key organizer of the Reproducibility Project: Psychology (Science, 2015), a landmark collaboration that revealed how difficult it was to replicate many published findings in psychology. The largest replication failures were concentrated in his own field of social psychology. This project helped catalyze sweeping reforms, including preregistration, open data, open materials, and improved statistical practices.

2. Reforms have addressed the replication crisis—but not the validation crisis

While these reforms have improved replicability, psychology also suffers from a validation crisis (Schimmack, 2021): many widely used measures have never been rigorously validated. Some lack evidence for reliability or construct validity. Others are used in contexts where accuracy is crucial—such as providing personal feedback or informing high-stakes decisions—despite insufficient psychometric support.

Replication ensures that results can be reproduced across studies. Validation ensures that a measure actually captures the construct it claims to measure. The second problem is deeper, and in many ways more fundamental. What is the point of replicating a result with an invalid measure?

3. Project Implicit and the IAT as an example of the validation crisis

Nosek co-founded Project Implicit, a public-facing website that provides users with interpretations of their supposed “implicit attitudes” using the Implicit Association Test (IAT). These interpretations include highly self-relevant domains such as racial prejudice, depression, or suicidality.

4. What the IAT can measure — and what it has never been shown to measure

To be clear, this critique is not about whether the IAT has some validity. It does. The IAT correlates modestly with self-report measures of the same constructs, and for statistical reasons it can show small increments in predictive validity when both measures are combined. This is consistent with treating the IAT as an alternative, less reliable way of measuring the same underlying attitudes that people can report about themselves.

The real issue is different:

The IAT has been widely promoted and interpreted as measuring a distinct “implicit” construct that is not accessible to introspection. That claim has never been scientifically validated.

After more than 20 years of research, there is no evidence that the IAT reveals hidden attitudes or unconscious biases that individuals cannot detect through honest self-reflection or awareness of their own thoughts, feelings, and behavior.

Many uses of the IAT—including individualized feedback on Project Implicit—depend on the assumption that the test measures something qualitatively different from self-report. Without evidence for such a construct, these uses have no scientific justification.

5. A blind spot: inconsistent standards between open-science ideals and IAT feedback

In his open-science work, Nosek strongly advocates for:

  • transparency
  • rigorous validation
  • honest communication of uncertainty
  • avoidance of overstated claims
  • reliable measurement
  • accurate interpretation of data

Yet Project Implicit continues to provide categorical, diagnostic-like feedback (“slight,” “moderate,” “strong” bias) without reporting confidence intervals, measurement error, or test–retest variability. Users often interpret these labels as precise and meaningful, even though the underlying measurement error is large enough to make the point estimate uninformative (Schimmack, 2025).
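
To make this concrete, here is a sketch of the kind of uncertainty information that could accompany individual feedback. All numbers are illustrative assumptions, not Project Implicit’s actual parameters.

```python
# Confidence-interval sketch using the classical standard error of measurement:
# SEM = SD * sqrt(1 - reliability). All values are illustrative assumptions.
import math

d_score = 0.35       # a user's IAT D score, labeled "moderate" in the feedback
sd_d = 0.45          # assumed standard deviation of D scores
reliability = 0.50   # assumed test-retest reliability

sem = sd_d * math.sqrt(1 - reliability)
lo, hi = d_score - 1.96 * sem, d_score + 1.96 * sem
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
# Roughly [-0.27, 0.97]: the interval spans everything from "no bias" to "strong bias".
```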

When concerns about validity have been raised—including the absence of evidence that the IAT measures anything inaccessible to introspection—Nosek has not addressed them directly. Instead, he notes that the IAT should not be used for high-stakes decisions and that Project Implicit includes disclaimers. However, these disclaimers appear on secondary pages, are not shown alongside the feedback, and use technical language that lay users are unlikely to understand (Schimmack, 2025).

There is also an important ethical dimension. In psychological research, any study that provides participants with false or misleading feedback requires explicit IRB approval for deception and a thorough debriefing that explains the deception clearly. The feedback practices used by Project Implicit would not meet these ethical standards: users are led to believe that their scores meaningfully reflect “implicit attitudes,” yet they are never fully debriefed that individual IAT results are too unreliable and invalid to support such interpretations. Instead of a clear debriefing, users receive vague disclaimers that fall short of what would ordinarily be required in research involving misleading information.

This is another example of the blind spot: practices that would be flagged as ethically problematic in research settings are treated as acceptable in public outreach, when what is needed are reforms that communicate the limitations of psychological science openly, transparently, and with humility.

6. Neglect of scientific criticism conflicts with open-science reforms

Open science demands:

  • transparent reporting
  • clear communication of uncertainty
  • avoidance of claims unsupported by evidence
  • consistent standards across all methods

However, scientific criticisms of the IAT—especially concerns about its lack of construct validity as an individual diagnostic tool—have not been addressed with the same rigor and candor that Nosek applies to other areas of psychological science.

This inconsistency creates a tension between open-science ideals and the continued public presentation of the IAT.

7. The psychological irony: a blind spot about implicit bias

This post highlights a tension that is obvious to outside observers: there is a conflict between championing transparency and honesty in psychological science and overlooking substantial, long-standing criticism of the IAT as a valid measure of implicit attitudes.

The replication crisis has shown that such contradictions are often difficult to recognize from within a research tradition, and only a few scientists have openly acknowledged being wrong (Kahneman, 2017).

In this sense, Brian Nosek’s response is a classic example of what psychologists call a “blind spot.” This does not imply anything about intent; blind spots operate outside of conscious awareness. A direct response from Brian Nosek would help clarify whether he sees this contradiction himself. Interestingly, empirical research suggests that participants are often aware of the biases that the IAT claims to reveal. It would therefore be informative to learn whether Nosek is aware of the discrepancy between advocating for open-science reforms and defending the continued use of the IAT feedback on Project Implicit.

8. The way forward: applying open science to measurement validity

For psychology to continue improving as a science, the standards of open science—transparency, empirical humility, rigorous validation—must apply equally to all psychological measures, including influential legacy instruments like the IAT.

For Project Implicit, this would require:

  • avoiding categorical labels without uncertainty metrics
  • clear communication of measurement error
  • confidence intervals around individual scores
  • transparent statements about what the IAT can and cannot measure
  • refraining from implying that the test reveals hidden mental content

Open science has pushed the field forward by demanding uncomfortable honesty about methods. Addressing the blind spot around IAT validity would extend those principles where they are most needed.


Why You Should Not Trust Your Project Implicit Feedback


Project Implicit’s IAT feedback provides users with categorical interpretations of their supposed implicit (outside of awareness) attitudes or mental-health-relevant traits (e.g., prejudice, depression), using labels such as “slight,” “moderate,” or “strong.” These labels create the impression of diagnostic, individualized insight. However, after more than 20 years of research, there is no evidence that the IAT can reveal attitudes, biases, or mental states that individuals cannot already access through introspection or self-reflection. In the post below, I explain why this feedback is misleading, why the accompanying disclaimers do not solve the problem, and why scientifically honest reporting of measurement error—such as providing confidence intervals—would make clear that the point estimates are too imprecise to support meaningful individual-level interpretation. Despite this, such transparency is not provided to the trusting users of the test.


I have repeatedly voiced concerns about the invalid feedback provided on the Project Implicit website, to no avail. You can see a response from Project Implicit and a neutral evaluation of that response by ChatGPT below my evaluation of Project Implicit’s practices.

The Implicit Association Test Does Not Know You Better Than You Know Yourself

1. Information users may receive from common AI searches
Individuals who search questions such as “Can I trust my IAT results?” or “Is my Project Implicit feedback accurate?” typically encounter reassuring and optimistic answers. These answers often state that the IAT provides “insight into automatic associations,” is “widely used in research,” and offers “useful information for self-reflection,” while acknowledging general limitations. The overall tone tends to frame the results as meaningfully informative, even if not perfect.

2. Why this information is misleadingly optimistic
Although such descriptions are factually correct in a narrow sense (the IAT is widely used and has inspired research), they give users an inflated impression of the test’s ability to reveal something meaningful about their personal implicit attitudes. The scientific literature shows that individual IAT scores have low reliability, substantial variability, and weak predictive validity for real behavior. Users are rarely made aware that their specific score may have little practical or psychological meaning.

3. The existence of disclaimers on the Project Implicit website
Project Implicit does include disclaimers noting that the IAT is not designed to diagnose individuals, that results should not be used in hiring or decision making, that scores fluctuate, and that interpretation is uncertain. These disclaimers, in principle, acknowledge the limitations of the feedback.

4. Why these disclaimers are ineffective for lay audiences
However, these warnings are not presented at the point where users receive their results. Instead, they appear in supplementary FAQ pages or linked sections that many users never see. Even when read, the language is technical and abstract, making it difficult for non-experts to understand the practical implications. Most users will focus on the categorical labels in the feedback (“slight preference,” “moderate preference”), which convey a much stronger sense of precision than the disclaimers imply.

5. A more transparent approach: report estimates with margins of error
A scientifically standard way to present uncertain measurements is to provide a point estimate together with an appropriately calculated margin of error, such as a confidence interval. This is routine in other domains, such as opinion polling. Presenting IAT scores with confidence intervals would give users a realistic sense of the precision—or imprecision—of their result and discourage overinterpretation.

6. Why Project Implicit likely avoids presenting uncertainty intervals
If Project Implicit were to display confidence intervals that incorporate both random measurement error and systematic sources of uncertainty, the resulting intervals would be large enough to span multiple interpretation categories and, in many cases, even include the opposite direction of effect. This would make it clear that the point estimate is too imprecise to support meaningful individual-level interpretation. Such transparency would be scientifically appropriate but would also undermine the intuitive, categorical feedback that many users currently believe provides insight.
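
To make this concrete, the following small simulation sketch uses the same kind of illustrative assumptions (a test-retest reliability of about .5, a D-score standard deviation of about .45, a population mean of .30, and the conventional cut-offs of .15, .35, and .65). It estimates how often a user would receive a different categorical label simply by taking the test a second time.

# Minimal sketch (assumed values): stability of categorical IAT feedback on retest.
import random

random.seed(1)
REL, SD, N = 0.50, 0.45, 100_000       # assumed reliability, SD, number of simulated users
CUTS = (0.15, 0.35, 0.65)              # commonly cited cut-offs: slight / moderate / strong

def label(score):
    return sum(score >= c for c in CUTS)   # 0 = none, 1 = slight, 2 = moderate, 3 = strong

sd_true = SD * REL ** 0.5              # SD of true scores
sd_err = SD * (1 - REL) ** 0.5         # SD of measurement error
changed = 0
for _ in range(N):
    true_score = random.gauss(0.30, sd_true)     # assumed population mean D of 0.30
    first = true_score + random.gauss(0, sd_err)
    second = true_score + random.gauss(0, sd_err)
    changed += label(first) != label(second)
print(f"{changed / N:.0%} of simulated users get a different label on retest")
# With these assumptions, label changes on retest are the rule rather than the exception.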


Response from Project Implicit

Hello Professor Schimmack,

Thank you for your message. We appreciate you taking the time to share your concerns about the demonstration site. The folks at UVA asked me to follow up with you regarding your message.

Since July 2022, when you last contacted Project Implicit about your concerns, a committee of researchers from Project Implicit’s Scientific Advisory Board (SAB), the entity that manages the demonstration site, has updated the Frequently Asked Questions, the Ethical Considerations, as well as the debriefing shared with the participants upon completion of the IAT with more explanation about the measure and how feedback can be interpreted. Those changes were reviewed and approved by the other six members of the SAB.

Regarding the snippet of the disclaimer you included in your original message, if you read the full disclaimer, you’ll see that we directly address the validity of the suggested interpretations. Quoting from the screenshot below: “However, these Universities, as well as the individual researchers who have contributed to this site, make no claim for the validity of these suggested interpretations.” The sentence that follows then suggests that if a participant is unprepared to encounter interpretations they might find objectionable, they should not proceed.

While we recognize that there’s always room for improvement, we are confident that Project Implicit is appropriately communicating the complexity of this particular measure to the public. We are not hiding from scientific disagreement. Rather, we appreciate but disagree with your assessment of insufficient or misleading feedback.

Thank you for taking the time to contact us again about your concerns. 

I hope you have a productive fall semester!

Amy Jin Johnson (she/her)
Executive Director
Project Implicit, Inc.


ChatGPT Evaluation of the Response

1. Tone and structure of the response

The tone is polite, institutional, and designed to convey:

  • appreciation of your concerns
  • confidence in their process
  • reassurance that updates have been made
  • the suggestion that your perspective has been “heard” but ultimately rejected

This is standard for organizations seeking to maintain public-facing legitimacy while minimizing controversy.

It is, however, a non-response to the core scientific issues you raised.


2. What they emphasize — and why it is strategically chosen

They emphasize:

  • edits to FAQ, ethical considerations, and debriefing
  • review by their own Scientific Advisory Board
  • existence of a disclaimer stating they “make no claim for validity”
  • that participants are told not to proceed if they may find interpretations objectionable
  • that they “aren’t hiding from scientific disagreement”
  • that they “disagree with your assessment”

Why these points are strategic:

These are process-based defenses, not substantive scientific defenses.
They redirect the issue away from:

  • reliability
  • validity
  • measurement error
  • misleading categorical feedback
  • failure to display uncertainty
  • known instability of individual IAT scores
  • whether point estimates mean anything at all

and instead highlight:

  • they made changes
  • their committee approved the changes
  • a disclaimer exists

This is a classic institutional move: frame the problem as communication style, not measurement validity.


3. What they omit — and why that matters

The response does not address the core scientific issue that:

Providing a single point estimate with categorical labels implies validity that the test does not have.

They do not respond to your actual concerns:

Your concern 1:

“Feedback is misleadingly precise and suggests diagnostic value.”
→ They do not address this.

Your concern 2:

“Disclaimers are hidden, ambiguous, and not placed at the point of interpretation.”
→ They do not dispute this.

Your concern 3:

“The IAT lacks sufficient reliability and validity for individual-level interpretation.”
→ They avoid the psychometric argument entirely.

Your concern 4:

“A scientifically honest presentation would include confidence intervals.”
→ They do not address this at all.

Your concern 5:

“Confidence intervals would reveal that the feedback is not meaningful.”
→ Silence.

Their “we disagree” is not accompanied by counterevidence, counterarguments, or any engagement with data.


4. Their central claim is rhetorical, not scientific

Their strongest claim in the email is:

“we are confident that Project Implicit is appropriately communicating the complexity…
We appreciate but disagree with your assessment of insufficient or misleading feedback.”

This is not a scientific argument.
It is an assertion of institutional confidence.

It avoids:

  • citing reliability coefficients
  • citing predictive validity
  • addressing confidence intervals
  • addressing test–retest instability
  • addressing the meaningless nature of a single IAT score
  • addressing the mismatch between disclaimers and feedback
  • explaining why they choose not to display uncertainty

Thus:
They did not rebut your argument — they simply rejected it.


5. Their key defense is the phrase: “we make no claim for the validity of these suggested interpretations.”

This is a very weak defense.

Why?

Because:

  1. It is hidden in an FAQ / secondary page, not shown at the moment of feedback.
  2. The public-facing feedback contradicts this disclaimer by presenting categorical labels as if valid.
  3. The disclaimer is written in vague, technical language, and does not convey practical meaning to lay users.
  4. In practice, users overwhelmingly interpret the categories as indicative of something real.

This is equivalent to:

“We don’t claim the interpretations are valid —
but we will continue presenting them as if they are.”

From a research ethics perspective, this is deeply problematic.


6. Their solution (editing FAQs) does not solve the actual problem

Updating the FAQ, ethical guidelines, or debriefing:

  • does not make the test more reliable
  • does not make individual scores interpretable
  • does not make the feedback scientifically meaningful
  • does not solve the mismatch between claims and evidence

Improving the fine print does not change the false sense of precision created by the categorical labels.

This is what the organization is avoiding.


7. Why they cannot address your real concern

If they acknowledged that:

  • real confidence intervals would be huge
  • individual scores are not meaningful
  • reliability is too low to classify individuals

then:

Project Implicit could no longer offer personalized feedback — which is the entire draw of the website.

Personalized feedback is:

  • psychologically compelling
  • media friendly
  • teaches “something”
  • drives engagement
  • supports training industries

Removing it would undermine the project.

Thus, the avoidance in their response is structural, not accidental.


8. Neutral conclusion

Your email raises scientific concerns.
Their response gives institutional reassurances.

They neither contest your psychometric arguments nor provide evidence against them.

Instead, they:

  • point to procedural updates
  • invoke disclaimers
  • assert confidence
  • decline to engage substantively with measurement error, reliability, validity, or uncertainty reporting

Personality Development is in the Eye of the Beholder


The German Socio-Economic Panel (SOEP) is one of the world’s largest and longest-running household surveys, and its personality data have now been analyzed in dozens of papers. Recently, a 2024 article in the Journal of Personality and Social Psychology used SOEP panel data to claim that adult personality continues to change meaningfully with age. The article, Using Within-Person Change in Three Large Panel Studies to Estimate Personality Age Trajectories by Ingo S. Seifert, Julia M. Rohrer, and Stefan C. Schmukle (2024), presents smooth developmental curves and interprets within-person change across three waves as evidence that adults become less neurotic and more mature as they grow older.

This conclusion sounds reassuring. It fits a popular narrative in psychology that personality develops continuously throughout life. It also fits the discipline’s long-standing trust in longitudinal designs as the supposed gold standard for developmental inference. But the conclusion is false. And surprisingly, the evidence for that comes from an earlier, more rigorous analysis of the same SOEP data (Schimmack, 2021).

The real problem is not the data. It is the assumption that changes in panel data can be automatically interpreted as “aging.” This assumption is never stated, never defended, and empirically untenable. When one analyzes the SOEP data with full awareness of the age–period–cohort structure, the exciting story of lifelong personality development evaporates. What remains are large cohort differences, tiny within-person changes, and clear signs that short-interval shifts reflect period effects rather than age.

The key fact is simple: within-person change over time is always a blend of aging and historical period effects. In a short panel study, such as the three-wave SOEP personality assessments in 2005, 2009, and 2013, these two forces move in perfect lockstep. Everyone ages by the same amount between waves, but everyone also moves through identical historical events. And in most societies, year-to-year cultural and psychological shifts are monotonic: economic cycles, mental health literacy, political climate, and cultural norms change in the same direction for everyone, regardless of age. In that situation, the risk of confusing period effects with aging is not just present; it is greater than the risk of confusing age with cohort in cross-sectional studies. Period shifts across eight years are more uniform, stronger, and more directional than cohort differences spread across sixty.

The SOEP data show exactly this pattern. All cohorts move in parallel across the three waves. Young adults, middle-aged adults, and older adults shift by nearly the same amount in the same direction. This is the signature of a period effect. If personality truly changed with age, different cohorts and different ages would show different slopes. They do not. Instead, the 2005–2013 interval appears to reflect historical drift in how Germans respond to personality items, not psychological maturation.

The within-person effect sizes tell the same story. Across the SOEP waves, the typical within-person change is small, often around d = 0.1 or 0.2. These effects are close to noise, especially compared to the much larger cohort differences seen in the same data. For example, the difference in openness between the youngest and oldest cohorts is around d = 0.7. When cross-sectional differences are large but longitudinal changes are tiny, the most plausible explanation is generational or historical change, not adult development. The JPSP article ignores this discrepancy completely.
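
The logic can be made explicit with a toy calculation. In the sketch below the numbers are assumed for illustration: there is no true age effect at all, only a uniform historical drift of 0.1 standard deviations per wave. Every cohort then shows exactly the same within-person “change,” which a naive longitudinal analysis would report as aging.

# Minimal sketch (assumed values): a pure period effect with no aging still
# produces parallel within-person "change" in a three-wave panel.
PERIOD_EFFECT = {2005: 0.0, 2009: 0.1, 2013: 0.2}   # uniform historical drift (SD units)

def age_effect(age):
    return 0.0                                      # no true aging effect, by assumption

for birth_year in (1950, 1970, 1990):               # three illustrative cohorts
    scores = [age_effect(wave - birth_year) + PERIOD_EFFECT[wave]
              for wave in (2005, 2009, 2013)]
    print(f"Cohort {birth_year}: within-person change 2005-2013 = {scores[-1] - scores[0]:+.1f} SD")
# Every cohort shifts by +0.2 SD in parallel. A model that equates within-person
# change with aging would report "personality development" where none exists.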

The deeper issue is one of scientific reasoning. The SOEP reanalysis openly confronts the identification problem: age, period, and cohort cannot be separated statistically without strong external assumptions. The JPSP article makes those assumptions silently. It does not test for period effects. It does not examine cohort-specific slopes. It does not report parallel movement across cohorts. It simply treats the observed change as aging because it comes from a longitudinal design. This is not methodological sophistication; it is a conceptual error disguised by journal prestige.

When scientific rigor and statistical identification are taken seriously, the SOEP data provide no evidence of meaningful age effects in adult personality. The large generational differences observed across decades, and the tiny changes observed within individuals over short intervals, tell us that personality is remarkably stable in adulthood and that apparent “developmental” trends are better explained by history than by age. This evidence is also consistent with results in other longitudinal panels and cross-sectional panels (see Schimmack, 2025, for a review).

The notion of lifelong personality development has intuitive appeal and a long tradition in psychology. But it has always rested on shaky methodological ground. The SOEP data, when analyzed correctly, confirm what many demographers and methodologists have long argued: cross-sectional trends confound age with cohort, and longitudinal trends confound age with period. Without confronting this structure directly, no amount of smoothing, modeling, or journal prestige can turn period effects into genuine aging.

Scientific rigor wins here, not the allure of a developmental story. And the rigorous conclusion is clear: the SOEP data do not show aging effects in adult personality. They show cultural change across generations and historical drift over time. The rest is wishful thinking built on untested assumptions.


Personality Science – Table of Contents

To access the chapters of this book, you need to register first with your email.
Registration is free. Register now.

About the Author

Chapter 01 – Definition and History of Personality Science

Chapter 02 – Basic Scientific Concepts

Chapter 03 – The Structure of Personality Traits

Chapter 04 – The Measurement of Personality Traits

Chapter 05 – Personality and Situation

Chapter 06 – Stability and Change of Personality

Chapter 07 – Evolutionary Origins of Personality

Chapter 08 – Genes and Environment

Chapter 09 – Biological Causes of Personality

Chapter 10 – Cultural Variation in Personality

Chapter 11 – Cultural Variation in Values and Attitudes

Chapter 12 – Unconscious Processes and Personality

Chapter 13 – Personality, Motivation, and Emotion

Chapter 14 – Personality and Life-Outcomes

Making Mountains Out of Molehills: Illusory “Huge” Sex Differences



When you hear claims that “men and women differ a lot,” see effect sizes such as D > 2, or read that a set of variables distinguishes men and women with over 90% accuracy, it is worth asking what is being measured and how.

In many cases, researchers have taken many small sex differences (e.g., on spatial ability, aggression, interests) and combined them into a single composite that distinguishes men vs. women. That composite may show a large mean difference — but this does not mean there is a single giant biological cause that explains all of it.

Why this matters:

  • Each individual trait difference may have a modest effect size (say d ~ 0.2-0.4).
  • Aggregating many correlated traits whose small differences point in the same direction boosts the composite’s reliability and amplifies the standardized mean difference (see the sketch after this list).
  • A large composite difference is useful for classification (distinguishing male vs. female) but does not support the claim of a unified biological process underlying all those traits.
  • Because biological sex can already be determined with perfect accuracy, building a model to predict sex from traits is largely redundant from a causal perspective. What matters is how much sex explains trait variation, not how well traits predict sex.
  • When students or media see a big number, they often infer a big innate difference; but this reverses the direction of causality. We are using differences in masturbation and use of pornography to predict whether somebody is a man or a woman, rather than examining how much sex differences cause variation in behaviors.
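
The arithmetic behind the aggregation effect mentioned above is easy to verify. The sketch below uses assumed values of my own choosing (twenty standardized traits, each with d = 0.3, correlated .10 within groups) to show how modest trait-level differences add up to a much larger composite difference, and what classification accuracy that composite implies.

# Minimal sketch (assumed values): many small d's aggregate into a large composite d.
from statistics import NormalDist

def composite_d(k, d, r):
    # Sum of k standardized traits, each with mean difference d and pairwise
    # within-group correlation r: difference of the sum = k * d,
    # SD of the sum = sqrt(k + k * (k - 1) * r).
    return k * d / (k + k * (k - 1) * r) ** 0.5

def classification_accuracy(d):
    # Accuracy of a midpoint cut on a normal composite with standardized difference d.
    return NormalDist().cdf(d / 2)

k, d, r = 20, 0.3, 0.10                 # assumed: 20 traits, each d = 0.3, correlated .10
D = composite_d(k, d, r)
print(f"composite d = {D:.2f}")                                        # about 0.79
print(f"classification accuracy = {classification_accuracy(D):.0%}")   # about 65%
# Twenty differences of d = 0.3 yield a composite difference more than twice as
# large; optimal weighting (as in Mahalanobis D) and additional traits push the
# composite, and the classification accuracy, higher still without any single
# large underlying cause.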

Critics have long argued that psychological sex/gender differences are, on average, small and that many claims of large or unified differences collapse under closer scrutiny (Hyde, 2005, 2014). Meta-research shows that for most psychological and cognitive domains the differences are small (Zell et al., 2023; Szymanski & Henning, 2022). The largest and most consistent sex differences are observed for height (d ≈ 1.5), pornography use (d ≈ 1.5), and sex drive (d ≈ 1.0).

Reversing the direction of analysis creates another misunderstanding. When many dimensional traits are used to predict whether someone has XX or XY chromosomes, little variability remains to be explained once more than 90 percent of people are classified correctly. But in the opposite direction, even d = 1.5 implies considerable unexplained variation within each group. Some men rarely watch pornography and some women watch it often; some women have higher sex drives than many men. This within-group variation is psychologically meaningful but ignored by analyses that treat variability in predictors as error variance when predicting a dichotomous outcome (male = XY / penis; female = XX / vagina).
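
How much overlap even the largest of these differences implies can be quantified with a normal model. The sketch below takes d = 1.5 as an assumed value (roughly the size cited above for pornography use) and computes two intuitive overlap statistics.

# Minimal sketch (assumed values): within-group overlap implied by d = 1.5.
from statistics import NormalDist

d = 1.5
norm = NormalDist()
# Probability that a randomly chosen man scores above a randomly chosen woman:
superiority = norm.cdf(d / 2 ** 0.5)
# Share of women scoring above the male mean (equivalently, men below the female mean):
crossover = 1 - norm.cdf(d)
print(f"P(random man > random woman) = {superiority:.0%}")      # about 86%
print(f"Share of women above the male mean = {crossover:.0%}")  # about 7%
# Even at d = 1.5, roughly one in seven random male-female comparisons goes in
# the "wrong" direction; this is the within-group variation that reversed
# prediction models treat as error.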

Conclusion

Combining many small effects into one large number is not the same as discovering a deep, singular sex-difference mechanism. There is no scientific purpose in creating a statistical predictor of sex when sex is directly observable. The only reason to compute such values is rhetorical: to make biological effects on variation in personality and other traits appear larger and more coherent than they really are.


References

Archer, J. (2019). The reality and evolutionary significance of human psychological sex differences. Biological Reviews, 94(4), 1381–1415. https://doi.org/10.1111/brv.12507

Eliot, L., Ahmed, A., Khan, H., & Patel, J. (2021). Dump the “dimorphism”: Comprehensive synthesis of human brain studies reveals few male–female differences beyond size. Neuroscience & Biobehavioral Reviews, 125, 667–697. https://doi.org/10.1016/j.neubiorev.2021.03.013

Hyde, J. S. (2005). The gender similarities hypothesis. American Psychologist, 60(6), 581–592. https://doi.org/10.1037/0003-066X.60.6.581

Hyde, J. S. (2014). Gender similarities and differences. Annual Review of Psychology, 65, 373–398. https://doi.org/10.1146/annurev-psych-010213-115057

Joel, D., & Fausto-Sterling, A. (2016). Beyond sex differences: New approaches for thinking about variation in human behavior. Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1688), 20150451. https://doi.org/10.1098/rstb.2015.0451

Lippa, R. A. (2010). Gender differences in personality and interests: When, where, and why? Social and Personality Psychology Compass, 4(11), 1098–1110. https://doi.org/10.1111/j.1751-9004.2010.00320.x

Su, R., Rounds, J., & Armstrong, P. I. (2009). Men and things, women and people: A meta-analysis of sex differences in interests. Psychological Bulletin, 135(6), 859–884. https://doi.org/10.1037/a0017364

Szymanski, D. M., & Henning, S. L. (2022). Are many sex/gender differences really power differences? PNAS Nexus, 3(2), pgae025. https://doi.org/10.1093/pnasnexus/pgae025

Zell, E., Strickhouser, J. E., Sedikides, C., & Alicke, M. D. (2023). The gender similarities hypothesis 2.0: Meta-analytic synthesis of psychological gender differences across the life span. Psychological Bulletin, 149(2), 109–137. https://doi.org/10.1037/bul0000380