Dr. Ulrich Schimmack’s Blog about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the first study using the same sample size and significance criterion (Schimmack, 2017).
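Because an exact replication is independent of the original study, this definition implies that replicability equals the statistical power of the replication study. A small simulation sketch of the definition (illustrative parameters, not from any specific dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the definition: among studies with a significant result,
# how often does an exact replication (same n, same alpha) succeed?
n, d, z_crit = 20, 0.5, 1.96            # per-group n, true effect, critical z
se = np.sqrt(2 / n)                     # standard error of the d estimate
original = rng.normal(d, se, 100_000) / se     # observed z-scores
replication = rng.normal(d, se, 100_000) / se  # independent exact replications
sig = original > z_crit                        # "published" significant results
replicability = np.mean(replication[sig] > z_crit)
```

With these parameters, true power is about .35, and the estimated replicability among significant originals converges to the same value, illustrating that selection for significance does not raise replicability above power.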

BLOGS BY YEAR: 2019, 2018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2020): Z-Curve.2.0 (with R-package) 




  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. Also includes a detailed analysis of time trends in replicability from 2010 to 2018.

2.  Introduction to Z-Curve with R-Code

This post presented the first replicability ranking and explains the methodology used to estimate the typical power of a significant result published in a journal. The post explains the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores. The method has since been extended to a wider range of z-scores with a model that allows for heterogeneity in power across tests. A description of the new method will be published when extensive simulation studies are completed.
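The conversion step can be sketched in a few lines: any test statistic is first mapped onto its two-sided p-value, which is then mapped onto an absolute z-score. The function below is my own illustration for t-statistics, not the published R code:

```python
from scipy import stats

def t_to_abs_z(t_value, df):
    """Convert a t-statistic into an absolute z-score by mapping its
    two-sided p-value onto the standard normal distribution."""
    p = 2 * stats.t.sf(abs(t_value), df)  # two-sided p-value of the t-test
    return stats.norm.isf(p / 2)          # z-score with the same p-value

# With large df, t and z converge, so t_to_abs_z(2.0, 1000) is close to 2.0;
# with small df, the heavier t tails yield a somewhat smaller z.
```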


3. An Introduction to the R-Index


The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
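The formula can be sketched in a few lines of code (a simplified illustration based on the definition above, not the official implementation; observed power is approximated from absolute z-scores):

```python
import numpy as np
from scipy import stats

def r_index(z_scores, z_crit=1.96):
    """Sketch of the R-Index: median observed power minus inflation,
    where inflation = success rate - median observed power."""
    z = np.abs(np.asarray(z_scores, dtype=float))
    obs_power = stats.norm.sf(z_crit - z)       # observed power of each study
    mop = float(np.median(obs_power))           # median observed power
    success_rate = float(np.mean(z > z_crit))   # share of significant results
    inflation = success_rate - mop
    return mop - inflation                      # equivalently 2 * MOP - success rate

# A set of uniformly just-significant results yields a low R-Index,
# signaling that the 100% success rate is inflated by publication bias.
```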


4.  The Test of Insufficient Variance (TIVA)


The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After test results are converted into z-scores, the z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
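The logic of TIVA can be sketched as follows (my own Python illustration of the steps described above; variable names are mine):

```python
import numpy as np
from scipy import stats

def tiva(p_values):
    """Test of Insufficient Variance: convert two-sided p-values into
    absolute z-scores and test whether their variance falls below the
    expected value of 1 with a left-tailed chi-square test."""
    z = stats.norm.isf(np.asarray(p_values, dtype=float) / 2)  # p -> |z|
    k = len(z)
    var_obs = float(np.var(z, ddof=1))          # sample variance of z-scores
    chi2_stat = (k - 1) * var_obs / 1.0         # expected variance is 1
    p_left = stats.chi2.cdf(chi2_stat, df=k - 1)  # left tail: too little variance
    return var_obs, p_left

# Ten just-significant p-values (all between .01 and .05) produce a
# suspiciously small variance of z-scores, and TIVA flags the set as biased.
var_obs, p = tiva([.04, .03, .049, .02, .045, .01, .038, .025, .047, .015])
```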

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.” The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women, and this threat leads to lower performance. This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting. After correcting for these effects, the stereotype-threat effect was negligible. This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype threat. These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.  Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.
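The a-priori power calculation that GPower performs for a two-group comparison can be approximated with the normal distribution (a simplified sketch; GPower itself uses the noncentral t-distribution, so exact values differ slightly):

```python
import numpy as np
from scipy import stats

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sample test of a standardized mean
    difference d, using the normal distribution."""
    ncp = d * np.sqrt(n_per_group / 2)    # noncentrality parameter
    z_crit = stats.norm.isf(alpha / 2)    # critical z, 1.96 for alpha = .05
    # probability of a significant result under the alternative hypothesis
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

# The textbook case: a medium effect (d = .5) with 64 participants per
# group yields roughly 80% power; a small effect (d = .2) with the same
# sample size leaves the type-II error rate far above the conventional 20%.
```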


8.  The Problem with Bayesian Null-Hypothesis Testing


Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that the effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published. Selective publishing of confirmatory results undermines the credibility of science and of claims about the importance of stereotype threat in explaining gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

Why Are Red States “Immune” to Covid-19?

Joey loves crowds. He is boisterous, speaks with a loud and booming voice, and is always ready to high-five everybody. No, I am not describing a super-spreader of Covid-19. This is a textbook description, or caricature, of an extrovert, or, as personality psychologists say, an extravert.

Personality psychologists have studied extraversion and introversion for nearly one hundred years, although most of the research emerged in the past 40 years. We know that extraversion is a heritable trait that runs in families. We know that it remains fairly stable throughout adulthood, and we know that it influences behavior. There has also been research on regional variation in extraversion across the world and across US states (Elleman, Condon, Russin, & Revelle, 2018). I used their data to create the map of extraversion for US states. The map shows the highest level of extraversion in Illinois and the lowest level in Wyoming, followed by Idaho and Utah. While Illinois has fairly high rates of Covid-19, especially in Chicagoland, Wyoming and Idaho have relatively low levels of positive cases. They are also solid “red” states that voted for Trump in the 2016 election with 67% and 59% of the vote, respectively. It is therefore possible that extraversion partially explains why Covid-19 is more prevalent in “blue” (liberal) states. Residents in blue states may be more extraverted and may have a harder time following social distancing rules.

Of course, extraversion would only be one of several factors that play a role. Another obvious factor is that urban areas are more strongly affected by Covid-19 than rural areas, and rural voters are more likely to vote for Trump. There are many other possible differences among the US states that might play a role, but preliminary analysis suggests that they do not predict Covid-19 to a substantial degree. So, to keep things short, I will focus on the two factors that I found to be significant predictors of the spread of Covid-19: urbanization and extraversion.

To examine whether this relationship is stable over time, I used confirmed positive cases reported on the Covid-Tracking website and created indicators for three three-week periods: March 23 to April 12, April 13 to May 3, and May 4 to May 24. Predictor variables were (a) the percentage of votes for Trump in the 2016 election, (b) extraversion scores from the supplement to Elleman et al.’s article (Table 8), and (c) urbanization scores (Wikipedia).

The data were analyzed using structural equation modeling to examine the relationship among the six variables. [I also examined more complex models that included deaths. The effects of the predictor variables on death were mostly mediated by confirmed positives, with the exception of a unique, negative relationship between Trump support and deaths at time 1 only.] Model fit was excellent, CFI = 1.00, RMSEA = .000. This does not mean that the model reveals the truth, but it does show that the model is consistent with the data. Thus, the model tells one possible story about the negative relationship between Trump support and Covid-19 deaths across the states.

The numbers show that urbanization is a much stronger negative predictor of Trump support than extraversion. The effect of extraversion is small and not statistically significant by conventional standards, but there are only 49 states in the analysis (I excluded the island state Hawaii), making it hard to reach statistical significance. The effects of urbanization and extraversion on Covid-19 cases are more equal and explain a notable amount of variation across states. The numbers also show that the effect is not getting weaker over time; it may actually be getting stronger. This means that both urbanization and extraversion are stable predictors of the Covid-19 pandemic in the USA. Even in the past three weeks, after several states with Republican governors eased restrictions, there is no evidence that cases are increasing notably in red states.

It is not difficult to find newspaper articles that talk about a second wave and spikes in new cases in Texas or other red states. These stories are based on the idea that red states are ignoring the danger of Covid-19, but so far this idea lacks empirical support. For every story of a pastor who died of Covid-19 after defying stay-at-home orders, there are thousands of churches that are holding services online, and hundreds of people flouting social-distancing norms in Central Park, NY. Don’t get me wrong. Trump’s disregard of science and ramblings about light therapy are a disgrace, but this does not mean that 40% of the US population follows the covidiot in the White House and drinks bleach. At least forty percent of US voters are likely to vote for him again. Don’t ask me how anybody can vote for him again. That is not the point of this blog post. The blog post is about the empirical fact that so far Covid-19 has predominantly hit states that did not vote for Trump. I suggest that this seemingly paradoxical finding is not paradoxical at all. Joey, the extraverted bachelor who lives in an apartment in New York City and voted for Hillary, is much more likely to get infected than Joyce, who lives with her family on a farm in Wyoming. Painting all Trump voters as covidiots would be a similar mistake as Hillary Clinton calling Trump supporters a “basket of deplorables.” If all Trump supporters were covidiots, we should have seen a positive relationship between Trump support and Covid-19 cases, especially after controlling for confounding variables like urbanization and extraversion. The fact that this positive relationship stubbornly refuses to emerge may suggest that Republican governors and residents in red states are not as stupid as their leader.

The Covid-Statistic Wars are Heating Up

After a general consensus or willingness to accept social distancing measures imposed by politicians (often referred to as lock-downs), societies are polarizing. Some citizens want to open stores, bars, and restaurants (and get a hair cut). Others want to keep social distancing measures in place. Some people on both sides are not interested in scientific arguments for or against their position. Others like to find scientific evidence that seemingly supports their viewpoint. This abuse of science is becoming more common in a polarized world. As a scientist, I am concerned about the weaponizing of science because it undermines the ability of science to inform decisions and to correct false beliefs. Psychological research has shown how easily we assimilate information that matches our beliefs and treat disconfirming evidence like a virus. These motivated biases in human reasoning are very powerful and even scientists themselves are not immune to these biases.

Some economists appear to be afflicted by a bias to focus on the economic consequences of lock-downs and to downplay the effects of the virus itself on human lives and the economy. The idea is that lock-downs did not help to save lives and came at immense costs to the economy. I am not denying the severe consequences of unemployment (I actually co-authored an article on unemployment and well-being), but I am shocked by claims in a tweet, retweeted 3,500 times, that social distancing laws are ineffective, or by blog posts that make similar claims accompanied by scatterplots that give the claims the appearance of scientific credibility.


There is nothing wrong with these graphs. I have examined the relationship between policies and Covid-19 deaths across US states and across countries, and I have also not found a significant correlation. The question is what this finding means. Does it imply that lock-down measures were unnecessary and have produced huge economic costs without any benefits? As some responses on Twitter indicated, interpreting correlational data is not easy because many confounding factors influence the correlation between two variables.

Social distancing is unnecessary if nobody is infected

Let’s go back in time and impose social distancing policies across the world in May 2019 randomly in some countries and not in others. We observe that nobody is dying of Covid-19 in countries with and without ‘lock-down’. In addition, countries with lock-down suffer high rates of unemployment. Clearly, locking countries down without a deadly virus spreading is not a good idea. Even in 2020 some countries were able to contain relatively small outbreaks and are now mostly Covid-free. This is more or less true of countries like Taiwan, Australia, and New Zealand. However, these countries impose severe restrictions on travel to ensure that no new infections are brought into the country. When I tried to book a flight from Toronto to Sydney, I was not able to do so. So, the entire country is pretty much in lock-down to ensure that people in Australia cannot be infected by visitors from countries that have the virus. Would economists argue that these country-wide lock-downs are unnecessary and only hurt the tourist industry?


The fact that Covid-19 spread unevenly across countries also creates a problem for the correlation between social-distancing policies and Covid-19 deaths across countries. The more actively countries are trying to stem the spread of the virus, the more severe social-distancing measures will be, while countries without the virus are able to relax social distancing measures. Not surprisingly, some of the most severe restrictions were imposed at the peak of the epidemics in Italy and Spain. This produces a positive correlation between the severity of lock-downs and the spread of Covid-19, which could be falsely interpreted as evidence that lock-downs even increase the spread of Covid-19. A simple correlation between lock-down measures and Covid-19 deaths across countries is simply unable to tell us anything about the effects of lock-down measures on deaths within countries.

Social Distancing Effects are Invisible if there is no Variation in Social Distancing Across Countries

To examine the effectiveness of social-distancing measures, we need to consider timing. First, social distancing measures may be introduced in response to a pandemic. Later on, we might see that countries or US states that imposed more severe restrictions were able to slow down the spread of the virus more. However, now we encounter a new problem. Most countries and states responded to the WHO’s declaration of Covid-19 as a pandemic on March 11 with very similar policies (school closures). This makes it difficult to see the effects of social distancing measures because we have little variation in the predictor variable. We simply do not have a large group of countries with a Covid-19 epidemic that did nothing. This means we lack a proper control group to see whether the spread in these countries would be bigger than in countries with severe lock-downs. Even countries like the UK closed schools and bars in mid-March.

Sweden is often used as the example of a country that did not impose severe restrictions on citizens and kept schools open. It is difficult to evaluate the outcome of this political decision. Relative to its population, Sweden ranks number 6 in the world in terms of Covid-19 deaths, but what is the proper comparison standard? Italy and Spain had more severe restrictions and more deaths, but their epidemics started earlier than Sweden’s. Other Nordic countries like Norway, Denmark, and Finland have much lower fatality rates than Sweden. This suggests that social distancing is effective in reducing the spread, but we do not have enough data for a rigorous statistical analysis.

Social Distancing Policies Explain Trajectories of Covid-19 Spread in Hot-Spots

One advantage of epidemics is that it is possible to foresee the future, because exponential growth produces a very notable trajectory over time that is hard to miss in statistical analyses. If every individual infects two or three other people, the number of cases will grow exponentially until a fairly large proportion of the population is infected. This is not what happened in Covid-19 hot spots. Let’s examine New York as an example. In mid-March, the number of detected cases and deaths increased exponentially, with numbers doubling every three days.

The number of new cases peaked in the beginning of April and has been decreasing until now. One possible explanation for this pattern is that social-distancing policies that were mandated in mid-March were effective in slowing down the spread of the virus. Anybody who claims that lock-downs are ineffective needs to provide an alternative explanation for the trajectory of Covid-19 cases and deaths over time.

Once more, it is difficult to show empirically what would have happened without “lock-downs”. The reason is that even in countries that did not impose strict rules, people changed their behaviors. Once more, we can use Sweden as a country without ‘lock-down’ laws. As in New York, we see that rapid exponential growth was slowed down. This did not happen while people were living their lives as they did in January 2020. It happened because many Swedes changed their behaviors.

The main conclusion is that the time period from March to May makes it very difficult to examine scientifically which measures were effective in preventing the spread of the virus and which measures were unnecessary. How much does wearing masks help? How many lives are saved by school closures? The best answer is that we do not have clear answers to these important questions, because there was insufficient variation in the response to the pandemic across nations or across US states. Most of the variation in Covid-19 deaths is explained by the connectedness of countries or states to the world.

Easing Restrictions and Covid-19 Cases

The coming months provide a much better opportunity to examine the influence of social distancing policies on the pandemic. Unlike New Zealand and a few other countries, most countries do have community transmission of Covid-19. The United States provides a naturalistic experiment because (a) the country has a large population and thus many new cases each day and (b) social distancing policies are made at the level of the 50 states.

Currently, there are still 20,000 new confirmed (!) positive cases each day in the United States. There are also still over 1,000 deaths per day.

There is also some variation across states in the speed and extent to which states ease restrictions on public life (NYT.05.20). Importantly, there is no state where residents are simply going back to life as it was in January of 2020. Even states like Georgia that have been criticized for opening early are by no means back to business as usual.


So, the question remains whether there is sufficient variance in opening measures to see potential effects in case-numbers across states.

Another problem is that it is tricky to measure changes in case numbers or deaths when states have different starting levels. For example, in the past week New York still recorded 41 deaths per 1 million inhabitants, while Nebraska recorded only 13 deaths per 1 million inhabitants. However, in terms of percentages, cumulative deaths in New York increased by only 3%, whereas the increase in Nebraska was 23%. While a strong ‘first wave’ accounts for the high absolute number in New York, it also accounts for the low percentage value. A better outcome measure may be whether weekly numbers are increasing or decreasing.
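The arithmetic behind this comparison is easy to verify: a weekly count and the percentage increase it represents jointly imply the prior cumulative toll (numbers taken from the text; the function is my own illustration):

```python
def implied_cumulative(weekly_deaths_per_m, pct_increase):
    """Back out the cumulative death toll (per million) implied by a
    weekly count and the percentage increase it represents."""
    return weekly_deaths_per_m / (pct_increase / 100.0)

# New York: 41 weekly deaths/M at a 3% increase implies a prior toll
# near 1,367 deaths/M; Nebraska: 13 weekly deaths/M at a 23% increase
# implies a prior toll near 57 deaths/M. The percentage measure is
# dominated by the size of the first wave, not by the current trend.
```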

Figure 1 shows the increase in Covid-19 deaths in the past 7-days (May 14 – May 20) compared to the 7 days after some states officially eased restrictions (May 2 – May 8).

It is clearly visible that states that are still seeing high numbers of deaths are not easing restrictions (CT, NJ, MA, RI, PA, NY, DE, IL, MD, LA). It is more interesting to compare states that did not see a big first wave but vary in their social distancing policies. For this comparison, I limited the analysis to the remaining states.

States below the regression line are showing faster decreases than other states, whereas states above the regression line show slower decreases or increases. When the opening policies on May 1 (NYT) are used as predictors of deaths in the recent week, with deaths two weeks before as a covariate, a positive relationship emerges, but it is not statistically significant. It would be a statistical fallacy to infer from this finding that policies have no influence on the pandemic.

More important is the effect size, which is likely to be somewhere between -2 and +4 deaths per million. This may seem like a small difference, but we have to keep in mind that there is little variation in the predictor variable. Remember, even in Georgia, where restaurants are open, the number of diners is only 15% of the normal number. The hypothetical question is how much bigger the number of Covid-19 cases would be if restaurants were filled to capacity and all other activities were back to normal. It is unlikely that citizens of open states are willing to participate in this experiment. Thus, data alone simply cannot answer this question.


Empirical sciences rely on data and data analysis. However, data are necessary but not sufficient to turn a graph into science. Science also requires proper interpretation of the results and honest discussion of their limitations. It is true that New York has more Covid-19 deaths than South Dakota. It is also true that some states like South Dakota never imposed severe restrictions. This does not imply that stay-at-home orders in New York caused more Covid-19 deaths. Similarly, the lack of a correlation between Covid-19 policies and Covid-19 cases or deaths across US states does not imply that these policies have no effect. Another explanation is that there are no states that had many deaths and did not impose stay-at-home orders. For this reason, experts have relied on models of epidemics to simulate scenarios of what would have happened if New York City had not closed schools, bars, and night clubs. These simulations suggest that the death toll would have been even greater. The same simulations also suggest that many more lives could have been saved if New York City had been closed down just one week earlier (NPR). Models may sound less scientific than hard data, but data are useless and can be misleading when the necessary information is missing. The social-distancing measures that were imposed world-wide did reduce the death toll, but it is not clear which measures reduced it by how much. The coming months may provide some answers to these questions. S. Korea quickly closed bars after one super spreader infected 40 people in one night (businessinsider). What will happen in Oklahoma, where bars and nightclubs are reopening? Personally, I think the political conflict about lock-downs is unproductive. The energy may be better spent on learning from countries that have been successful in controlling Covid-19 and that are able to ease restrictions.

Reply to Vianello, Bar-Anan, Kurdi, Ratliff, and Cunningham

I published a critique of the Implicit Association Test. Using structural equation modeling of multi-method studies, I find low convergent validity among implicit measures of attitudes (prejudice, self-esteem, political orientation) and low discriminant validity between explicit and implicit measures. The latter finding is reflected in high correlations between factors that reflect the shared variance among explicit measures and the shared variance among implicit measures. Using factor loadings to quantify validity, I found that the controversial race IAT has at most 20% valid variance in capturing racial attitudes. Most if not all of this variance is shared with explicit measures. Thus, there is no evidence that IAT scores reflect a distinct form of implicit prejudice that may operate outside of conscious awareness.

This article elicited commentaries by Vianello and Bar-Anan (ref.) and by Kurdi, Ratliff, and Cunningham (pdf). Here is a draft of my response to their commentaries. As you will see, there is little common ground; even the term “validity” is not clearly defined, making any discussion about the validity of the IAT meaningless. To make progress as a science (or to become a science), psychologists need to have a common understanding of psychological measurement and methods that can be used to evaluate the validity of measures quantitatively.


Just like pre-publication peer-reviews, the two post-publication commentaries have remarkably little overlap. While Vianello and Bar-Anan (VBA) question my statistical analyses, Kurdi, Ratliff, and Cunningham (KRC) accept my statistical results but argue that these results do not challenge the validity of the IAT.

VBA’s critique is clearer and therefore easier to refute by means of objective model comparisons. The key difference between VBA’s model and my model is the modelling of method variance. VBA’s model assumes that all implicit measures of different constructs are influenced by a single method factor. In contrast, my model assumes that implicit measures of prejudice (e.g., the standard race IAT and the Brief Implicit Association Test with the same racial stimuli) share additional method variance. As these are nested models, it is possible to test the competing hypotheses directly against each other. The results show that a model with content-specific method variance fits the data better (Schimmack, 2020a). The standard inference from a model comparison test is that the model with the worse fit is not an adequate model of the data, but VBA ignored the poorer fit of their model and presented a revised model that does not model method variance properly and therefore produces misleading results. Thus, VBA’s commentary is just another demonstration of the power of motivated reasoning that undermines the idealistic notion of a self-correcting science.

KRC ask whether my results imply that the IAT cannot be a valid measure of automatic cognition. To provide a meaningful answer to this question, it is important to define the terms valid, measure, automatic, and cognition. The main problem with KRC’s comment is that these terms remain undefined. Without precise definitions, it is impossible to make scientific progress. This is even true for the concept of validity, which has no clear meaning in psychological measurement (Schimmack, 2020c). KRC ignore that I clearly define validity as the correlation between IAT scores and a latent variable that represents the actual variation in constructs such as attitudes towards race, political parties, and the self. My main finding was that IAT scores have only modest validity (i.e., low correlations with the latent variable or low factor loadings) as measures of racial preferences, no validity as a measure of self-esteem, and no proven validity as measures of implicit constructs that are distinct from the attitudes reflected in self-report measures. Instead, KRC consistently mischaracterize my findings when they write that “the reanalyses reported by Schimmack find high correlations between relatively indirect (automatic) measures of mental content, as indexed by the IAT, and relatively direct (controlled) measures of mental content.” This statement is simply false and confuses correlations of measures with correlations of latent variables. The high correlations between latent factors that represent shared variance among explicit measures and implicit measures provide evidence of low discriminant validity, not evidence of high validity. Moreover, the modest loadings of the race IAT on the implicit race factor show low validity of the IAT as a measure of racial attitudes.

After mischaracterizing my results, KRC go on to claim that my results do “not cast any doubt on the ability of IATs to index attitudes or to do so in an automatic fashion” (p. 5).  However, the low convergent validity among implicit measures remains a problem for any claims that the IAT and other implicit measures measure a common construct with good validity. KRC simply ignore this key finding even though factor loadings provide objective and quantitative information about the construct validity of IAT scores.

The IAT is not the only research instrument with questionable construct validity. However, the IAT is unique because it became a popular measure of individual differences without critical evaluation of its psychometric properties. This is particularly problematic when people are given feedback with IATs on the Project Implicit website, especially for IATs that have demonstrably no validity, like the self-esteem IAT. The developers of the IAT and KRC defend this practice by arguing that taking an IAT can be educational: “At this stage in its development, it is preferable to use the IAT mainly as an educational tool to develop awareness of implicit preferences and stereotypes.” However, it is not clear how a test with invalid results can be educational. How educational would it be to provide individuals with randomly generated feedback about their intelligence? If this sounds unethical, it is not clear why it is acceptable to provide individuals with misleading feedback about their racial attitudes or self-esteem. As a community, psychologists should take a closer look at the practice of providing online feedback with tests that have low validity, because this practice may undermine trust in psychological science.

KRC’s commentary also fails to address important questions about the sources of stability and change in IAT scores over time. KRC suggest that “the jury is still out on whether variation in responding on the IAT mostly reflects individual differences or mostly reflects the effects of the situation” (p. 4). The reason why two decades of research have failed to answer this important question is that social cognition researchers focus on brief laboratory experiments that have little ecological validity and that are unable to demonstrate stability of individual differences over time. However, two longitudinal studies suggest that IAT scores measure stable attitudes rather than context-dependent automatic cognitions. Wil Cunningham, one of the commentators, provided first evidence that variance in IAT scores mostly reflects random measurement error and stable trait variance, with no evidence of situation-specific state variance (Cunningham et al., 2001). Interestingly, KRC ignore the implications of this study. This year, an impressive study examined this question with repeated measurements over a six-year period (Onyeador et al., 2020; Schimmack, 2020). The results confirmed that even over this long time-period, variance in IAT scores mostly reflects measurement error and a stable trait without notable variance due to changes in situations.

Another important topic that I could only mention briefly in my original article is incremental predictive validity. KRC mention Kurdi et al.’s (2019) meta-analysis as evidence that the IAT and self-report measures tap different constructs. They fail to mention that the conclusions of this meta-analysis are undermined by the lack of credible, high-powered studies that could demonstrate incremental predictive validity. To quote Kurdi et al.’s abstract: “most studies were vastly underpowered” (p. 569). The authors conducted tests of publication bias but did not find evidence for it. The reason could be that they used tests that have low power to detect publication bias. Some studies included in the meta-analysis are likely to have reported inflated effect sizes due to selection for significance, especially costly fMRI studies with tiny sample sizes. For example, Phelps et al. (2000) report a correlation of r(12) = .58 between scores on the race IAT and differences in amygdala activation in response to Black and White faces. Even if we assume that 20% of the variance in the IAT is valid, the correlation corrected for invalid variance would be r = .58/√.20 = 1.30. In other words, this correlation is implausible given the low validity of race IAT scores. The correlation is also much stronger than the predictive validity of the IAT in Kurdi et al.’s meta-analysis. The most plausible explanation for this result is that researchers’ degrees of freedom in fMRI studies inflated this correlation (Vul et al., 2009). Consistent with this argument, effect sizes in studies with larger sample sizes are much smaller, and evidence of incremental predictive validity can be elusive, as in Greenwald et al.’s study of the 2018 election. At present, there is no pre-registered, high-powered study that provides clear evidence of incremental predictive validity. Thus, IAT proponents have failed to respond to Blanton et al.’s (2009) critique of the IAT.
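The arithmetic behind the implausible r = 1.30 is the classical correction for attenuation, which divides the observed correlation by the square root of the proportion of valid variance. A minimal sketch (the 20% valid-variance figure is the assumption stated above):

```python
import math

def disattenuate(r_observed: float, valid_variance: float) -> float:
    """Correct an observed correlation for invalid variance in one measure
    (Spearman's correction, applied here only on the IAT side)."""
    return r_observed / math.sqrt(valid_variance)

# Phelps et al.'s r(12) = .58, assuming 20% valid variance in race-IAT scores
r_corrected = disattenuate(0.58, 0.20)
print(round(r_corrected, 2))  # 1.3 -- impossible, since correlations cannot exceed 1
```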
Responses to my renewed criticism suggest that IAT researchers are unable or unwilling to respond to valid scientific criticism of the IAT with active coping. Instead, they prefer to engage in emotion-focused, repressive coping that makes IAT researchers feel better without addressing substantive measurement problems.

In conclusion, my critique of the IAT literature and the response by IAT researchers show a wider problem in psychology that I have called the validation crisis (Schimmack, 2020c). Although measurement is at the core of any empirical science, many psychologists lack formal training in psychological measurement. As a result, they create and use measures of unknown validity. This is particularly true for social psychologists, because the field in the 1970s and 1980s actively rejected the idea that characteristics within individuals are important for understanding human behavior (“the power of the situation”). However, when the cognitive revolution started, the focus shifted from observable situations and behaviors to mental states and processes. Studying phenomena that are not directly observable requires valid measures, just as telescopes need to be validated to observe planets in distant galaxies. The problem is that social cognition researchers developed methods like the IAT to make claims about cognitive processes that are not observable to outsiders or by means of introspection, without taking the time to validate these measures. To make progress, the next generation of social psychologists needs to distinguish clearly between constructs and measures and between random and systematic measurement error. As all measures are contaminated by both sources of measurement error, constructs need to be measured with multiple, independent methods that show convergent validity (Campbell & Fiske, 1959; Cronbach & Meehl, 1955). Psychology also needs to move from empty qualitative statements like “the IAT can be valid” to empirically based statements about the amount of validity of a specific IAT in specific populations in clearly defined situations. This requires a new program of research with larger samples, ecologically valid situations, and meaningful criterion variables.


Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94, 567–582. doi:10.1037/a0014665

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. doi:10.1037/h0040957

Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12(2), 163–170. https://doi.org/10.1111/1467-9280.00328

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., Tomezsko, D., Greenwald, A. G., & Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74(5), 569–586. doi:10.1037/amp0000364

Onyeador, I. N., Wittlin, N. M., Burke, S. E., Dovidio, J. F., Perry, S. P., Hardeman, R. R., … van Ryn, M. (2020). The Value of Interracial Contact for Reducing Anti-Black Bias Among Non-Black Physicians: A Cognitive Habits and Growth Evaluation (CHANGE) Study Report. Psychological Science, 31(1), 18–30. https://doi.org/10.1177/0956797619879139

Phelps, E. A., O’Connor, K. J., Cunningham, W. A., Funayama, E. S., Gatenby, J. C., Gore, J. C., & Banaji, M. R. (2000). Performance on indirect measures of race evaluation predicts amygdala activation. Journal of Cognitive Neuroscience, 12(5), 729–738. doi:10.1162/089892900562552

Schimmack, U. (2020a). Open Communication about the invalidity of the race IAT. https://replicationindex.com/2019/09/15/open-communication-about-the-invalidity-of-the-race-iat/

Schimmack, U. (2020b). Racial bias as a trait. https://replicationindex.com/2019/11/28/racial-bias-as-a-trait/ (retrieved 4/21/20)

Schimmack, U. (2020c). The validation crisis. Meta-Psychology (blog)

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274–290. https://doi.org/10.1111/j.1745-6924.2009.01125.x

Covid-19 behaves like tourists

Many people are wondering about variation in the Covid-19 pandemic across countries. Why (the north of) Italy and not Portugal? How was South Korea able to contain the virus while other countries, despite having more time to prepare, did not? The New York Times published a long article that examined this question, but nobody really knows.

Some of the speculations focus on biological factors that may protect individuals or may make them more vulnerable. However, so far these factors explain a small portion of the variation in death rates. The biggest predictor is the number of people who are infected by the virus. Australia and New Zealand have few deaths because Covid-19 did not spread widely among their populations.

One possible explanation could be how countries responded to the pandemic. Countries like the UK and Sweden may have more deaths because they did not impose lockdowns. The problem with such speculations is that many factors are likely to contribute to the variation, and it is difficult to spot these factors without statistical analyses.

The NYT article mentions that hundreds of studies are underway to look for predictors of variation across nations, but no results are being mentioned. Maybe researchers are cautious.

“Doctors who study infectious diseases around the world say they do not have enough data yet to get a full epidemiological picture, and that gaps in information in many countries make it dangerous to draw conclusions.”

Drawing conclusions is different from exploring data. There is nothing dangerous about exploring patterns in data. Clearly many people are curious and statistical analysis can provide more valuable information than armchair speculations about climate or culture.

As a cross-cultural psychologist, I am familiar with many variables that distinguish nations from each other. The most prominent dimension is individualism: Western cultures tend to be more individualistic than Asian cultures. This might suggest that culture plays a role, because Asian cultures have had fewer Covid-19 deaths. However, individualism, as measured by Hofstede’s dimension, is a weak predictor and did not survive statistical controls. Other, less plausible dimensions also did not predict variation in Covid-19 deaths.

However, one variable that was a predictor was the number of tourists that travel to a country (tourism data).

Tourism reflects how connected a country is with the rest of the world. Australia and New Zealand are not only islands; they are also geographically isolated, which explains why relatively few people visit these otherwise attractive locations. Covid-19 has also spared much of Eastern Europe, and many Eastern European countries rank low on the tourism index.

Additional analyses show that tourism is becoming a weaker predictor over time. The reason is the recent rise of cases and deaths in Latin America. Latin America was relatively unaffected in April, but lately Ecuador and Brazil have seen alarming increases in cases.

The graph also shows that tourism does not explain all of the differences between countries. For example, the UK has far more cases than predicted by the regression line, which may reflect the country’s slow response to the Covid-19 crisis. Sweden is also above the regression line, possibly due to its policy of keeping schools and businesses open. Switzerland is above the line as well; it is a direct neighbor of the north of Italy, where the epidemic in Europe started. Canada is above the regression line but was on it on April 15: Canada acted quickly in the beginning but is now seeing a late increase in deaths in care homes.

In conclusion, these results suggest that timing is a big factor in the current differences across countries. Countries with high death tolls were simply unlucky to be at the center of the pandemic or well connected to it. As the pandemic progresses, this factor will become less important. Some countries, like Austria and (the South of) Germany that were hit early have been able to contain the spread of Covid-19. In other countries, numbers are increasing, but no country is seeing increases as dramatic as in Italy (or New York) where Covid-19 spread before social distancing measures were in place. New factors may predict what will happen in the times of the “new normal” when countries are trying to come out of lock-downs.

I don’t think that publishing these results is dangerous. The results are what they are. It is just important to realize that they do not prove that tourism is the real causal factor; tourism may be correlated with other variables that reflect the real cause. To demonstrate this, we would need to measure these causal factors and show that they predict variation in national death tolls better than tourism does, statistically eliminating the relationship between tourism and Covid-19 deaths. So, this blog post should be seen as a piece of a puzzle rather than the ultimate answer to a riddle.
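The idea of statistically removing tourism’s relationship can be sketched with a first-order partial correlation: if a candidate causal variable z accounts for the tourism–deaths link, the correlation between tourism and deaths should drop toward zero once z is partialled out. The correlations below are hypothetical placeholders, not estimates from the actual country data:

```python
import math

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation between x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# x = tourism, y = Covid-19 deaths, z = a hypothetical 'real cause'
# (e.g., some other measure of global connectedness)
r = partial_corr(r_xy=0.60, r_xz=0.80, r_yz=0.70)
print(round(r, 2))  # 0.09 -- tourism's apparent effect largely disappears
```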

Politics vs. Science: What Drives Opening Decisions in the United States?

The New York Times published a map of the United States showing which states are opening up today, May 1.

I coded these political decisions on a scale from 1 = shut down or restricted to 3 = partial reopening and examined numerous predictor variables that might drive the decision to ease restrictions.

Some predictor variables reflect scientific recommendations such as the rate of testing or the number of deaths or urbanization. Others reflect political and economic factors such as the percentage of Trump supporters in the 2016 election.

The two significant predictors were the number of deaths adjusted for population (on a log-scale) and support for Trump in the 2016 election. The amount of testing that is being carried out in different states was not a predictor.

Another model showed that states that have not been affected by Covid-19 are more likely to open. These are states where the population is more religious, White, and rural.

It was not possible to determine which of these variables drive the effect, because the predictor variables are too highly correlated. This simply shows the big divide between “red,” rural, religious states and “blue,” agnostic, urban states.

A bigger problem than differences between states is probably the difference within states between urban centers and rural areas, where a single statewide policy is unlikely to fit the needs of both populations. A big concern remains that decisions about opening are unrelated to testing, suggesting that some states that are opening do not have sufficient testing to detect new cases that could start a new epidemic.

Covid-19 in Quebec versus Ontario: Beware of Statistical Models

I have been tracking the Covid-19 statistics of Canadian provinces for several weeks (since March 16, to be precise). Initially, Ontario and Quebec were doing relatively well and had similar statistics. Over time, however, case numbers increased, deaths (especially in care homes) rose, and the numbers diverged. The situation in Quebec was getting worse, and recently the number of deaths relative to the population was higher than in the United States. Like many others, I was surprised and concerned when the Premier of Quebec announced plans to open businesses and schools sooner rather than later.

I was even more surprised when I read an article on the CTV website that reported new research claiming that the situation in Quebec and Ontario is similar after taking differences in testing into account.

The researchers base this claim on a statistical model that aims to correct for testing bias and to estimate the true number of infections on the basis of positive test results. Doing so without a representative sample of tests will seem rather dubious to most scientists, so it would be helpful if the researchers could provide some evidence that validates their estimates. A simple validation criterion is the number of deaths: regions with more Covid-19 infections should also have more deaths, everything else being equal. Of course, differences in age structure or infections in care homes can create additional differences in deaths (i.e., case-fatality rates can differ), but as far as I know there are no big differences between Quebec and Ontario in this regard. So, is it plausible to assume that Quebec and Ontario have the same number of infections? I don’t think so.

To adjust for differences in population size, all Covid-19 statistics are expressed per capita. The table shows that Ontario has 1,234 confirmed positive cases per million inhabitants, while Quebec has 3,373 confirmed positive cases per million residents. This is not a trivial difference. There is also no evidence that the higher number in Quebec is due to more testing. While Ontario has increased testing lately, testing remains a problem in Quebec: Ontario has currently tested more (21,865 tests per million) than Quebec (19,471 tests per million). This also means that the positive rate (the percentage of positive tests; positives/tests × 100) is much higher in Quebec than in Ontario. Most important, there are 741 deaths per 10 million residents in Ontario and 2,157 per 10 million in Quebec. That means Quebec has 2.91 times more deaths than Ontario, which matches the difference in cases, where Quebec has 2.73 times more cases. It follows that Ontario and Quebec have similar case-fatality rates of 6.00% and 6.39%, respectively: out of 100 people who test positive, about 6 die of Covid-19.
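The ratios and case-fatality rates in this paragraph can be checked directly from the per-capita figures; a short sketch:

```python
# Per-capita figures reported above
on_cases_per_m, on_deaths_per_10m = 1234, 741    # Ontario
qc_cases_per_m, qc_deaths_per_10m = 3373, 2157   # Quebec

case_ratio = qc_cases_per_m / on_cases_per_m         # ~2.73x more cases in Quebec
death_ratio = qc_deaths_per_10m / on_deaths_per_10m  # ~2.91x more deaths in Quebec

# Case-fatality rate = deaths / cases (cases converted to per 10 million)
cfr_on = on_deaths_per_10m / (on_cases_per_m * 10) * 100  # ~6.00%
cfr_qc = qc_deaths_per_10m / (qc_cases_per_m * 10) * 100  # ~6.39%

print(round(case_ratio, 2), round(death_ratio, 2))  # 2.73 2.91
print(round(cfr_on, 2), round(cfr_qc, 2))           # 6.0 6.39
```

Because deaths and cases rise by nearly the same factor, the case-fatality rates of the two provinces end up almost identical, which is exactly what one would expect if the confirmed case counts are comparable measures of the underlying epidemics.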

In conclusion, there is absolutely no evidence for the claim that the Covid-19 pandemic has affected Ontario and Quebec to the same extent and that differences in testing produce misleading statistics. Rather, case numbers and deaths consistently show that Quebec is affected three times worse than Ontario. As the false claim is based on the Montreal authors’ statistical model, we can only conclude that their model makes unrealistic assumptions. It should not be used to make claims about the severity of Covid-19 in Ontario, Quebec, or anywhere else.