This blog post is a review of a manuscript that hopefully will never be published, but it probably will be. In that case, it is a draft for a PubPeer comment. As the ms. is under review, I cannot share the actual ms., but the review makes clear what the authors are trying to do.
I assume that I was selected as a reviewer for this manuscript because the editor recognized my expertise in this research area. While most of my work on replicability has been published in the form of blog posts, I have also published a few peer-reviewed publications that are relevant to this topic. Most important, I have provided estimates of replicability for social psychology using the most advanced method to do so, z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020), using the extensive coding by Motyl et al. (2017) (see Schimmack, 2020). I was surprised that this work was not mentioned.
In contrast, Yeager et al.’s (2019) replication study of 12 experiments is cited, and as I recall, 11 of the 12 studies replicated successfully. So, it is not clear why this study is cited as evidence of replication attempts often “producing pessimistic results.”
While I agree that there are many explanations that have been offered for replication failures, I do not agree that listing all of these explanations is impossible and that it is reasonable to focus on some of these explanations, especially if the main reason is left out. Namely, the main reason for replication failures is that original studies are conducted with low statistical power and only those that achieve significance are published (Sterling et al., 1995; Schimmack, 2020). Omitting this explanation undermines the contribution of this article.
The listed explanations are
(1) original articles making use of questionable research practices that result in Type I errors
This explanation conflates two problems. QRPs are used to get significance when power to do so is low, but we do not know whether the population effect size is zero (type-I error) or above zero (type-II error).
(2) original research’s pursuit of counterintuitive findings that may have lower a priori probabilities and thus poor chances at replication
This explanation assumes that there are a lot of type-I errors, but we don’t really know whether the population effect size is zero or not. So, this is not a separate explanation, but rather an explanation of why we might have many type-I errors, assuming that we do have many type-I errors, which we do not know.
(3) the presence of unexamined moderators that produce differences between original and replication research (Dijksterhuis, 2014; Simons et al., 2017),
This citation ignores that empirical tests of this hypothesis have failed to provide evidence for it (van Bavel et al., 2016).
(4) specific design choices in original or replication research that produce different conclusions (Bouwmeester et al., 2017; Luttrell et al., 2017; Noah et al., 2018).
This argument is not different from (3). Replication failures are attributed to moderating factors that are always possible because exact replications are impossible.
To date, discussions of possible explanations for poor replication have generally been presented as distinct accounts for poor replication, with little attempt being made to organize them into a coherent conceptual framework.
This claim ignores my detailed discussion of the various explanations including some not discussed by the authors (Schooler decline effect; Fiedler, regression to the mean; Schimmack, 2020).
The selection of journals is questionable. Psychological Science is not a general (meta)-psychological journal. Instead, there are two journals, The Journal of General Psychology and Meta-Psychology, that contain relevant articles.
The authors then introduce Cook and Campbell’s typology of validity and try to relate it to accounts of replication failures based on some work by Fabrigar et al. (2020). This attempt is flawed because validity is a broader construct than replicability or reliability. Measures can be reliable and correlations can be replicable even if the conclusions drawn from these findings are invalid. This is Intro Psych level stuff.
Statistical conclusion validity is concerned with the question of “whether or not two or more variables are related.” This is of course nothing else than the distinction between true and false conclusions based on significant or non-significant results. As noted above, even statistical conclusion validity is not directly related to replication failures because replication failures do not tell us whether the population effect size is zero or not. Yet, we might argue that there is a risk of false positive conclusions when statistical significance is achieved with QRPs and these results do not replicate. So, in some sense statistical conclusion validity is tied to the replication crisis in experimental social psychology.
Internal validity is about the problem of inferring causality from correlations. This issue has nothing to do with the replication crisis because replication failures can occur in experiments and correlational studies. The only indirect link to internal validity is that experimental social psychology prided itself on the use of between-subject experiments to maximize internal validity and minimize demand effects, but often used ineffective manipulations (priming) that required QRPs to get significance especially in the tiny samples that were used because experiments are more time-consuming and labor intensive. In contrast, survey studies often are more replicable because they have larger samples. But the key point remains, it would be absurd to explain replication failures directly as a function of low internal validity.
Construct validity is falsely described as “the degree to which the operationalizations used in the research effectively capture their intended constructs.” The problem here is the term operationalization. Once a construct is operationalized with some procedure, it is defined by the procedure (intelligence is what the IQ test measures) and there is no way to challenge the validity of the construct. In contrast, measurement implies that constructs exist independent of one specific procedure and it is possible to examine how well a measure reflects variation in the construct (Cronbach & Meehl, 1955). That said, there is no relationship between construct validity and replicability because systematic measurement error can produce spurious correlations between measures in correlational studies that are highly replicable (e.g., socially desirable responding). In experiments, systematic measurement error will attenuate effect sizes, but it will do so equally in original studies and replication studies. Thus, low construct validity also provides no explanation for replication failures.
External validity is defined as “the degree to which an effect generalizes to different populations and contexts.” This validation criterion is also only slightly related to replication failures when there are concerns about contextual sensitivity or hidden moderators. A replication study in a different population or context might fail because the population effect size varies across populations or contexts. While this is possible, there is little evidence that contextual sensitivity is a major factor.
In short, it is a red herring in explanations for replication failures or the replication crisis to talk about validity. Replicability is necessary but not sufficient for good science.
It is therefore not surprising that the authors found most discussions of replication failures focus on statistical conclusion validity. Any other finding would make no sense. It is just not clear why we needed a text analysis to reveal this.
However, the authors seem to be unable to realize that the other types of validity are not related to replication failures when they write “What does this study add? Identifies that statistical conclusion validity is over-emphasized in replication analysis”
Over-emphasized??? This is an absurd conclusion based on a failure to make a clear distinction between replicability/reliability and validity.
Social psychology has an open secret. For decades, social psychologists conducted experiments with low statistical power (i.e., even if the predicted effect is real, their study could not detect it with p < .05), but their journals were filled with significant (p < .05) results. To achieve significant results, social psychologists used so-called questionable research practices that most lay people or undergraduate students consider to be unethical. The consequences of these shady practices became apparent in the past decade when influential results could not be replicated. The famous reproducibility project estimated that only 25% of published significant results are replicable. Most undergraduate students who learn about this fact are shocked and worry about the credibility of results in their social psychology textbooks.
Today, there are two types of social psychologists. Some are actively trying to improve the credibility of social psychology by adopting open science practices such as preregistration of hypotheses, sharing open data, and publishing non-significant results rather than hiding these findings. However, other social psychologists are actively trying to deflect criticism. Unfortunately, it can be difficult for lay people, journalists, or undergraduate students to make sense of articles that make seemingly valid arguments, but only serve the purpose of protecting the image of social psychology as a science.
As somebody who has followed the replication crisis in social psychology for the past decade, I can provide some helpful information. In this blog post, I want to point out that Duane T. Wegener and Leandre R. Fabrigar have made numerous false arguments against critics of social psychology, and that their latest article “Evaluating Research in Personality and Social Psychology: Considerations of Statistical Power and Concerns About False Findings” ignores the replication crisis in social psychology and the core problem of selectively publishing significant results from underpowered studies.
The key point of their article is that “statistical power should be de-emphasized in comparison to current uses in research evaluation” (p. 1105).
To understand why this is a strange recommendation, it is important to understand that power is simply the probability of producing evidence for an effect, when an effect exists. When the criterion for evidence is a p-value below .05, it means the probability of obtaining this desired outcome. One advantage of high power is that researchers get the correct result. In contrast, a study with low power is likely to produce the wrong result called a type-II error. While the study tested a correct hypothesis, the results fail to provide sufficient support for it. As these failures can be due to many reasons (low power or the theory is wrong), they are difficult to interpret and to publish. Often these studies remain unpublished, the published record is biased, and resources were wasted. Thus, high power is a researcher’s friend. To make a comparison, if you could gamble on a slot machine with a 20% chance of winning or an 80% chance of winning, which machine would you pick? The answer is simple. Everybody would rather want to win. The problem is only that researchers have to invest more resources in a single study to increase power. They may not have enough money or time to do so. So, they are more like desperate gamblers. You need a publication, you don’t have enough resources for a well-powered study, so you do a low powered study and hope for the best. Of course, many desperate gamblers lose and are then even more desperate. That is where the analogy ends. Unlike gamblers in a casino, researchers are their own dealers and can use a number of tricks to get the desired outcome (Simmons et al., 2011). Suddenly, a study with only 20% power (chance of winning honestly) can have a chance of winning of 80% or more.
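To make the definition of power concrete, here is a minimal Python sketch. It is not taken from Wegener and Fabrigar's article or from any power-analysis package. It assumes a two-sided z-test as a normal approximation, and the function names and the example effect size (d = 0.4 with 20 participants per group) are hypothetical choices for illustration only.

```python
from statistics import NormalDist


def power_two_sided_z(ncp: float, alpha: float = 0.05) -> float:
    """Probability of a significant two-sided z-test when the test
    statistic is normally distributed with mean `ncp` and SD 1."""
    std = NormalDist()
    crit = std.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    # Significant if the statistic lands above +crit or below -crit.
    return (1 - std.cdf(crit - ncp)) + std.cdf(-crit - ncp)


def ncp_two_groups(d: float, n_per_group: int) -> float:
    """Approximate expected z-value for a two-group comparison with
    standardized mean difference `d` (normal approximation, assumed)."""
    return d * (n_per_group / 2) ** 0.5


# Hypothetical example: a moderate effect studied with a small sample.
small_study = power_two_sided_z(ncp_two_groups(0.4, 20))
large_study = power_two_sided_z(ncp_two_groups(0.4, 100))
print(f"power with n = 20 per group:  {small_study:.2f}")
print(f"power with n = 100 per group: {large_study:.2f}")
```

Under these assumptions, the small study has roughly a one-in-four chance of producing a significant result, even though the effect is real. The only honest way to improve the odds is a larger sample.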
This brings us to the second advantage of high-powered studies. Power determines the outcome of a close replication study. If a researcher conducted a study with 20% power and found some tricks to get significance, the probability of replicating the result honestly is again only 20%. Many unsuspecting graduate students have wasted precious years trying to build on studies that they were not able to replicate. Unless they quickly learned the dark art of obtaining significant results with low power, they did not have a competitive CV to get a job. Thus, selective publishing of underpowered studies is demoralizing and rewards cheating.
None of this is a concern for Wegener and Fabrigar, who do not cite influential articles about the use of questionable research practices (John et al., 2012) or my own work that uses estimates of observed power to reveal those practices (Schimmack, 2012; see also Francis, 2012). Instead, they suggest that “problems with the overuse of power arise when the pre-study concept of power is used retrospectively to evaluate completed research” (p. 1115). The only problem that arises from estimating actual power of completed studies, however, is the detection of questionable practices that produce more reported significant results (often 100%) than one would expect given the low power to do so. Of course, for researchers who want to use QRPs to produce inflated evidence for their theories, this is a problem. However, for consumers of research, the detection of questionable results is desirable so that they can ignore this evidence in favor of honestly reported results based on properly powered studies.
The bulk of Wegener and Fabrigar’s article discusses the relationship between power and the probability of false positive results. A false positive result occurs when a statistically significant result is obtained in the absence of a real effect. The standard criterion of statistical significance, p < .05, implies that a researcher who tests 100 false hypotheses without a real effect is expected to obtain 95 non-significant results and 5 false positive results. This may sound sufficient to keep false positive results at a low level. However, the false positive risk is a conditional probability based on a significant result. If a researcher conducts 100 studies, obtains 5 significant results, and interprets these results as real effects, the researcher has a false positive rate of 100% because 5 significant results are expected by chance alone. An honest researcher would conclude from a series of studies with only 5 out of 100 significant results that they found no evidence for a real effect.
Now let’s consider a researcher who conducted 100 studies and obtained 24 significant results. As 24 is a lot more than the 5 significant results expected by chance alone, the researcher can conclude that at least some of the 24 significant results are caused by real effects. However, it is also possible that some of these results are false positives. Soric (1989), not cited by Wegener and Fabrigar, derived a simple formula to estimate the false discovery risk. The formula makes the assumption that studies of real effects have 100% power to detect a real effect. As a result, there are zero studies that fail to provide evidence for a real effect. This assumption makes it possible to estimate the maximum percentage of false positive results.
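Soric's upper bound is simple enough to write down in a few lines. The following Python sketch uses the usual form of the formula, (1/DR - 1) * alpha / (1 - alpha), where DR is the discovery rate (the proportion of significant results). The function name is my own choice for illustration, and the cap at 100% is added only to keep the result a valid proportion when the discovery rate is at or below alpha.

```python
def soric_max_fdr(discovery_rate: float, alpha: float = 0.05) -> float:
    """Upper bound on the false discovery risk for a given discovery rate,
    assuming studies of real effects have 100% power (Soric, 1989)."""
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    bound = (1 / discovery_rate - 1) * alpha / (1 - alpha)
    # With a discovery rate at or below alpha, every hit may be a false positive.
    return min(1.0, bound)


# 24 significant results in 100 studies, as in the example above.
print(f"{soric_max_fdr(0.24):.1%}")
# 5 significant results in 100 studies: nothing beyond chance.
print(f"{soric_max_fdr(0.05):.1%}")
```

With 24 significant results out of 100, the bound works out to 4 / 24, and with only 5 out of 100 it reaches 100%, in line with the two cases discussed in the text.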
In this simple example, 20 studies tested real effects with 100% power and 80 studies tested false hypotheses, of which 5% (4 studies) produced significant results. Thus, we have 4 false positive results and 20 studies with evidence for a real effect, and the false positive risk is 4 / 24 = 17%. While 17% is a lot more than 5%, it is still pretty low and doesn’t warrant claims that “most published results are false” (Ioannidis, 2005). Yet, it is also not very reassuring if 17% of published results might be false positives (e.g., 17% of cancer treatments actually do not work). Moreover, based on a single study, we do not know which of the 24 results are true and which are false. With a probability of 17% (1/6), trusting a result is like playing Russian roulette. The solution to this problem is to conduct a replication study. In our example, the 20 true effects will produce significant results again because they were obtained with 100% power to do so. However, each false positive result has only a 5% chance of being significant again, and the chance that a study of a false hypothesis produces a significant result twice in a row is only 5/100 * 5/100 = 25 / 10,000 = 0.25%. So, with high-powered studies, a single replication study can separate true and false original findings.
Things look different in a world with low powered studies. Let’s assume that studies have only 25% power to produce a significant result, which is in accordance with the success rate in replication studies in social psychology (Open Science Collaboration, 2015).
In this scenario, 80 studies test real effects with 25% power and produce 20 significant results, whereas 20 studies test false hypotheses and produce 1 significant result. Thus, there is only 1 false positive result and the false positive risk is only 1 out of 21, ~ 5%. Of course, researchers do not know this and have to wonder whether some of the 21 significant results are false positives. When they conduct a replication study, only about 5 (25/100 * 20) of their 20 true findings replicate. Thus, a single replication study does not help to distinguish true and false findings. This leads to confusion and the need for additional studies to separate true and false findings, but low power will produce inconsistent results again and again. The consequences can be seen in the actual literature in social psychology. Many literatures are a selected set of inconsistent results that do not advance theories.
In sum, high powered studies quickly separate true and false findings, whereas low powered studies produce inconsistent results that make it difficult to separate true and false findings (Maxwell, 2004, not cited by Wegener & Fabrigar).
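The two scenarios can be compared side by side with a few lines of arithmetic. The Python sketch below computes expected counts only (no simulation). The split into true and false hypotheses (20/80 in the high-power world, 80/20 in the low-power world) is an assumption reconstructed from the counts of true and false positives given in the text, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    n_true: int    # studies testing a real effect (assumed split)
    n_false: int   # studies testing a false hypothesis (assumed split)
    power: float   # probability of significance when the effect is real
    alpha: float = 0.05


def summarize(s: Scenario) -> dict:
    """Expected outcomes of the original studies and of one close
    replication of every significant original result."""
    true_pos = s.n_true * s.power
    false_pos = s.n_false * s.alpha
    return {
        "significant": true_pos + false_pos,
        "false_positive_risk": false_pos / (true_pos + false_pos),
        # A replication succeeds with probability `power` for real effects
        # and with probability `alpha` for false positives.
        "true_pos_replicated": true_pos * s.power,
        "false_pos_replicated": false_pos * s.alpha,
    }


high = summarize(Scenario(n_true=20, n_false=80, power=1.0))
low = summarize(Scenario(n_true=80, n_false=20, power=0.25))
for label, result in (("high power", high), ("low power", low)):
    print(label, {k: round(v, 3) for k, v in result.items()})
```

In the high-power world, all true findings are expected to replicate and almost no false positives do, so a replication outcome is informative. In the low-power world, most true findings fail to replicate as well, so a failed replication says little about whether the original result was a false positive.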
Actions speak louder than Words
Over the past decade, my collaborators and I have developed powerful statistical tools to estimate the power of studies that were conducted (Bartos & Schimmack, 2021; Brunner & Schimmack, 2022; Schimmack, 2012). In combination with Soric’s (1989) formula, estimates of actual power can also be used to estimate the real false positive risk. Below, I show some results when this method is applied to social psychology. I focus on the journal Personality and Social Psychology Bulletin for several reasons. First, Wegener and Fabrigar were co-editors of this journal right after concerns about questionable research practices and low power became a highly discussed topic and some journal editors changed policies to increase replicability of published results (e.g., Stephen Lindsay at Psychological Science). Examining the power of studies published in PSPB when Wegener and Fabrigar were editors provides objective evidence about their actions in response to concerns about replication failures in social psychology. Another reason to focus on PSPB is that Wegener and Fabrigar published their defense of low powered research in this journal, suggesting a favorable attitude towards their position by the current editors. We can therefore examine whether the current editors changed standards or not. Finally, PSPB was edited from 2017 to 2021 by Chris Crandall, who has been a vocal defender of results obtained with questionable research practices on social media.
Let’s start with the years before concerns about replication failures became openly discussed. I focus on the years 2000 to 2012.
Figure 1 shows a z-curve plot of automatically extracted statistical results published in PSPB from 2000 to 2012. All statistical results are converted into z-scores. A z-curve plot is a histogram that shows the distribution of z-scores. One important aspect of a z-curve plot is the percentage of significant results. All z-scores greater than 1.96 (the solid vertical red line) are statistically significant with p < .05 (two-sided). Visual inspection shows a lot more significant results than non-significant results. More precisely, the percentage of significant results (i.e., the Observed Discovery Rate, ODR) is 71%.
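For readers who want to see what the conversion to z-scores looks like, here is a minimal Python sketch. It is not the extraction code behind the figure. It assumes that each result has already been reduced to a two-sided p-value, and the p-values in the example are made up for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_to_z(p_two_sided: float) -> float:
    """Absolute z-score that corresponds to a two-sided p-value.
    The p-value must be strictly between 0 and 1."""
    return STD_NORMAL.inv_cdf(1 - p_two_sided / 2)


def observed_discovery_rate(z_scores, alpha: float = 0.05) -> float:
    """Share of results that are significant at the two-sided alpha level."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z = [abs(v) for v in z_scores]
    return sum(v > crit for v in z) / len(z)


# Hypothetical p-values, not data from PSPB.
example_p = [0.001, 0.02, 0.04, 0.06, 0.30]
example_z = [p_to_z(p) for p in example_p]
print([round(v, 2) for v in example_z])
print(observed_discovery_rate(example_z))
```

The histogram in a z-curve plot is simply the distribution of these converted values, and the observed discovery rate is the share of them that falls to the right of the significance line.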
Visual inspection of the histogram also shows a strange shape to the distribution of z-scores. While the peak of the distribution is at the point of significance, the shape of the distribution shows a rather steep drop of z-scores just below z = 1.96. Moreover, some of these z-scores are still used to claim support for a hypothesis, often called marginally significant. Only z-scores below 1.65 (p > .10, two-sided, or p > .05, one-sided; the dotted red line) are usually interpreted as non-significant results. The distribution shows that these results are less likely to be reported. This wonky distribution of z-scores suggests that questionable research practices were used.
Z-curve analysis makes it possible to estimate statistical power based on the distribution of statistically significant results only. Without going into the details of this validated method, the results suggest that the power of studies (i.e., the expected discovery rate, EDR) would only produce 23% significant results. Thus, the actual percentage of 71% significant results is inflated by questionable practices. Moreover, the 23% estimate is consistent with the fact that only 25% of unbiased replication studies produce a significant result (Open Science Collaboration, 2015). With 23% significant results, Soric’s formula yields a false positive risk of 18%. That means, roughly 1 out of 5 published results could be a false positive result.
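One crude way to get a feeling for the gap between 71% and 23% is to ask how many non-significant results would have to go unreported to produce it. The Python sketch below is a back-of-the-envelope calculation, not part of z-curve. It assumes that every significant result is reported and that selective reporting is the only source of inflation, which ignores other questionable practices, and the function name is hypothetical.

```python
def share_of_nonsignificant_missing(odr: float, edr: float) -> float:
    """Fraction of non-significant results that would have to be missing
    for an expected discovery rate `edr` to appear as an observed
    discovery rate `odr`, if all significant results are reported."""
    if not (0 < edr <= odr < 1):
        raise ValueError("need 0 < edr <= odr < 1")
    # Per study conducted: `edr` significant, `1 - edr` non-significant.
    # The reported non-significant share x solves edr / (edr + x) = odr.
    reported_nonsig = edr * (1 - odr) / odr
    return 1 - reported_nonsig / (1 - edr)


missing = share_of_nonsignificant_missing(odr=0.71, edr=0.23)
print(f"{missing:.0%} of non-significant results would be missing")
```

Under these simplifying assumptions, close to nine out of ten non-significant results would have to be absent from the published record to turn 23% into 71%.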
In sum, while Wegener and Fabrigar do not mention replication failures and questionable research practices, the present results confirm the explanation of replication failures in social psychology as a consequence of using questionable research practices to inflate the success rate of studies with low power (Schimmack, 2020).
Figure 2 shows the z-curve plot for results published during Wegener and Fabrigar’s reign as editors. The results are easily summarized. There is no significant change. Social psychologists continued to publish ~70% significant results with only 20% power to do so. Wegener and Fabrigar might argue that there was not enough time to change practices in response to concerns about questionable practices. However, their 2022 article provides an alternative explanation. They do not consider it a problem when researchers conduct underpowered studies. Rather, the problem is when researchers like me estimate the actual power of studies and reveal the massive use of questionable practices.
The next figure shows the results for Chris Crandall’s years as editor. While the percentage of significant results remained at 70%, power to produce these results increased to 32%. However, there is uncertainty about this increase and the lower limit of the 95%CI is still only 21%. Even if there was an increase, it would not imply that Chris Crandall caused this increase. A more plausible explanation is that some social psychologists changed their research practices and some of this research was published in PSPB. In other words, Chris Crandall and his editorial team did not discriminate against studies with improved power.
It is too early to evaluate the new editorial team led by Michael D. Robinson, but for the sake of completeness, I am also posting the results for the last two years. The results show a further increase in power to 48%. Even the lower limit of the confidence interval is now 36%. Thus, even articles published in PSPB are becoming more powerful, much to the dismay of Wegener and Fabrigar, who believe that “the recent overemphasis on statistical power should be replaced by a broader approach in which statistical and conceptual forms of validity are considered together” (p. 1114). In contrast, I would argue that even an average power of 48% is ridiculously low. An average power of 48% implies that many studies have even less than 48% power.
More than 50 years ago, famous psychologists Amos Tversky and Daniel Kahneman (1971) wrote “we refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis” (p. 110). Wegener and Fabrigar prove them wrong. Not only are they willing to conduct these studies, they even propose that doing so is scientific and that demanding more power can have many negative side-effects. Similar arguments have been made by other social psychologists (Finkel, Eastwick, & Reis, 2017).
I am siding with Kahneman, who realized too late that he placed too much trust in questionable results produced by social psychologists and compared some of this research to a train wreck (Kahneman, 2017). However, there is no consensus among psychologists, and readers of social psychological research have to make up their own mind. This blog post only points out that social psychology lacks clear scientific standards and has no proper mechanism to ensure that theoretical claims rest on solid empirical foundations. Researchers are still allowed to use questionable research practices to present overly positive results. At this point, the credibility of results depends on researchers’ willingness to embrace open science practices. While many young social psychologists are motivated to do so, Wegener and Fabrigar’s article shows that they are facing resistance from older social psychologists who are trying to defend the status quo of underpowered research.
I am not the first and I will not be the last to point out that the traditional peer-review process is biased. After all, who would take on the thankless job of editing a journal if it did not come with the influence and power to select articles you like and to reject articles you don’t like? Authors can only hope that they find an editor who favors their story during the process of shopping around a paper. This is a long and frustrating process. My friend Rickard Carlsson created a new journal that operates differently with a transparent review process and virtually no rejection rate. Check out Meta-Psychology. I published two articles there that reported results based on math and computer simulations. Nobody challenged the validity, but other journals rejected the work based on politics (AMMPS rejection).
The biggest event in psychology, especially social psychology, in the past decade (2011-2020) was the growing awareness of the damage caused by selective publishing of significant results. It has long been known that psychology journals nearly exclusively publish statistically significant results (Sterling, 1959). This made it impossible to publish studies with non-significant results that could correct false positive results. It was long assumed that this was not a problem because false positive results are rare. What changed over the past decade was that researchers published replication failures that cast doubt on numerous classic findings in social psychology such as unconscious priming or ego-depletion.
Many, if not most, senior social psychologists have responded to the replication crisis in their field with a variety of defense mechanisms, such as repression or denial. Some have responded with intellectualization/rationalization and were able to publish their false arguments to dismiss replication failures in peer-reviewed journals (Bargh, Baumeister, Gilbert, Fiedler, Fiske, Nisbett, Stroebe, Strack, Wilson, etc., to name the most prominent ones). In contrast, critics had a harder time making their voices heard. Most of my work on this topic has been published in blog posts in part because I don’t have the patience and frustration tolerance to deal with reviewer comments. However, this is not the only reason and in this blog post I want to share what happened when Moritz Heene and I were invited by Christoph Klauer to write an article on this topic for the German journal “Psychologische Rundschau”.
For readers who do not know Christoph: he is a very smart social psychologist who worked as an assistant professor with Hubert Feger when I was an undergraduate student. I respect his intelligence and his work, such as his work on the Implicit Association Test.
Maybe he invited us to write a commentary because he knew me personally. Maybe he respected what we had to say. In any case, we were invited to write an article and I was motivated to get an easy ‘peer-reviewed’ publication, even if nobody outside of Germany cares about a publication in this journal.
After submitting our manuscript, I received the following response in German. I used http://www.DeepL.com/Translator (free version) to share an English version.
Thu 2016-04-14 3:50 AM
Thank you very much for the interesting and readable manuscript. I enjoyed reading it and can agree with most of the points and arguments. I think this whole debate will be good for psychology (and hopefully social psychology as well), even if some are struggling at the moment. In any case, the awareness of the harmfulness of some previously widespread habits and the realization of the importance of replication has, in my impression, increased significantly among very many colleagues in the last two to three years.
Unfortunately, for formal reasons, the manuscript does not fit so well into the planned special issue. As I said, the aim of the special issue is to discuss topics around the replication question in a more fundamental way than is possible in the current discussions and forums, with some distance from the current debates. The article fits very well into the ongoing discussions, with which you and Mr. Heene are explicitly dealing, but it misses the goal of the special issue. I’m sorry if there was a misunderstanding.
That in itself would not be a reason for rejection, but there is also the fact that a number of people and their contributions to the ongoing debates are critically discussed. According to the tradition of the Psychologische Rundschau, each of them would have to be given the opportunity to respond in the issue. Such a discussion, however, would go far beyond the intended scope of the thematic issue. It would also pose great practical difficulties, because of the German language, to realize this with the English-speaking authors (Ledgerwood; Feldman Barrett; Hewstone, however, I think can speak German; Gilbert). For example, you would have to submit the paper in an English version as well, so that these authors would have a chance to read the criticisms of their statements. Their comments would then have to be translated back into German for the readers of Psychologische Rundschau.
All this, I am afraid, is not feasible within the scope of the special issue in terms of the amount of space and time available. Personally, as I said, I find most of your arguments in the manuscript apt and correct. From experience, however, it is to be expected that the persons criticized will have counter-arguments, and the planned special issue cannot and should not provide such a continuation of the ongoing debates in the Psychologische Rundschau. We currently have too many discussion forums in the Psychologische Rundschau, and I do not want to open yet another one.
I ask for your understanding and apologize once again for apparently not having communicated the objective of the special issue clearly enough. I hope you and Mr. Heene will not hold this against me, even though I realize that you will be disappointed with this decision. However, perhaps the manuscript would fit well in one of the Internet discussion forums on these issues or in a similar setting, of which there are several and which are also emerging all the time. For example, I think the Fachgruppe Allgemeine Psychologie is currently in the process of setting up a new discussion forum on the replicability question (although there was also a deadline at the end of March, but perhaps the person responsible, Ms. Bermeitinger from the University of Hildesheim, is still open for contributions).
I am posting this letter now because the forced resignation of Fiedler as editor of Perspectives on Psychological Science made it salient how political publishing in psychology journals is. Many right-wing media commented on this event to support their anti-woke, pro-doze culture wars. They want to maintain the illusion that current science (I focus on psychology here) is free of ideology and only interested in searching for the truth. This is BS. Psychologists are human beings and show in-group bias. When most psychologists in power are old, White, men, they will favor old, White, men that are like them. Like all systems that work for the people in power, they want to maintain the status quo. Fiedler abused his power to defend the status quo against criticisms of a lack of diversity. He also published several articles to defend (social) psychology against accusations of shoddy practices (questionable research practices).
I am also posting it here because a very smart psychologist stated in private that he agreed with many of the critical comments that we made about replication-crisis deniers. As science is a social game, it is understandable that he never commented on this topic in public (If he doesn’t like that I am making his comments public, he can say that he was just polite and didn’t really mean what he wrote).
I published a peer-reviewed article on the replication crisis and the shameful response by many social psychologists several years later (Schimmack, 2020). A new generation of social psychologists is trying to correct the mistakes of the previous generation, but as so often, they do so without the support or even against the efforts of the old guard that cannot accept that many of their cherished findings may die with them. But that is life.
The journal Psychological Inquiry publishes theoretical articles that are accompanied by commentaries. In a recent issue, prominent implicit cognition researchers discussed the meaning of the term implicit. This blog post differs from the commentaries by researchers in the field by providing an outsider perspective and by focusing on the importance of communicating research findings clearly to the general public. This purpose of definitions was largely ignored by researchers who are more focused on communicating with each other than with the general public. I will show that this outsider perspective favors a definition of implicit bias in terms of the actual research that has been conducted under the umbrella of implicit social cognition research rather than a definition that renders 30 years of research useless with a simple stroke of a pen. If social cognition researchers want to communicate about implicit bias as empirical scientists, they have to define implicit bias as effects of automatically activated information (associations, stereotypes, attitudes) on behavior. This is what they have studied for 30 years. Defining implicit bias as unconscious bias is not helpful because 30 years of research have failed to provide any evidence that people can act in a biased way without awareness. Although unconscious biases may occur, there is currently no scientific evidence to inform the public about unconscious biases. While the existing research on automatically activated stereotypes and attitudes has problems, the topic remains important. As the term implicit bias has caught on, it can be used in communications with the public, but it should be made clear that implicit does not mean unconscious.
Psychologists are notoriously sloppy with language. This leads to misunderstandings and unnecessary conflicts among scientists. However, the bigger problem is a break-down in communication with the general public. This is particularly problematic in social psychology because research on social issues can influence public discourse and ultimately policy decisions.
One of the biggest case-studies of conceptual confusion that had serious real-world consequences is the research on implicit cognition that created the popular concept of implicit bias. Although the term implicit bias is widely used to talk about racism, the term lacks clear meaning.
The Stanford Encyclopedia of Philosophy defines implicit bias as a tendency to “act on the basis of prejudice and stereotypes without intending to do so.” However, lack of intention (not wanting to) is only one of several meanings of the term implicit. Another meaning of the word implicit is automatic activation of thoughts. For example, a Scientific American article describes implicit bias as a “tendency for stereotype-confirming thoughts to pass spontaneously through our minds.” Notably, this definition of implicit bias clearly implies that people are aware of the activated stereotype. The stereotype-confirming thought is in people’s mind and not activated in some other area of the brain that is not accessible to consciousness. This definition also does not imply that implicit bias results in biased behavior because awareness makes it possible to control the influence of activated stereotypes on behavior.
Merriam Webster Dictionary offers another definition of implicit bias as “a bias or prejudice that is present but not consciously held or recognized.” In contrast to the first two meanings of implicit bias, this definition suggests that implicit bias may occur without awareness; that is implicit bias = unconscious bias.
The different definitions of implicit bias lead to very different explanations of biased behavior. One explanation assumes that implicit biases can be activated and guide behavior without awareness and individuals who act in a biased way may either fail to recognize their biases or make up some false explanation for their biased behaviors after the fact. This idea is akin to Freud’s notion of a powerful, autonomous unconscious (the Id) that can have subversive effects on behavior that contradict the values of a conscious, moral self (Super-Ego). Given the persistent influence of Freud on contemporary culture, this idea of implicit bias is popular and reinforced by the Project Implicit website that offers visitors tests to explore their hidden (hidden = unconscious) biases.
The alternative interpretation of implicit bias is less mysterious and more mundane. It means that our brain constantly retrieves information from memory that is related to the situation we are in. This process does not have a filter to retrieve only information that we want. As a result, we sometimes have unwanted thoughts. For example, even individuals who do not want to be prejudiced will sometimes have unwanted stereotypes and associated negative feelings pop into their mind (Scientific American). No psychoanalysis or implicit test is needed to notice that our memory has stored stereotypes. In safe contexts, we may even laugh about them (Family Guy). In theory, awareness that a stereotype was activated also makes it possible to make sure that it does not influence behavior. This may even be the main reason for our ability to notice what our brain is doing. Rather than acting in a reflexive way to a situation, awareness makes it possible to respond more flexibly to a situation. When implicit is defined as automatic activation of a thought, the distinction between implicit and explicit bias becomes minor and academic because the processes that retrieve information from memory are automatic. The only difference between implicit and explicit retrieval of information is that the process may be triggered spontaneously by something in our environment or by a deliberate search for information.
After more than 30 years of research on implicit cognitions (Fazio, Sanbonmatsu, Powell, & Kardes, 1986), implicit social cognition researchers increasingly recognize the need for clearer definitions of the term implicit (Gawronski, Ledgerwood, & Eastwick, 2022a), but there is little evidence that they can agree on a definition (Gawronski, Ledgerwood, & Eastwick, 2022b). Gawronski et al. (2022a, 2022b) propose to limit the meaning of implicit bias to unconscious biases; that is, individuals are unaware that their behavior was influenced by activation of negative stereotypes or affects/attitudes: “instances of bias can be described as implicit if respondents are unaware of the effect of social category cues on their behavioral response” (p. 140). I argue that this definition is problematic because there is no scientific evidence to support the hypothesis that prejudice is unconscious. Thus, the term cannot be used to communicate scientific results that have been obtained by implicit cognition researchers over the past three decades because these studies did not study unconscious bias.
Implicit Bias Is Not Unconscious Bias
Gawronski et al. note that their decision to limit the term implicit to mean unconscious is arbitrary. “A potential objection against our arguments might be that they are based on a particular interpretation of implicit in IB that treats the term as synonymous with unconscious” (p. 145). Gawronski et al. argue in favor of their definition because “unconscious biases have the potential to cause social harm in ways that are fundamentally different from conscious biases that are unintentional and hard-to-control” (p. 146). The key words in this argument are “have the potential,” which means that there is no scientific evidence that shows different effects of biases with and without awareness of bias. Thus, the distinction is merely a theoretical, academic one without actual real-world implications. Gawronski et al. agree with this assessment when they point out that existing implicit cognition research “provides no information about IB [implicit bias] if IB is understood as an unconscious effect of social category cues on behavioral responses.” It seems bizarre to define the term implicit bias in a way that makes all of the existing implicit cognition research irrelevant. A more reasonable approach would be to define implicit bias in a way that is more consistent with the work of implicit bias researchers. As several commentators pointed out, the most widely used meaning of implicit is automatic activation of information stored in memory about social groups. In fact, Gawronski himself used the term implicit in this sense and repeatedly pointed out that implicit does not mean unconscious (i.e., without awareness) (Appendix 1).
Defining the term implicit as automatic activation makes sense because the standard experimental procedure to study implicit cognition is based on presenting stimuli (words, faces, names) related to a specific group and to examine how these stimuli influence behaviors such as the speed of pressing a button on a keyboard. The activation of stereotypic information is automatic because participants are not told to attend to these stimuli or even to ignore them. Sometimes the stimuli are also presented in subtle ways to make it less likely that participants consciously attend to them. The question is always whether these stimuli activate stereotypes and attitudes stored in memory and how activation of this information influences behavior. If behavior is influenced by the stimuli, it suggests that stereotypic information was activated – with or without awareness. The evidence from studies like these provides the scientific basis for claims about implicit bias. Thus, implicit bias is basically operationally defined as systematic effects of automatically activated information about groups on behavior.
The aim of implicit bias research is to study real-world incidents of prejudice under controlled laboratory conditions. A recent incident of racism shows how activation of stereotypes can have harmful consequences for victims and perpetrators of racist behavior.
The question of consciousness is secondary. What is important is how individuals can prevent harmful consequences of prejudice. What can individuals do to avoid storing negative stereotypes and attitudes in the first place? What can individuals do to weaken stored memories and attitudes? What can individuals do to make it less likely that stereotypes are activated? What can individuals do to control the influence of attitudes when they are activated? All of these questions are important and are related to the concept of implicit as automatic activation of attitudes. The only reason to emphasize unconscious processes would be a scenario where individuals are unable to control the influence of information that influences behavior without awareness. However, given the lack of evidence that unconscious biases exist, it is currently unnecessary to focus on this scenario. Clearly, many instances of biases occur with awareness (“White teacher in Texas fired after telling students his race is ‘the superior one’”).
Unfortunately, it may be surprising for some readers to learn that implicit does not mean unconscious because the term implicit bias has been popularized in part to make a distinction between well-known forms of bias and prejudice and a new form of bias that can influence behavior even when individuals are consciously trying to be unbiased. These hidden biases occur against individuals’ best intentions because they exist in a blind spot of consciousness. This meaning of implicit bias was popularized by Banaji and Greenwald (2013), who also founded the Project Implicit website that provides individuals with feedback about their hidden biases; akin to psychoanalysts who can recover repressed memories.
Gawronski et al. (2022b) point out that Greenwald and Banaji’s theory of unconscious bias evolved independently of research by other implicit bias researchers who focused on automaticity and were less concerned about the distinction between conscious and unconscious biases. Gawronski’s definition of implicit bias as unconscious bias favors Banaji and Greenwald’s school of thought (hidden bias) over other research programs (automatically activated biases). The problem with this decision is that Greenwald and Banaji recently walked back their claims about unconscious biases and no longer maintain that the effects they studied were obtained without awareness (Implicit = Indirect & Indirect ≠ Unconscious, Greenwald & Banaji, 2017). The reversal of their theoretical position is evident in their statement that “even though the present authors find themselves occasionally lapsing to use implicit and explicit as if they had conceptual meaning [unconscious vs. conscious], they strongly endorse the empirical understanding of the implicit–explicit distinction” (p. 892). It is puzzling to see Gawronski arguing for a definition that is based on a theory that the authors no longer endorse. Given the lack of scientific evidence that stereotypes regularly lead to biases without awareness, this might be the time to agree on a definition that matches the actual research by implicit cognition researchers, and the most fitting definition would be automatic activation of stereotypes and attitudes, not unconscious causes of behavior.
Gawronski et al. (2022a) also falsely imply that implicit cognition researchers have ignored the distinction between conscious and unconscious biases. In reality, numerous studies have tried to demonstrate that implicit biases can occur without awareness. To study unconscious biases, social cognition researchers have relied heavily on an experimental procedure known as subliminal priming. In a subliminal priming study, a stimulus (prime) is presented very briefly, outside of the focus of attention, and/or with a masking stimulus. If a manipulation check shows that individuals have no awareness of the prime and the prime influences behavior, the effect appears to occur without awareness. Several studies suggested that racial primes can influence behavior without awareness (Bargh et al., 1996; Davis, 1989).
However, the credibility of these results has been demolished by the replication crisis in social psychology (Open Science Collaboration, 2015; Schimmack, 2020). Priming research has been singled out as the field with the biggest replication problems (Kahneman, 2012). When asked to replicate their own findings, leading priming researchers like Bargh refused to do so. Thus, while subliminal priming studies started the implicit revolution (Greenwald & Banaji, 2017), the revolution imploded over the past decade when doubts about the credibility of the original findings increased.
Unfortunately, researchers within the field of implicit bias research often ignore the replication crisis and cite questionable evidence as if it provided solid evidence for unconscious biases. For example, Gawronski et al. (2022b) suggest that unconscious biases may contribute to racial disparities in use-of-force errors such as the high-profile killing of Philando Castile. To make this case, they use a (single) study of 58 White undergraduate students (Correll, Wittenbrink, Crawford, & Sadler, 2015, Study 3). The study asked participants to make shoot vs. no-shoot decisions in a computer task (game) that presented pictures of White or Black men holding a gun or another object. Participants were instructed to make one quick decision within 630 milliseconds and another decision without time restriction. Gawronski et al. suggest that failures to correct an impulsive error given ample time to do so constitute evidence of unconscious bias. They summarized the results as evidence that “unconscious effects on basic perceptual processes play a major role in tasks that more closely resemble real-world settings” (p. 226).
Fact checking reveals that this characterization of the study and its results is at least misleading, if not outright false. First, it is important to realize that the critical picture was presented for only 175ms and immediately replaced by another picture to wipe out visual memory. Although this is not a strictly subliminal presentation of stimuli, it is clearly a suboptimal presentation of stimuli. As a result, participants sometimes had to guess what the object was. They also had no other information to know whether their initial perception was correct or incorrect. The fact that participants’ performance improved without time pressure may be due to response errors under time pressure and this improvement was evident independent of the race of the men in the picture.
Without time pressure, participants shot 85% of armed Black men and 83% of armed White men. For unarmed men, participants shot 28% of Black men and 25% of White men. The statistical comparison of these differences showed only weak evidence of a systematic bias. The comparison for unarmed men produced a p-value that was significant with the standard criterion of alpha = .05, F(1,53) = 6.65, p = .013, but not with the more stringent criterion of alpha = .005 that is used to predict a high chance of replication. The same is true for the comparison of responses to pictures of unarmed men, F(1,53) = 4.96, p = .031. To my knowledge, this study has not been replicated and Gawronski et al.’s claim rests entirely on this single study.
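Readers who want to check the reported p-values against the two alpha criteria can do so with a few lines of code. The following is a minimal sketch in Python that uses only the standard library. It relies on the fact that an F-test with one numerator degree of freedom is equivalent to a two-sided t-test with t = sqrt(F). The function names and the numerical integration are my own choices for illustration and are not taken from the original article.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def p_from_f(f_value, df_error, steps=2000):
    """p-value for F(1, df_error) via the equivalent two-sided t-test.

    With one numerator degree of freedom, F = t**2, so the p-value is the
    area of the t distribution outside the interval [-sqrt(F), +sqrt(F)].
    The area inside the interval is computed with Simpson's rule
    (steps must be even).
    """
    t = math.sqrt(f_value)
    h = t / steps
    total = t_pdf(0.0, df_error) + t_pdf(t, df_error)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_error)
    central_area = 2 * (h / 3) * total  # area between -t and +t
    return 1 - central_area

# The two test statistics reported in the text, both with 53 error df.
for f_value in (6.65, 4.96):
    p = p_from_f(f_value, 53)
    print(f"F(1,53) = {f_value}: p = {p:.3f}, "
          f"p < .05: {p < .05}, p < .005: {p < .005}")
```

Both p-values should come out close to the reported values, that is, below .05 but well above .005.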
Even if these effects could be replicated in the laboratory, they do not provide any information about unconscious biases in the real world because the study lacks ecological validity. To make claims about the real world, it is necessary to study police officers in simulations of real-world scenarios (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2021). This research is rare, difficult, and has not yet produced conclusive results. Andersen et al. (2021) found a small racial bias, but the sample was too small to provide meaningful information about the amount of racial bias in the real world. Most important, however, real-world scenarios provide ample information to see whether a suspect is Black or White and is armed or not. The real decision is often whether use of force is warranted or not. Racial biases in these shooting errors are important, but they are not unconscious biases.
Contrary to Gawronski et al., I do not believe that social cognition researchers’ focus on automatic biases rather than unconscious biases was a mistake. The real mistake was the focus on reaction times in artificial computer tasks rather than studying racial biases in the real world. As a result, thirty years of research on automatic biases have produced little insight into racial biases in the real world. To move the field towards the study of unconscious biases would be a mistake. Instead, social cognition researchers need to focus on outcome variables that matter.
The term implicit bias can have different meanings. Gawronski et al. (2022a) proposed to limit the meaning of the term to unconscious bias. I argue that this definition of implicit bias is not useful because most studies of implicit cognition are studies in which racial stereotypes and attitudes toward stigmatized groups are automatically activated. In contrast, priming studies that tried to distinguish between conscious and unconscious activation of this information have been discredited during the replication crisis and there exists no credible empirical evidence to suggest that unconscious biases exist or contribute to real-world behavior. Thus, funding a new research agenda focusing on unconscious biases may waste resources that are better spent on real-world studies of racial biases. Evidently, this conclusion diverges from the conclusion of implicit cognition researchers who are interested in continuing their laboratory studies, but they have failed to demonstrate that their work makes a meaningful contribution to society. To make research on automatic biases more meaningful, implicit bias research needs to move from artificial outcomes like reaction times on computer tasks to actual behaviors.
Implicit Cognition Research Focusses on Automatic (Not Unconscious) Processes
Gawronski & Bodenhausen (2006), WOS/11/22 1,537
“If eras of psychological research can be characterized in terms of general ideas, a major theme of the current era is probably the notion of automaticity” (p. 692)
This perspective is also dominant in contemporary research on attitudes, in which deliberate, “explicit” attitudes are often contrasted with automatic, “implicit” attitudes (Greenwald & Banaji, 1995; Petty, Fazio, & Briñol, in press; Wilson, Lindsey, & Schooler, 2000; Wittenbrink & Schwarz, in press).
“We assume that people generally do have some degree of conscious access to their automatic affective reactions and that they tend to rely on these affective reactions in making evaluative judgments (Gawronski, Hofmann, & Wilbur, in press; Schimmack & Crites, 2005)” (p. 696).
“The distinction between automatic and controlled processes now occupies a central role in many areas of social psychology and is reflected in contemporary dual-process theories of prejudice and stereotyping (e.g., Devine, 1989)” (p. 469)
“Specifically, we argued that performance on implicit measures is influenced by at least four different processes: the automatic activation of an association (association activation), the ability to determine a correct response (discriminability), the success at overcoming automatically activated associations (overcoming bias), and the influence of response biases that may influence responses in the absence of other available guides to response (guessing)” (p. 482)
Gawronski & DeHouwer (2014), WOS 11/22 240
“other researchers assume that the two kinds of measures tap into distinct memory representations, such that explicit measures tap into conscious representations whereas implicit measures tap into unconscious representations (e.g., Greenwald & Banaji, 1995). Although the conceptualizations are relatively common in the literature on implicit measures, we believe that it is conceptually more appropriate to classify different measures in terms of whether the to-be-measured psychological attribute influences participants’ responses on the task in an automatic fashion (De Houwer, Teige-Mocigemba, Spruyt, & Moors, 2009).” (p. 283)
Hofmann, Gawronski, Le, & Schmitt, PSPB, 2005, WoS/11/22
“These [implicit] measures—most of them based on reaction times in response compatibility tasks (cf. De Houwer, 2003)—are intended to assess relatively automatic mental associations that are difficult to gauge with explicit self-report measures”. (p. 1369)
“A common explanation for these findings is that the spontaneous behavior assessed in these studies is difficult to control, and thus more likely to be influenced by automatic evaluations, such as they are reflected in indirect attitude measures” (p. 492)
“there is no empirical evidence that people lack conscious awareness of indirectly assessed attitudes per se” (p. 496)
“Phenomena such as stereotype and attitude activation can be readily reconstructed as instance-based automaticity. For example, perceiving a person of a stereotyped group or an attitude object may be sufficient to activate well-practiced stereotypic or evaluative associations in memory” (p. 386)
Implicit measures are important even if they do not assess unconscious processes.
Hofmann, Gawronski, Le, & Schmitt, PSPB, 2005, WoS/11/22
“Arguably one of the most important contributions in social cognition research within the last decade was the development of implicit measures of attitudes, stereotypes, self-concept, and self-esteem (e.g., Fazio, Jackson, Dunton, & Williams, 1995; Greenwald, McGhee, & Schwartz, 1998; Nosek & Banaji, 2001; Wittenbrink, Judd, & Park, 1997).” (p. 1369)
Gawronski & DeHouwer (2014), WOS 11/22 240
“For the decade to come, we believe that the field would benefit from a stronger focus on underlying mechanisms with regard to the measures themselves as well as their capability to predict behavior (see also Nosek, Hawkins, & Frazier, 2011).” (p. 303)
Post-war American Psychology is rooted in behaviorism. The key assumption of behaviorism is that psychology (i.e., the science of the mind) should only study phenomena that are directly observable. As a result, the science of the mind became the science of behavior. While behaviorism is long dead (see the 1990 funeral here), its (harmful) effect on psychology is still noticeable today. One lasting effect is psychologists’ aversion to making causal attributions to the mind (cognitive processes). While cognitive processes cannot be directly observed with the human senses (we cannot see, touch, smell, or hear what goes on in somebody’s mind), we can indirectly observe these processes on the basis of observable behaviors. A whole different discipline that is called psychometrics has developed elaborate theories and statistical models to relate observed behaviors to unobserved processes in the mind. Unfortunately, psychometrics is often not covered in the education of psychologists. As a result, psychologists often make simple mistakes when they apply psychometric tools to psychological questions.
In the language of psychometrics, observed behaviors are observed variables and unobserved mental processes are unobserved variables that are also often called latent variables (latent = existing but not yet developed or manifest; hidden or concealed). The goal of psychometrics is to find systematic relationships between observed and latent variables that make it possible to study mental processes. We can compare this process to the task of early astronomers to make sense of the lights in the night sky. Bright stars are like observable indicators and the task of astronomers is to explain the behavior of these observable variables with unobserved forces. Astronomy has come a long way from seeing astrological signs in the sky, but psychology is pretty much at this early stage of science, where most of the cognitive processes that cause observable behaviors are unknown. In fact, some psychologists still resist the idea that observable behavior can be explained by latent variables (Borsboom et al., 2021). Others, however, have used psychometric tools, but fail to understand the basic properties of psychometric models (e.g., Digman, 1997; DeYoung & Peterson, 2002; Musek, 2007). Here, I give a simple introduction to the basic logic of psychometric models and illustrate how applied psychologists can get lost in latent variable space.
Figure 2 shows the most basic psychometric model that relates an observed variable to an unobserved cause. I am using a widely used measure of life-satisfaction as an example. Please rate your life on a scale from 0 = worst possible life to 10 = best possible life. Thousands of studies with millions of respondents have used this question to study “the secret of happiness.” Behaviorists would treat this item as a stimulus and participants’ responses on the 11-point rating scale as behaviors. One problem for behaviorists is that participants will respond differently to the same question. Responses vary from 0 (very rarely) all the way to 10 (more often, but still rare). The modal response in affluent Western countries is 7. Behaviorism has no answer to the question why participants respond differently to the same situation (i.e., question). Some researchers have tried to provide a behavioristic answer by demonstrating that responses can be manipulated (e.g., responses are different in a good or bad mood; Schwarz & Strack, 1999; Kahneman, 2011). However, these effects are small and do not explain why responses are highly stable over time and across different situations (Schimmack & Oishi, 2005). To explain why some people report higher levels of life-satisfaction than others, we have to invoke unobserved causes within respondents’ minds. Just like the forces that created the universe, these causes are not directly observable, but we know they must exist because we observe variation in responses that cannot be explained by variation in the situation (i.e., same situation and different behaviors imply internal causes).
Psychologists have tried to understand the mental processes that produce variation in Cantril ladder scores for nearly 100 years (Andrews & Withey, 1976; Cantril, 1965; Diener, 1984; Hartmann, 1936). In the 1980s, focus shifted from thoughts about one’s life (e.g., I hate my work, I love my spouse, etc.) to the influence of personality traits (Costa & McCrae, 1980). Just like life-satisfaction, personality is a latent variable that can only be measured indirectly by observing differences in behaviors in the same situation. The most widely used observed variables to do so are self-ratings of personality.
The key problem for the measurement of unobserved mental processes is that variation in observed scores can be caused by many different mental processes. To go beyond the level of observed variation in behaviors, it is necessary to separate the different causes that contribute to the variance in observed scores. The first step is to separate causes that produce measurement error. The most widely used approach to do so is to ask the same or similar questions repeatedly and to consider variability in responses as measurement error. The next figure shows a model for responses to two similar items.
When two or more observed variables are available, it is possible to examine the correlation between two variables. If two observed variables share a common cause, they are going to be correlated. The strength of the correlation depends on the relative strength of the shared mental process and the unique mental processes. Psychometrics works in reverse and makes inferences about the unobserved causes by examining the observed correlations. To do so, it is necessary to make some assumptions, and this is where things can go wrong when researchers do not understand these assumptions.
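The relationship between a shared cause and the observed correlation can be illustrated with a small simulation. The following is a minimal sketch in Python (standard library only) and not an analysis of real data. It assumes standardized variables, normally distributed causes, and unique causes that are independent of each other and of the shared cause. The names `simulate_items` and `pearson` are made up for this illustration. Under these assumptions, the expected correlation between two items is the product of their loadings on the shared cause.

```python
import math
import random

def simulate_items(loading_1, loading_2, n=20000, seed=1):
    """Simulate two observed items that share one unobserved cause.

    Each item = loading * shared cause + sqrt(1 - loading**2) * unique cause,
    so each item has unit variance (assumption: standardized variables).
    """
    rng = random.Random(seed)
    item_1, item_2 = [], []
    for _ in range(n):
        shared = rng.gauss(0, 1)    # shared mental process
        unique_1 = rng.gauss(0, 1)  # unique cause of item 1
        unique_2 = rng.gauss(0, 1)  # unique cause of item 2
        item_1.append(loading_1 * shared + math.sqrt(1 - loading_1 ** 2) * unique_1)
        item_2.append(loading_2 * shared + math.sqrt(1 - loading_2 ** 2) * unique_2)
    return item_1, item_2

def pearson(a, b):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Stronger shared causes produce stronger observed correlations.
for loading in (0.3, 0.6, 0.9):
    item_1, item_2 = simulate_items(loading, loading)
    print(f"loading = {loading}: observed r = {pearson(item_1, item_2):.2f}, "
          f"implied r = {loading * loading:.2f}")
```

Working in reverse, an observed correlation only tells us about the loadings if the assumptions of the model hold, for example, that the unique causes are unrelated.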
A common assumption is that the shared causal processes are important and meaningful, whereas the unique mental processes are unimportant, irrelevant, and error variance. Based on this assumption, the model is often drawn differently. Sometimes, the shared unobserved variable is drawn on top, and the unshared unobserved variables are drawn at the bottom (top = important, bottom = unimportant).
Sometimes, the unique mental processes are drawn smaller and without a name.
And sometimes, they are simply omitted because they are considered unimportant and irrelevant.
The omission of the unshared causes makes sense when psychometricians communicate with each other because they are trained in understanding psychometric models and use figures merely as a short-hand to communicate with each other. However, when psychometricians communicate with psychologists things can go horribly wrong because psychologists may not realize that the omission of residuals is based on assumptions that can be right or wrong. They may simply assume that the unique variances are never important and can always be omitted. However, this is a big mistake with undesirable consequences. To demonstrate this, I am always going to show the unique causes of all variables in the following models.
When psychologists ask similar questions repeatedly, they are assuming that the unique causes of the responses are measurement error. In the present example, individuals may interpret the words “worry” and “nervous” somewhat differently and this may elicit different mental processes that result in slightly different responses. However, the two terms are sufficiently similar that they also elicit similar cognitive processes that produce a correlation between responses to the two items. Under this assumption, the common causes reflect the causes that are of interest and the unique causes produce error variance. Under the assumption that unique causes produce error variance, it is possible to average responses to similar items. These averages are called scales. Averaging amplifies the variance that is produced by shared causes.
This is illustrated in the next figure where the average is fully determined by the two observed variables “I often worry” and “I am often nervous.” To make this a measurement model, we have to relate the average scores to the unobserved variables. Now we see that the shared mental process variable has two ways to influence the average scores, whereas each of the unique causes has only one way to contribute to the average. As the number of variables increases, the ratio (2:1) becomes even bigger for the shared variable (3 variables, 3:1). This implies that the shared mental processes more and more determine the average scores. This is the only part of measurement theory that psychologists are taught and understand, as reflected in the common practice of reporting Cronbach’s alpha (a measure of the shared variance in the average scores) as evidence that a measure is a good measure (Flake & Fried, 2020). However, the real measurement problems are not addressed by averaging across similarly-worded items. This is revealed in the next figure.
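The amplification of shared variance by averaging can be written as a one-line formula. This is a minimal sketch under simplifying assumptions that are not in the original text: all items are standardized, all pairs of items correlate equally, and the unique causes are independent. The inter-item correlation of .4 is a hypothetical value. Under these assumptions the formula is the Spearman-Brown formula, which also gives standardized alpha.

```python
def shared_share_of_average(n_items, inter_item_r):
    """Proportion of variance in an item average that is due to the shared cause.

    Assumes standardized items with equal inter-item correlations.
    """
    return n_items * inter_item_r / (1 + (n_items - 1) * inter_item_r)


# Hypothetical inter-item correlation of .4
for k in (1, 2, 3, 10):
    print(k, round(shared_share_of_average(k, 0.4), 3))
```

With these values, the share of shared variance is about .57 for two items and about .67 for three. The formula says nothing about what the shared cause is, which is the problem addressed in the following paragraphs.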
To use the average of responses to similarly worded items as an observed measure of an unobserved personality trait, we have to assume that the shared mental processes that produce most of the variance in the average scores are caused by the personality trait that we are trying to measure. In the present example, personality psychologists use items like “worry” and “nervous” to measure a trait called Neuroticism. Despite 100 years of research, it is still not clear what Neuroticism is and some psychologists still doubt that Neuroticism even exists. Those who do believe in Neuroticism assume it reflects a general disposition to have more negative thoughts (e.g., low self-esteem, pessimism) and feelings (anxiety, anger, sadness, guilt). The main problem in current personality research is that item-averages are often treated as if they are perfect observed indicators of an unobserved personality trait (see next figure).
Ample research suggests that average scores of neuroticism items are also influenced by other factors such as socially desirable responding. Thus, it is a simplification to assume that item-averages are identical or isomorphous to the personality trait that they are designed to measure. Nevertheless, it is common for personality psychologists to study the influence of unobserved causes like Neuroticism by means of item averages. As we see later, even when psychologists use latent variable models, Neuroticism is just a label for an item-average. The problem with this practice is that it gives the illusion that we can study the causal effects of unobservable personality traits by examining the correlations of observable item-averages.
In this way, measurement problems are treated as unimportant, just like behaviorists considered mental processes as unimportant and relegated them to a black box that should not be examined. The same attitude prevails today with regards to personality measurement, when boxes (observed variables) are given names without checking that the labels actually match the content of the box (i.e., the unobserved causes that a measure is supposed to reflect). Often psychological constructs are merely labels for item-averages. Accordingly, neuroticism is ‘operationalized’ with an item-average and neuroticism can be defined as “whatever a neuroticism scale measures.”
When Things Go from Bad to Worse
In the 1980s, personality psychologists came to a broad consensus that the diversity of human traits (e.g., anxious, bold, curious, determined, energetic, frank, gentle, helpful, etc.) can be organized into a taxonomy with five broad traits, known as the Big Five. The basic idea is illustrated in the next figure with Neuroticism. According to Big Five theory, Neuroticism is a general disposition to experience more anxiety, anger, and sadness. However, each emotion also has its own disposition. Thus, variation in scales that measure anxiety, anger, and sadness is influenced by both Neuroticism (i.e., the general disposition) and specific causes. In addition, scales can also be influenced by general and specific measurement errors. The figure makes it clear that the scores in the item-averages can reflect many different causes aside from the intended broader personality trait called Neuroticism. This makes it risky to rely on these item averages to draw inferences about the unobserved variable Neuroticism.
A true science of personality would try to separate these different causes and to examine how they relate to other variables. However, personality psychologists often hide the complexity of personality measurement by treating personality scales as if they directly reflect a single cause. While this is bad enough, things get even worse when personality psychologists speculate about even broader personality traits.
The General Personality Factor (Musek, 2007)
The Big Five were considered to be roughly independent from each other. In fact, they were found with a method that looked for independent factors (another name for unobserved variables that is more commonly used in personality research). However, when Digman (1997) examined correlations among item-averages, he found some systematic patterns in these correlations. This led him to postulate even broader factors than the Big Five that might explain these patterns. The problem with these theories is that they are no longer trying to relate observed variables to unobserved variables. Rather, Digman started to speculate about causal relationships among unobserved variables on the basis of imperfect indicators of the Big Five.
The first problem with Digman’s attempt to explain correlations among unobserved variables was that he lacked expertise in the use of psychometric models. As a result, he made some mistakes and his results could not be replicated (Anusic et al., 2009). A few years later, a study that controlled for some of the measurement problems by using self-ratings and informant ratings suggested that the Big Five are more or less independent and that correlations reflect measurement error (Biesanz & West, 2004; see also Anusic et al., 2009). However, other studies suggested that higher-order factors exist and may have powerful effects on people’s lives, including their well-being. Subsequently, I am going to show that these claims are based on a simple misunderstanding of measurement models that treat unique variance in the Big Five scales as error variance.
Musek (2007) proposed that correlations among Big Five scales can be explained with a single higher-order factor. This model is illustrated in his Figure 1.
First, it is notable that the unique mental processes that contribute to each of the Big Five scales are called e1 to e5 and the legend of the figure explains that e stands for error variances. This terminology can be justified if we treat Big Five scales only as observed variables that help us to observe the unobserved variable GFP. As GFP is not directly observable, we have to infer its presence from the correlations among the observed variables, namely the Big Five scales. However, labeling the unique causes that produce variation in Neuroticism scores error variance is dangerous because we may think that the unique variance in Neuroticism is no longer important; just error. Of course, this variance is not error variance in some absolute sense. After all, Neuroticism scales exist only because personality psychologists assume that Neuroticism is a real personality trait that is related to even more specific traits like anxiety, anger, and sadness. Thus, all of the variance in a neuroticism scale is assumed to be important and it would be wrong to assume that only the variance shared with other Big Five scales is important. To avoid this misinterpretation, it would be better to keep the unique causes in the model.
Another problem of this model is that the model itself provides no information about the actual causes of the correlations among the Big Five scales. This is different when items are written for the explicit purpose of measuring something that they have in common. In contrast, the correlations among the Big Five traits are an empirical phenomenon that requires further investigation to understand the nature of the causal processes that produce correlations. In other words, GFP is just a name for “shared cognitive processes;” it does not tell us what these shared cognitive processes are. To examine this question, it is necessary to see how the GFP is related to other things. This is where things go horribly wrong. Rather than relating the unobserved variable in Figure 1 to other measures, Musek (2007) averages all Big Five items to create an item average that is supposed to represent the unobserved variable. He then uses correlations of the GFP scale to make inferences about the GFP factor. The problems of this approach are illustrated with the next figure.
The figure illustrates that the general personality scale is not a good indicator of the general personality factor. The main problem is that the scale scores are also influenced by the unique causes that contribute to variation in the Big Five scales (on top of measurement error that is not shown in the picture to avoid clutter, but should not be forgotten). The problem is hidden when the unique causes are represented as errors, but unique variance in neuroticism is not error variance. It reflects a disposition to have more negative thoughts and this disposition could have a negative influence on life-satisfaction. This contribution of unique causes is hidden when Big Five scale scores are averaged and labeled General Personality.
Musek (2007) reports a correlation of r = .5 (Study 1) between the general personality scale and a life-satisfaction scale. Musek claims that this high correlation must reveal a true relationship between the general factor of personality and life-satisfaction and cannot reflect a method artifact like socially desirable responding. It is unclear why Musek (2007) relied on an average of Big Five scale scores to examine the relationship of the general factor with life-satisfaction. Latent variable modeling makes it possible to examine the relationship of the general factor directly without the need for scale scores. Fortunately, it is possible to conduct this analysis post-hoc based on the reported correlations in Table 1.
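The contamination of the scale by unique causes can be shown with simple covariance algebra. This is a minimal sketch with hypothetical numbers, not Musek's data: it assumes five standardized scales with equal loadings on the general factor, independent unique causes, and a standardized outcome that depends on the general factor and on the unique cause of one scale. All function and parameter names are made up for illustration.

```python
import math


def scale_outcome_correlation(loading, b_general, b_unique, n_scales=5):
    """Correlation between an average of n_scales scale scores and an outcome.

    Each scale   = loading * general + sqrt(1 - loading^2) * unique_i
    The outcome  = b_general * general + b_unique * unique_1 + residual,
    with the residual chosen so that the outcome has unit variance.
    """
    unique_weight = math.sqrt(1 - loading ** 2)
    # Covariance of the scale average with the outcome
    cov = loading * b_general + unique_weight * b_unique / n_scales
    # Variance of the scale average (general part + averaged unique parts)
    var_scale = loading ** 2 + (1 - loading ** 2) / n_scales
    return cov / math.sqrt(var_scale)


# Hypothetical case: the general factor has NO effect on the outcome,
# but the unique cause of one scale has an effect of .5
r_unique_only = scale_outcome_correlation(loading=0.5, b_general=0.0, b_unique=0.5)
print(round(r_unique_only, 3))
```

In this hypothetical case the scale still correlates with the outcome even though the general factor has no effect at all, so a correlation with the scale cannot by itself be attributed to the general factor.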
The first model created a general personality scale and used the scale as a predictor of life-satisfaction. The only difference to a simple correlation is that the model also includes the implied measurement model. This makes the model testable because it imposes restrictions on the correlations of the Big Five scales with the life-satisfaction scale. The fit of the model was acceptable, but not great, suggesting that alternative models might produce even better fit, RMSEA = .078, CFI = .958.
In this model, it is possible to trace the paths from the unobserved variables to life-satisfaction. The strongest relationship was the path from the general personality factor (h) to life-satisfaction, b = .42, se = .04, but the model also implied that unique variances of the Big Five scales contribute to life-satisfaction. These effects are hidden when the general personality scale is interpreted as if it is a pure measure of the general personality factor.
A direct test of the assumption that the general factor is the only predictor of life-satisfaction requires a simple modification of the model that links life-satisfaction directly to the general factor (h). This model actually fits the data better, RMSEA = .048, CFI = .984. This might suggest that the unique causes of variation in the Big Five are unrelated to life-satisfaction.
However, good fit is not sufficient to accept a model. It is also important to rule out plausible alternative models. An alternative model assumes that the Big Five factors are necessary and sufficient to explain variation in life-satisfaction. There is no reason to create a general scale and use it as a predictor. Instead, life-satisfaction can simply be regressed onto the Big Five scales as indicators of the Big Five factors. In fact, it is always possible to get good fit for a model that uses indicators as predictors of outcomes because the model does not impose any restrictions (i.e., the model is just identified). The only reason why this model fits worse than the other model is that fit indices like RMSEA and CFI reward parsimony and this model uses 5 predictors of life-satisfaction whereas the previous model had only one predictor. However, parsimony cannot be used to falsify a model.
In fact, it is possible to find an even better fitting model because only two of the five Big Five scales, neuroticism and extraversion, were significant predictors of life-satisfaction. This finding is consistent with many previous studies showing that these two Big Five traits are the strongest predictors of life-satisfaction. If the model is limited to these two predictors, it fits the data better than the model with a direct path from the general factor, CFI = .987, RMSEA = .045. Musek (2007) was unable to realize that the unique variances in neuroticism and extraversion make a unique contribution to life-satisfaction because the general personality scale does not separate shared and unique causes of variation in the Big Five scales.
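The statement that a regression on all indicators imposes no restrictions can be checked by counting degrees of freedom. This is a minimal sketch that assumes the simplest version of such a model, a regression of one outcome on five predictors whose variances and covariances are all freely estimated; it does not reproduce the exact model fitted here.

```python
def unique_moments(n_observed):
    """Number of unique variances and covariances among the observed variables."""
    return n_observed * (n_observed + 1) // 2


def saturated_regression_parameters(n_predictors):
    """Free parameters when one outcome is regressed on freely correlated predictors."""
    predictor_variances = n_predictors
    predictor_covariances = n_predictors * (n_predictors - 1) // 2
    regression_paths = n_predictors
    residual_variance = 1
    return predictor_variances + predictor_covariances + regression_paths + residual_variance


n_predictors = 5  # the Big Five scales
degrees_of_freedom = unique_moments(n_predictors + 1) - saturated_regression_parameters(n_predictors)
print(degrees_of_freedom)
```

With zero degrees of freedom there is nothing left for the data to contradict, so good fit of such a model carries no information; fixing non-significant paths to zero is what makes the reduced model testable.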
The Correlated Big Two
In contrast to Musek (2007), DeYoung and Peterson favor a model with two correlated higher-order factors (DeYoung, Peterson, & Higgins, 2002; see Schimmack, 2022, for a detailed discussion).
Like Musek (2007), they treat the unique causes of variation in Big Five traits as error (e1-e5) and assume that relationships of the higher-order factors with criterion variables are direct rather than being mediated by the Big Five factors. Here, I fitted this model to Musek’s (2007) data. Fit was excellent, CFI = .996, RMSEA = .030.
Based on this model, life-satisfaction would be mostly predicted by stability rather than neuroticism and extraversion or a general factor. However, just because this model has excellent fit doesn’t mean it is the best model. The model simply masks the presence of a general factor by modeling the shared variance between Plasticity and Stability as a correlated residual. It is also possible to model it with a general factor. In this model, Stability and Plasticity would be an additional level in a hierarchy between the Big Five and the General Factor. This model does not impose any additional restrictions and fits the data as well as the previous model, CFI = .996, RMSEA = .030. Thus, even though Stability and Plasticity can be identified, it does not mean that this distinction is important for the prediction of life-satisfaction. The general factor could still be the key predictor of life-satisfaction.
However, both models make the assumption that the unique causes of variation in Big Five scales are unrelated to life-satisfaction, and we already saw that this assumption is false. As a result, the model that relates life-satisfaction to neuroticism and extraversion fits the data, CFI = .994, RMSEA = .035, and the paths from extraversion and neuroticism to life-satisfaction were significant.
Musek (2007) and DeYoung et al. (2006) ignored the possibility that unique causes of variation in the Big Five contribute to the prediction of other variables because they made the mistake of equating unique variances with error variances. This interpretation is based on the basic examples that are used to illustrate latent variable models for beginners. However, the interpretation of all aspects of a latent variable model, including the residual or unique variances, has to be guided by theory. To avoid these mistakes, psychometricians need to stop presenting their models as if they can be used without substantive theory and substantive researchers need to get better training in the use of psychometric tools.
Compared to other sciences like physics, astronomy, chemistry, or biology, psychology has made little progress over the past 40 years. While there are many reasons for this lack of progress, one problem is the legacy of behaviorism to focus on observable behaviors and to rely on experimentation as the only scientific approach to test causal theories. Another problem is an ideological bias against personality as a causal force that produces variation between individuals (Mischel, 1968). To make progress, personality science has to adopt a new scientific approach that uses observed behaviors to test causal theories of unobservable forces like personality. While personality scales can be used to predict behaviors and life-outcomes, they cannot explain behaviors and life-outcomes. Latent variable modeling provides a powerful tool to test causal theories. The biggest advantage of latent variable modeling is that model fit can be used to reject models. A cynic might think that this is the main reason why latent variable models are not used more by psychologists because it is more fun to build a theory and confirm it rather than to find out that it was false, but fun doesn’t equal scientific progress.
P.S. What about Network Models?
Of course, it is also possible to reject the idea of unobserved variables altogether and draw pictures of the correlations (or partial correlations) among all the observed variables. The advantage of this approach is that it always produces results that can be used to tell an interesting story about the data. The disadvantage is that it always produces a result and therefore doesn’t test any theory. Thus, these models cannot be used to advance personality psychology towards a science that progresses by testing and rejecting false theories.
Awards, Ivy League universities, or prestigious journals are suboptimal heuristics to evaluate people’s work, but in a world of information overflow, they influence the popularity of ideas. Therefore, I am cashing in on Jason Geller’s invitation to present z-curve in the Advanced Research Methods seminar at Princeton.
The talk was recorded and Jason and Princeton University generously shared the recording with me (Video). The talk builds on previous talks, but incorporates the latest z-curve findings that demonstrate the power of z-curve to predict replication failures and to justify the use of alpha = .005 as a reasonable criterion for significance tests to keep the risk of false positive results in psychological journals at a reasonably low level.
You can find many other z-curve related articles and studies on my blog. Here I want to mention only the two peer-reviewed articles that introduced the method and provide more detailed information about the method.
Recently, a team of German sociologists combined data about racial biases in police stops in the United States (Stanford Open Policing Project ; Pierson et al., 2020) and data about county-level average levels of racial biases collected by Project Implicit (Xu et al., 2022). The key finding was that various measures of racial bias were correlated with racial bias in traffic stops by police (published in the Supplement Table 2).
The authors missed an opportunity to examine the validity of different measures of racial attitudes under the assumption that all measures, implicit and explicit, reflect a common attitude rather than distinct attitudes (Schimmack, 2021). If implicit measures tapped some distinct form of unconscious bias, they should show incremental predictive validity. To examine this question, I used the correlations in Table 2 and fitted a structural equation model to the data. I found that a model with a single racial bias factor fitted the data reasonably well, chi2 (df = 9, N ~ 300) = 34.52, CFI = .975, RMSEA = .097. The effect size of b = .369 for bias implies that for every increase in bias by one standard deviation, there is a .369 increase in racial bias in traffic stops. This is considered a moderate effect size in comparison to other effect sizes in the social sciences.
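The reported RMSEA can be checked against the reported chi-square value. This is a minimal sketch that assumes the common formula sqrt(max(chi2 - df, 0) / (df * (N - 1))) and takes N = 300 from the approximate sample size given above. Some programs use N instead of N - 1, which makes no difference at three decimals here.

```python
import math


def rmsea(chi_square, df, n):
    """Root Mean Square Error of Approximation from a model chi-square."""
    return math.sqrt(max(chi_square - df, 0) / (df * (n - 1)))


# Values reported in the text; N is only approximate
rmsea_value = rmsea(chi_square=34.52, df=9, n=300)
print(round(rmsea_value, 3))
```

The result agrees with the reported value of .097 when rounded to three decimals.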
The more interesting result is that the race IAT and simple self-report measures of racial bias are equally valid measures of counties’ average level of racial bias. The effect sizes are .797 for the feeling thermometer, .784 for a simple preference rating, and .834 for the race Implicit Association Test, a computerized task that is less susceptible to socially desirable responding. The high validity coefficients of these measures can be explained by the aggregation of individuals’ scores. Aggregation reduces random measurement error as well as systematic biases that are unique to individuals. Thus, the present results show that race IAT scores are valid measures of racial biases at the aggregated level. The results also show that self-ratings provide as much valid information. This undermines claims by Greenwald, who developed the IAT, that the race IAT is a more valid measure of racial biases than self-ratings (see also Schimmack, 2021, for studies at the individual level).
The figure also shows an additional relationship between the race IAT and the weapons IAT. This relationship reveals that IAT tasks reflect some information that is not captured by self-reports. However, it is not clear whether this variance is method variance or valid variance of unconscious bias. In the latter case, the unique variance in the race IAT could predict police stops in addition to the bias factor (incremental predictive validity).
Adding this path did not improve model fit and the effect size estimate was not significantly different from zero, b = -.045, 95%CI = -.305 to .214. These results are consistent with many other results that the incremental predictive validity of the race IAT is elusive and even if it is not zero, it is likely to be negligible (Kurdi et al., 2019).
In short, the article could have made a nice contribution to the literature by demonstrating that implicit and explicit measures of racial bias show high convergent validity when they are aggregated to measure racial bias of US counties, and by demonstrating that racial bias predicts an important behavior, namely police officers’ decision to conduct a traffic stop.
However, the discussion of the results in the article is problematic and may reveal a sociological bias or the lack of lived experience of German researchers. The authors interpret the results as evidence that situational factors explain the results.
“The observed relationships between regional-level bias and police traffic stops underscore the role of the context in which police officers operate. Our findings are consistent with theorizing by Payne et al. (2017), who argued that some contexts expose individuals more regularly to stereotypes and/or prejudice, increasing mental accessibility of biased thoughts and feelings, in turn influencing individual behavior. Consequently, behavioral expressions of prejudice and stereotypes often reflect properties of contexts rather than stable dispositions of people (but see Connor & Evers, 2020).”
The plausible alternative explanation is relegated to a “but see.” As a German who has lived in the United States and is constantly exposed to US media while living in Canada, I think the “but” deserves more attention and is actually a more plausible explanation of these findings. After all, police officers are not Robo-Cops or United Nations soldiers. They are typically born and raised in or near the county they work in (Flint Town). As a result, their own racial biases are likely to be similar to the racial biases measured in the Project Implicit data (see Andersen et al., 2021, for race IAT scores of police officers). Thus, it is entirely possible that racial biases of police officers, rather than some mysterious unidentified social context, contribute to the racial biases in police stops. This does not mean that social factors are not at play. The fact that racial bias is not some involuntary, unconscious bias means that better training and incentives can be used to reduce bias in police officers’ behaviors without changing their attitudes and feelings. Traffic stops are clearly deliberate actions that are not made in a split second. Thus, officers can be trained in avoiding biases in their actions without the need to change their implicit or explicit attitudes. Although attitude change would be desirable, it is difficult and will take time. For now, Black citizens are likely to settle for equal treatment rather than waiting for changes in implicit attitudes that are difficult to measure and have no known effects on behavior.
In conclusion, it is well known that racism is a problem among US police officers. Often these officers are known and remain on the force. This study shows that these racial attitudes have clear consequences that sometimes lead to the death of innocent Black civilians. To attribute these incidents to some abstract contextual factors ignores the lived experiences of thousands of African Americans. The data are fully consistent with the common assumption of African Americans that racist cops are more likely to pull them over. The present study showed that this fear is more justified in counties with higher levels of racism.
Lew Goldberg has made important contributions to personality psychology. He contributed to the development of the Big Five model that is currently the most widely accepted model of the higher-order factors of personality that describe the relationship among the basic trait words used in everyday language.
He also pioneered open science when he made a large pool of personality items available to all researchers and created open and free measures that mimic proprietary measures like Costa and McCrae’s NEO scales. Because these measures were designed to measure the original scales as closely as possible, the validity of the scales is defined in terms of correlations with the existing scales. The goal of the IPIP project was not to examine validity or to improve on existing measures. As Lew pointed out in a personal correspondence to me, users of IPIP measures could have created new measures based on the initial 300 items. The fact that users of these items have failed to do so shows a lack of interest in construct validation. Thanks to Lew Goldberg, we have open items and open data to develop better measures of personality.
The extended 300-item IPIP measure has been used to provide thousands of internet users free feedback about their personality, and Johnson made his data from these surveys openly available (OSF-data).
The present critical examination of the psychometric properties of the IPIP scales would not be possible without these contributions. My main criticism of personality measurement is that personality psychologists have not used the statistical tools that are needed to validate a personality measure. A common and false belief among personality psychologists is that these tools are not suitable for personality measures. A misleading article by McCrae, Costa and colleagues in the esteemed Journal of Personality and Social Psychology did not help. The authors were unable to fit a Big Five model to their data. Rather than questioning the model, they decided that the method is wrong because “we know that the Big Five model is right”. This absurd conclusion has been ridiculed by psychometricians (Borsboom, 2006), but led only to a defensive response by personality psychologists (Clark, 2006). For the most part, personality psychologists today continue to create scales or use scales that lack proper validation. The IPIP-300 is no exception. This blog post just illustrates with a simple example how bad measurement can derail science.
The IPIP-300 aims to measure 30 personality traits that are called facets. Facets are constructs that are more specific than the Big Five and closer to everyday trait concepts. Each facet is measured with 10 items. The 10 items are summed or averaged to give individuals a score for one of the 30 facets. Each facet has a name. There are two ways to interpret these names. One interpretation is that the name is just a short-hand for a scientific construct. For example, the term Depression is just a name for the sum-scores of 10 items from the IPIP. To know what this sum score actually measures, one might need to examine the item content, learn about the correlations of this sum-score with other sum-scores, or understand the scientific theory that led to the creation of the 10 items. Accordingly, the Depression scale measures whatever it is supposed to measure, and whatever that is, is called Depression. In this case, we could change the name of the scale without changing anything in our understanding of the scale. We could call it the D-scale or just facet number 3 of Neuroticism. Depression is just a name. The alternative view assumes that the 10 items were selected to measure a construct that is at least somewhat related to what we mean by depression in our everyday language. For example, we would be surprised to see the item “I like ginger” or “I often break the rules” in a list of items that are supposed to measure depression. The use of everyday trait words as labels for scales does usually imply that researchers are aiming to measure a construct that is at least similar to the everyday meaning of the label. Unfortunately, this is often not the case and interpreting scales based on their labels can lead to misunderstandings.
To illustrate the problem of misch-mesch-urement, I am using two facet scales from the IPIP-300 that are labeled Depression and Modesty. I used the first 10,000 observations in Johnson’s dataset and selected only US respondents with complete data (N = 6,786). The correlation between Depression and Modesty was r = .35, SE = .01. I replicated this finding with the next 10,000 observations, again selecting only US respondents with complete data (N = 5,864), r = .39, SE = .01. The results clearly show a moderate positive relationship between the two scale scores. A correlation of r = .35 implies that a respondent who is above average in Depression has about a 67.5% probability of also being above average in Modesty. We could now start speculating about the causal mechanism that produces this correlation. Maybe bragging (not being modest) reduces the risk of depression. Maybe being depressed lowers the probability of bragging. Maybe it is both and maybe there are third variables at play. However, before we even start down this path, we have to consider the possibility that the sum score labels are misleading and we are not even seeing the correlation between the constructs that we have in mind when we talk about depression and modesty. This question is examined by fitting a measurement model to the items that were used to create the sum scores.
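The translation of a correlation into a probability depends on the conversion that is used. The sketch below compares two conversions and is offered as an aside, not as the author's computation: the 67.5% figure matches the binomial effect size display (0.5 + r/2), whereas an assumption of bivariate normality (an assumption added here) gives 0.5 + arcsin(r)/pi, which is somewhat lower.

```python
import math


def besd_probability(r):
    """Binomial effect size display: probability of being above average on both."""
    return 0.5 + r / 2


def bivariate_normal_probability(r):
    """P(Y above its mean | X above its mean) for bivariate normal X and Y."""
    return 0.5 + math.asin(r) / math.pi


r = 0.35
print(round(besd_probability(r), 3))
print(round(bivariate_normal_probability(r), 3))
```

Either way, the correlation implies a clearly better than chance prediction, so the substantive point about a moderate relationship between the scale scores is not affected by the choice of conversion.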
Of course, the two scales were chosen because a simple measurement model does not fit the data. This is shown with a simplified figure of a measurement model that assumes the 10 items of a scale all reflect a common construct and some random measurement error. The items are summed to reduce the random measurement error so that the sum score mostly reflects the common construct. The main finding is that this simple model does not meet standard criteria of acceptable fit such as a Comparative Fit Index (CFI) greater than .95 or a Root Mean Square Error of Approximation (RMSEA) below .06. Another finding is that the correlation between the factors (i.e., unobserved variables that are assumed to cause the shared variance among items) is even stronger, r = .69, than the correlation among the scales. This would be interpreted as evidence that measurement error reduces the correlation with scales and the correlation among the factors shows the true correlation. However, the model does not fit and the correlation should not be interpreted.
Inspection of the items suggests some reasons why the simple model may not fit and why the positive correlation is at least inflated, if not totally an artifact. For example, the item “Have a low opinion of myself” is used as an item to measure Depression, while the item “Have a high opinion of myself” is reversed and used to measure Modesty (reverse scoring means that low ratings on this item are scored as high modesty). Just looking at the items, we might suspect that they are both measures of low and high self-esteem, respectively. While it is plausible that Depression and Modesty are linked to low self-esteem, it is a problem to use self-esteem items to measure both. This will produce an artificial positive correlation between the scales and lead to the false impression that Depression and Modesty are positively correlated when they are actually unrelated or even negatively related. This is what I call the misch-masch problem of personality measurement. Scales are foremost averages of items and it is not clear what these scales measure if the scales are not properly evaluated with a measurement model.
As items are closer to the level of everyday conversations about personality, it is not difficult to notice other similarities between items. For example, “often feel blue” and “rarely feel blue” are simply oppositely worded questions about the same feeling. These items should correlate more strongly (negatively) with each other than the items “rarely feel blue” and “feel comfortable with myself” do. However, our interpretation of items may differ from the interpretation of the average survey respondent. Thus, we need to examine empirically the pattern of correlations. One reason why personality researchers do not do this is another confusion caused by a bad label. The best statistical tool to explore the pattern of correlations among items is called Confirmatory Factor Analysis. The label “Confirmatory” has led to the false impression that this method can only be used to confirm a theoretical model. But when a model like the simple model in Figure 1 does not fit, we do not have a theory to suggest a more complex model. We could of course explore the data, but the term confirmatory implies that this would be wrong or an abuse of a method that should not be used for exploration. This is pure nonsense. We can use CFA to explore the data, find a plausible model that fits the data, and then confirm this model with a new dataset. We can then also use this model to make new predictions, test them, and if the predictions fail, further revise the model. This is called science and fully in line with Cronbach and Meehl’s (1955) approach to construct validation. Why do I make such a big deal about this? Because my suggestion to use CFA to explore personality data has been met with a lot of resistance by veteran personality psychologists.
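The item-overlap argument can be illustrated with a small simulation. This is only a sketch under made-up assumptions: Depression, Modesty, and low self-esteem are simulated as three independent factors, each scale has 6 items that reflect its own construct and 4 items that reflect low self-esteem, and all loadings are .7. None of these numbers come from the IPIP-300 data.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation for two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item(rng, factor, loading):
    """One standardized item: loading * factor + random measurement error."""
    noise_sd = math.sqrt(1 - loading ** 2)
    return loading * factor + rng.gauss(0, noise_sd)

def simulate_scales(n=5000, n_pure=6, n_shared=4, loading=0.7, seed=1):
    """Return (correlation of sum scores, correlation of the true constructs)."""
    rng = random.Random(seed)
    dep_true, mod_true, dep_sum, mod_sum = [], [], [], []
    for _ in range(n):
        # Three independent factors (hypothetical): the constructs are unrelated.
        dep, mod, low_se = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        dep_items = [item(rng, dep, loading) for _ in range(n_pure)]
        dep_items += [item(rng, low_se, loading) for _ in range(n_shared)]
        mod_items = [item(rng, mod, loading) for _ in range(n_pure)]
        mod_items += [item(rng, low_se, loading) for _ in range(n_shared)]
        dep_true.append(dep)
        mod_true.append(mod)
        dep_sum.append(sum(dep_items))
        mod_sum.append(sum(mod_items))
    return pearson(dep_sum, mod_sum), pearson(dep_true, mod_true)

r_scales, r_constructs = simulate_scales()
print(f"sum scores: r = {r_scales:.2f}, constructs: r = {r_constructs:.2f}")
```

In this toy setup the sum scores correlate positively even though the two constructs were generated to be unrelated. The correlation is produced entirely by the shared self-esteem content.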
In response to a related blog post, William Revelle wrote me an email.
Uli, Inspired by your blog on how one needs to use CFA to do hierarchical models (which is in fact, incorrect), I prepared the enclosed slides. I try to point out that EFA approaches can a) give goodness of fit tests and b) do hierarchical models. In a previous post you suggested that those of us in personality should know some psychometrics and not use simple sum scores. I think you are correct with respect to the first part of your argument, but you might find my paper with Keith Widaman a useful response suggesting that sum scores are not as bad as you think. Your comment about some people (i.e., our Dutch friend) refusing to understand the silliness of a general factor of personality was most accurate. Bill
Bill is right that EFA can sometimes produce the right results, but this is not a good argument to use an inferior method. The key problem of EFA is that it does not require any theory and as a result also does not test a theory. If a model does not fit, researchers cannot change the model because the model is based on a stringent set of mathematical principles that are not based on any substantive theory. In contrast, CFA requires that researchers think about their data and why the model does not fit.
In response to my CFA analysis of Costa and McCrae’s NEO-PI-R, Robert McCrae wrote this response:
Uli I just read your blog on “what lurks beneath”. I must say that I find the blog format disconcerting, both for its informality and its lack of editing and references. But here are a few responses. 1. We certainly agree that people ought to measure facets as well as domains; that personality is not simple structured; that there is some degree of evaluative bias in any single source of data. 2. What we argued in the 1996 paper was that CFA “as it has typically been applied in investigating personality structure, is systematically flawed” (p. 552, italics added). I should think you would agree with that position; you have criticized others for failing to acknowledge secondary loadings and evaluative biases in their CFAs. 3. Why in the world do you think that “CFA is the only method that can be used to test structural theories”? If that were true, I would agree with your position. But the major point of our paper was to offer an alternative confirmatory approach using targeted rotation. There are a number of instances where this method has led to falsification of hypotheses—John’s study of personality in dogs and cats showed that the FFM doesn’t fit even after targeted rotation. 4. I would have liked a comparison with Marsh’s ESEM, which was developed in part in response to our 1996 paper. 5. “The evaluation of model fit was still evolving”. That, I would say, is an understatement. In my experience, most fit indices in SEM and other statistical approaches are essentially as arbitrary as p < .05. There are virtually no empirical tests of the utility of fit indices. And most are treated as dichotomies: A model fits or not. That is like deciding that coefficient alpha should be .70, and throwing out a scale because its alpha is only .69. I recall a paper on national levels of traits in which the authors were told by reviewers not to report the observed means because they could not demonstrate measurement invariance. This is statistically-mandated data suppression.
6. I am not quite convinced by your analysis of evaluative bias in the NEO data. It is really difficult to separate substance from style in mono-method data. One could argue that the factor you call EVB is really N, and vice-versa. I have attached a chapter in which we reported joint factor analyses of self-reports and observer ratings and included bias factors (pp. 280-283). –Jeff
I was fortunate to take a CFA (SEM) course offered by Ralf Schwarzer at the Free University Berlin in the early 1990s. I have been using LISREL, EQS, and now MPLUS for 30 years. I thought the older professors were just too old to learn this method and that attitudes would change. However, Borsboom wrote his attack on bad practices in personality research in 2006, and measurement is still considered a secondary topic in graduate education. This attitude towards measurement has been called a measurement-schmeasurement attitude (Flake & Fried, 2020). It is time to end this embarrassing status quo and to take measurement seriously.
After exploring the data and trying many different models, I settled on a model that fits the data. I then cross-validated this model in the second dataset. Given the large sample sizes, the structure is very robust and the model had nearly identical fit in the second dataset. The model fit of the cross-validated model also met standard fit criteria, CFI = .983, RMSEA = .035. This does not mean that it is the best model. As the data are open, other researchers could try to find better models. Importantly, minor differences between models are not important, as long as the main results are consistent. The model also does not automatically tell us what the 10-item scales measure. This question can only be answered with additional data that relate the factors in the model to other variables. However, we can at least see how items are related to the factors that the scales aimed to measure.
Figure 2 shows that it is possible to describe the correlations among items from the same scale with three factors that are simply labeled Dep1, Dep2, and Dep3 for Depression and Mod1, Mod2, and Mod3 for Modesty. Dep1 is mainly related to feeling blue and depressed. Dep2 is related to low self-esteem. Dep3 is related to two items that might be interpreted as pessimism. Mod1 is related to low self-esteem, Mod2 is about bragging, and Mod3 is about avoiding being the center of attention. As predicted by the similar wording, two self-esteem items of Mod2 are also related to the Dep2 factor. In addition, the Modesty factor is also related to Dep2, presumably because modest participants do not rate themselves lower on self-esteem items. However, there is no relationship to Dep1, the feeling blue factor. Thus, Modesty is not related to feeling depressed, as implied by the Depression label of the scale. In fact, the correlation between the Depression and Modesty factors is now close to zero. Thus, the strong correlation in the bad fitting model and the moderate correlation based on scale scores misrepresent the relationship between Depression and Modesty.
Simple models of two facets are just a building block along the way to testing more complex models of personality. I hope you realize that this is an important step before personality scales can be used for research and before people are given feedback about their personality online. You might be surprised that not all personality psychologists agree. Some personality researchers would rather publish pretty pictures of the models in their heads without checking that they actually fit real data. For example, Colin DeYoung has published this picture to illustrate his ideas about the structure of personality.
This model implies that there should be a negative correlation between the Depression facet of Neuroticism and the Modesty facet of Agreeableness because Stability has a negative relationship with Neuroticism and a positive relationship with Agreeableness (minus times plus = minus). I shared my initial results that showed a positive correlation which contradicts his model (see also our published results by Anusic et al., 2009, that showed problems with the Stability factor).
His final response was:
“Uli, I think the problem is that the actual structure is too complex to make it easily represented in a single CFA model. The point of the pictures is to show only some important aspects of the actual structure. As long as one acknowledges it’s only part of the structure, I don’t see that as a problem.”
To my knowledge he has never attempted to specify his model in more detail to accommodate findings that are inconsistent with this simple model. He also does not seem very eager to explore this question using CFA.
I suppose I could try to create a more complete CFA model, starting from the 10 aspects, which would allow correlations between Enthusiasm and Compassion and between Politeness and Assertiveness, and also would include additional paths from Plasticity and Stability to certain aspects, but even then I’d be wary of claiming it was the complete structure. Whatever might be left out could still easily lead to misfit. It would take a lot of chutzpah to claim that one was confident in understanding all details of the covariance structure of personality traits.
To me this sounds like an excuse for bad fit. The picture gets it right, even if the model does not fit. This is the same argument that was ridiculed by Borsboom’s critique of Costa and McCrae. If models are immune to empirical tests, they are merely figments of researchers’ imagination. To make scientific claims one first needs to pass the first test: show that a model fits the data, and if a simple model does not fit the data, we need to reject the simple model and find a better one. As Revelle pointed out, nowadays EFA software can also show fit indices. What he doesn’t say is that the typical EFA models have bad fit and that there is not much EFA users can do when this is the case. In contrast, CFA can be used to explore the data, find plausible models with good fit, like the one in Figure 2, and then test these models with new data. Call me crazy, but I have the chutzpah and confidence that I can find a well-fitting model for the structure of personality. In fact, I have already done so (Schimmack, 2019), and now I am working on doing the same for the IPIP-300. Stay tuned for the complete results. I hope this post made it clear why it is important to examine this question even for measures that have been used for decades in hundreds of studies.
Post-Script: When a figure says less than zero words
In a further email exchange Colin DeYoung asked me to add the following clarification.
Uli, please add the following quote to your blog post. You are misrepresenting me inasmuch as you are claiming that my theoretical position requires that your model of modesty and depression should show a negative correlation between modesty and depression. This is not true. I would absolutely never predict that, and I think quoting the passage here makes it clear why that is:
“A final note on the hierarchy shown in Fig. 1: It is necessarily an oversimplification at the levels below the Big Five, because personality does not have simple structure (Costa & McCrae, 1992; Hofstee, de Raad, & Goldberg, 1992). Some facets and aspects have associations, not depicted in the figure, with factors in other domains. This is true even between some traits located under different metatraits, which could not be related if the diagram in Fig. 1 were complete. For example, Compassion is positively related to Enthusiasm, and Politeness is negatively related to Assertiveness (DeYoung et al., 2007, 2013).”
Happy to add this to the blog post, but I do have to ask. Is there any finding that you would take seriously to revise your model or is this model basically unfalsifiable?
After all, I also fitted a model without higher-order factors and aspects to the 30 facets. It would be really interesting to do a multi-method study with facet-factors as starting point, but I don’t know a study that did that or any data to do it.
Thanks, Uli. Please do add that text to your blog post as my explanation of the figure.
As that text points out, what you’re calling my “model” is in fact just a summary of various empirical results. It is not, and has never been, intended as a formal CFA model.
[Explanation: The figure uses the symbolic language of causal modeling that links factors (circles) to other factors (circles) with arrows pointing from one factor to another (implying a causal effect or at least a representation of shared variance among factors that are related to a common higher-order factor). It is not clear what this figure could tell readers unless we believe that factors are real and at some point explain a pattern of observed correlations. To say that the model is not a CFA model is to say that the model makes no empirical predictions and that factors like Stability or Plasticity only exist as constructs in Colin’s imagination. Not sure why we should print such imaginary models in a scientific article.]
Psychologists have studied dating (also sometimes called mating by evolutionary psychologists) for 100 years (more or less). We are therefore able to give young, inexperienced novices expert advice. This advice is particularly important for young men because the human mating ritual in many cultures still puts them in the position of the actor who has to initiate a complex mating ritual (Elaine on Seinfeld: “We mostly play defense”). The leading experts from elite universities like Harvard are willing to share their knowledge, but these personalized courses are not yet available, and probably not free.
Fortunately, I am able to provide free expert advice with a brief instructional video that illustrates all the things you should NOT do on a first date. Just do the opposite and you will be fine. Please add further suggestions in the comment section. Advice from classy women is especially welcome.
The ideal model of science is that scientists are well-paid with job security to work collaboratively towards progress in understanding the world. In reality, scientists operate like monarchs in the old days or company CEOs in the modern world. They try to expand their influence as much as possible. This capitalistic model of science could work if there was a market that rewards CEOs of good companies for producing better products at a cheaper price. However, even in the real world, markets are never perfect. In science, there is no market and success is driven much more by factors that have nothing to do with the quality of the product.
The products of empirical scientists often contain a valuable novel contribution, even if the overall product is of low quality. The reason is that empirical psychologists often collect new data. Even this contribution can be useless when the data are not trustworthy, as the replication crisis in social psychology has shown. However, often data are interesting and when shared can benefit other researchers. Scientists who work in non-empirical fields (e.g., mathematicians, philosophers, statisticians) do not have that advantage. Their products are entirely based on their cognitive abilities. Evidently, it is much easier to find some new data than to come up with a novel idea. This creates a problem for non-empirical scientists because it is a lot harder to come up with an empire-expanding novel idea. This can be seen by the fact that the most famous philosophers are still Plato and Aristotle and not some modern philosopher. It can also be seen by the fact that it is hard for psychometricians to compete with empirical researchers for attention and jobs. Many psychology departments have stopped hiring psychometricians because empirical researchers often add more to the university rankings. Case in point, my own university, including all three campuses, is one of the largest departments in the world and does not have a formally trained psychometrician. Thus, my criticism of psychometricians should not be seen as a personal attack. Their unhelpful behaviors can be attributed to a reward structure that rewards unhelpful behaviors, just like Skinner would have predicted on the basis of their reward schedule.
Measurement Models without Substance
A key problem for psychometricians is that they are not rewarded for helping empirical psychologists who work on substantive questions. Rather, they have to make contributions to the field of psychometrics. To have a big impact, it is therefore advantageous to develop methods that can be used by many researchers who work on different research questions. This is like ready-to-wear clothing. The empirical researcher just needs to pick a model and plug the data into the model and the truth comes out at the other end. Many readers will realize that ready-to-wear clothing has its problems. Mainly, it may not fit your body. Similarly, a ready-to-use statistical model may not fit a research question, but users of statistical models who are not trained in statistics may not realize this and psychometricians have no interest in telling them that their model is not appropriate. As a result, we see many articles that uncritically use statistical models that are applied to the wrong data. To avoid this problem, psychometricians would have to work with empirical researchers like tailors who create custom-fitted clothing. This would produce high-quality work, but not the market influence and rewards that ready-to-wear companies can make.
Don’t take my word for it. The most successful contemporary psychometrician said so himself.
“The founding fathers of the Psychometric Society—scholars such as Thurstone, Thorndike, Guilford, and Kelley—were substantive psychologists as much as they were psychometricians. Contemporary psychometricians do not always display a comparable interest with respect to the substantive field that lends them their credibility. It is perhaps worthwhile to emphasize that, even though psychometrics has benefited greatly from the input of mathematicians, psychometrics is not a pure mathematical discipline but an applied one. If one strips the application from an applied science one is not left with very much that is interesting; and psychometrics without the “psycho” is not, in my view, an overly exciting discipline. It is therefore essential that a psychometrician keeps up to date with the developments in one or more subdisciplines of psychology.“ (Borsboom, 2006)
Borsboom has carefully avoided his own advice and became a rock-star for his claims that the founding people of psychometrics were all delusional because they actually believed in substances that could be measured (traits) and developed methods to measure intelligence, personality, or attitudes. Borsboom declared that personality does not exist and the tools that are used to claim they exist like factor analysis are false, and the way researchers present evidence for the existence of psychological substances outlined by two more founding psychometricians (Cronbach & Meehl, 1955) was false. Few psychometricians who gave him an award realized that his Attack of the Psychometricians (Borsboom, 2006) was really an attack of one ego-maniac psychometrician on the entire field. Despite Borsboom’s fame as measured by citations, his attack is largely ignored by substantive researchers who couldn’t care less about somebody who claims their topic of study is just a figment of imagination without any understanding of the substantive area that is being attacked.
A greater problem are psycho-metricians who market statistical tools that applied researchers actually use without understanding them. And that is what this blog-post is really about. So, end of ranting and on to showing how psychometrics without substance can lead to horribly wrong results.
Michael Eid’s Truth Factor
Psychometrics is about measurement and psychological measurement is not different from measurement in other disciplines. First, researchers assume that the world we live in (reality) can be described and understood with models of the world. For example, we assume that there is something real that makes us sometimes sweat, sometimes wear just a t-shirt, and sometimes wear a thick coat. We call this something temperature. Then we set out to develop instruments to measure variation in this construct. We call these instruments thermometers. The challenging step in the development of thermometers is to demonstrate that they measure temperature and that they are good measures of temperature. This step is called validation of a measure. A valid measure measures what it is supposed to measure and nothing else. The natural sciences have made great progress by developing better and better measures of constructs we all take for granted in everyday life like temperature, length, weight, time, etc. (Cronbach & Meehl, 1955). To make progress, psychology would also need to develop better and better measures of psychological constructs such as cognitive abilities, emotions, personality traits, attitudes, and so on.
The basic statistical tool that psychometricians developed to examine validity of psychological measures is factor analysis. Although factor analysis has developed and has become increasingly easy and cheap with the advent of powerful personal computers, the basic idea of factor analysis has remained the same. Factor analysis relates observed measures to unobserved variables that are called factors and estimates the strength of the relationship between the observed variable and the unobserved variable to provide information about the variance in a measure that is explained by a factor. Variance explained by the factor is valid variance if the factor represents the construct that a researcher wanted to measure. Variance that is not explained by a factor represents measurement error. The key problem for substantive researchers is that a factor may not correspond to the construct that they were trying to measure. As a result, even if a factor explains a lot of the variance in a measure, the measure could be a poor measure of a construct. As a result, the key problem for validation research is to justify the claim that a factor measures what it is assumed to measure.
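To make the variance language concrete, here is a minimal sketch of how a standardized factor loading splits the variance of a measure into a part explained by the factor and a residual part. The loadings are made-up values for illustration, and the function name is hypothetical.

```python
def variance_decomposition(loading):
    """Split the variance of a standardized measure (variance = 1).

    Returns (variance explained by the factor, residual variance).
    The explained share is the squared standardized loading.
    """
    explained = loading ** 2
    return explained, 1 - explained

# Hypothetical standardized loadings of three measures on one factor.
for name, loading in [("measure A", 0.8), ("measure B", 0.6), ("measure C", 0.3)]:
    explained, residual = variance_decomposition(loading)
    print(f"{name}: explained = {explained:.2f}, residual = {residual:.2f}")
```

Note that these numbers only say how strongly a measure is related to the factor. Whether the explained variance is valid variance still depends on the substantive argument that the factor corresponds to the intended construct.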
Welcome to Michael Eid’s genius short-cut to the most fundamental challenge in psychometrics. Rather than conducting substantive research to justify the interpretation of a factor, researchers simply declare one measure as a valid measure of a construct. You may think, surely, I am pulling the wool over your eyes and nobody could argue that we can validate measures by declaring them to be valid. So, let me provide evidence for my claim. I start with Eid, Geiser, Koch, and Heene’s (2017) article that is built on the empire-expanding claim that all previous applications of another empire-expanding model, called the bi-factor model, are false and that researchers need to use the authors’ model. This article is flagged as highly cited in Web of Science, showing that this claim has struck fear in applied researchers who were using the bi-factor model.
One problem for applied researchers is that psychometricians are trained in mathematics and use mathematical language in their articles which makes it impossible for applied researchers to understand what they are saying. For example, it would take me a long time to understand what this formula in Eid et al.’s article tries to say.
Fortunately, psychometricians have also developed a simpler language to communicate about their models that uses figures with just four elements that are easy to understand. Boxes represent measured variables where we have actual scores of people in a sample. Circles are unobserved variables where we do not have scores of individuals. Straight and directed arrows imply a causal effect. The key goal of a measurement model is to estimate parameters that show how strong these causal effects are. Finally, there are also curved and undirected paths that reflect a correlation between two variables without assuming causality. This simple language makes it possible for applied researchers to think about the statistical model that they are using to examine validity of their measures. Eid et al.’s Figure 1 shows the bi-factor models they criticize with an example of several cognitive tasks that were developed to measure general intelligence. In this model, general intelligence is an unobserved variable (g). Nothing in the bi-factor model tells us whether this factor really measures intelligence. So, we can ignore this hot-button issue and focus on the question that the bi-factor model actually can answer. Are the tasks that were developed to measure the g-factor good measures of the g-factor? To be a good measure, a measure has to be strongly related to the g-factor. Thus, the key information that applied researchers care about are the parameter estimates for the directed paths from the g-factor to the 9 observed variables. Annoyingly, psychometricians use Greek letters to refer to these parameters. An English term is factor loadings and we could just use L for loading to refer to these parameters, but psychometricians feel more like scientists when they use the Greek letter lambda.
But how can we estimate the strength of an unobserved variable on an observed variable? This sounds like magic or witchcraft and some people have argued that factor analysis is fundamentally flawed and produces illusory causal effects of imaginary substances. In reality, factor analysis is based on the simple fact that causal processes produce correlations. If there are really people who are better at cognitive tasks, they will do better on different tasks, just like athletic people are likely to do better in several different sports. Thus, a common cause will produce correlations between two effects. You may remember this from PSY100 where this is introduced as the third-variable problem. The correlation between height and hair-length (churches and murder rates, etc.) does not reveal a causal effect of height on hair-length or vice versa. Rather, it is produced by a common cause. In this case, gender explains most of the correlation between height and hair-length because men tend to be taller and tend to have shorter hair, producing a negative correlation. Measurement models use the relationship between correlation and causation to infer the strength of common causes on the basis of the strength of correlations among the observed variables. To do so, they assume that there are no direct causal effects of one measure on another. That is, just because we measured your temperature under your armpits before we measured it in your ear and mouth, this does not produce correlations among the three measures of temperature. This assumption is represented in the Figure by the fact that there are no direct relationships among the observed variables. The correlations merely reflect common causes and when three measures of temperature are strongly correlated, it suggests that they are all measuring the same common cause.
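The logic of inferring an unobserved common cause from observed correlations can be sketched with a toy simulation of the temperature example. Everything here is made up for illustration: one common cause, three measures with loadings of .9, .8, and .7, and no direct effects among the measures. Under these assumptions each correlation is the product of two loadings, so a loading can be recovered from the three correlations alone.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation for two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_measures(loadings, n=20000, seed=2):
    """Simulate measures that share one common cause and nothing else."""
    rng = random.Random(seed)
    data = [[] for _ in loadings]
    for _ in range(n):
        common = rng.gauss(0, 1)  # the unobserved variable, e.g. temperature
        for column, loading in zip(data, loadings):
            error = rng.gauss(0, math.sqrt(1 - loading ** 2))
            column.append(loading * common + error)
    return data

true_loadings = [0.9, 0.8, 0.7]  # hypothetical: armpit, ear, mouth
armpit, ear, mouth = simulate_measures(true_loadings)
r12, r13, r23 = pearson(armpit, ear), pearson(armpit, mouth), pearson(ear, mouth)

# With a single common cause, r12 = L1*L2, r13 = L1*L3, r23 = L2*L3,
# so each loading follows from the three observed correlations.
estimated = [
    math.sqrt(r12 * r13 / r23),
    math.sqrt(r12 * r23 / r13),
    math.sqrt(r13 * r23 / r12),
]
print([round(value, 2) for value in estimated])
```

The estimates land close to the loadings that generated the data, although the common cause itself was never observed. Real factor analysis software does the same thing with more measures and a proper estimation method.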
A simple model of g might assume that performance on a cognitive measure is influenced by only two causes. One is the general ability (g) that is represented by the directed arrow from g to the variable that represents variation in a specific task and another due to factors that are unique to this measure (e.g., some people are better at verbal tasks than others). This variance that is unique to a variable is often omitted from figures, but is part of the model in Figure 1.
The problem with this model is that it often does not fit the data. Cognitive performance does not have a simple structure. This means that some measures are more strongly correlated than a model with a single g-factor predicts. Bi-factor models model these additional relationships among measures with additional factors. They are called S1, S2, and S3 (thank god, they didn’t call them sigma or some other Greek name) and S stands for specific. So, the model implies that participants’ scores on a specific measure are caused by three factors: the general factor (g), one of the three specific factors (S1, S2, or S3), and a factor that is unique to a specific measure. The model in Figure 1 is simplistic and may still not fit the data. For example, it is possible that some measures that are mainly influenced by S2 are also influenced a bit by S1 and S3. However, these modifications are not relevant for our discussion, and we can simply assume that the model in Figure 1 fits the data reasonably well.
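What the bi-factor model predicts for the correlations can be written down in a few lines. This is a sketch with made-up numbers: 9 standardized tasks, a g loading of .6 and a specific loading of .5 for every task, three tasks per specific factor, and factors that are uncorrelated with each other. None of these values come from Eid et al.’s article.

```python
def implied_correlation(i, j, g_loadings, s_loadings, groups):
    """Model-implied correlation between two different standardized tasks.

    Assumes orthogonal factors: tasks share g, and they share a specific
    factor only if they belong to the same group.
    """
    shared_g = g_loadings[i] * g_loadings[j]
    shared_s = s_loadings[i] * s_loadings[j] if groups[i] == groups[j] else 0.0
    return shared_g + shared_s

# Hypothetical setup: tasks 0-2 load on S1, 3-5 on S2, 6-8 on S3.
g_loadings = [0.6] * 9
s_loadings = [0.5] * 9
groups = ["S1"] * 3 + ["S2"] * 3 + ["S3"] * 3

same_group = implied_correlation(0, 1, g_loadings, s_loadings, groups)
different_group = implied_correlation(0, 3, g_loadings, s_loadings, groups)
print(f"same specific factor: {same_group:.2f}, different: {different_group:.2f}")
```

Tasks that share a specific factor are predicted to correlate more strongly than tasks that share only g, which is exactly the pattern a single-factor model cannot reproduce.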
From a substantive perspective, it seems plausible that two cognitive measures could be influenced by a general factor (e.g., some students do better in all classes than others) and some specific factors (e.g., some students do better in science subjects). So, while the bi-factor model is not automatically the correct model, it would seem strange to reject it a priori as a plausible model. Yet, this is exactly what Eid et al. (2017) are doing based on some statistical discussion that I honestly cannot follow. All I can say is that from a substantive point of view, a bi-factor model is a reasonable specification of the assumption that cognitive performance can be influenced by general and specific factors and that this model predicts stronger correlations among measures that tap the same specific abilities than measures that share only the general factor as a common cause.
After Eid et al. convinced themselves, reviewers, and an editor at a prestigious journal that their statistical reasoning was sound, they proposed a new way of modeling correlations among cognitive performance measures. They call it the bifactor-(S-1) model. The key difference between this model and the bi-factor model is that the authors remove one of the specific factors from the model; hence, S – 1.
You might say, but what if there is specific variance that contributes to performance on these tasks? If these specific factors exist, they would produce stronger correlations between measures that are influenced by these specific factors and a model without this factor would not fit the data (as well as the model that includes a specific factor that actually exists). Evidently, we cannot simply remove factors willy-nilly without misrepresenting the data. To solve this problem, the bi-factor (S-1) model introduces new parameters that help the model to fit the data as well or better than the bi-factor model.
Figure 4 in Eid et al.’s article makes it possible for readers who are not statisticians to see the difference between the models. First, we see that the S1 factor has been removed. Second, we see that the meaningful factor names (g = general and s = specific) have been replaced by obscure Greek letters where it is not clear what these factors are supposed to represent. The Greek letter tau (I had to look this up) stands for T = true score. Now true score is not a substantive entity. It is just a misleading name for a statistical construct that was created for a measurement theory that is called classic, meaning outdated. So, the bi-factor (S-1) model no longer claims to measure anything in the real world. There is no g-factor that is based on the assumption that some people will perform better on all cognitive tasks that were developed to measure this common factor. There are also no longer specific factors because specific factors are only defined when we first attribute performance to a general factor and see that other factors also have a common effect on subsets of measures. In short, the model is not a substantive model that aims to measure anything real. It is like creating thermometers without assuming that temperature exists. When I discussed this with Michael Eid years ago, he defended this approach to measurement without constructs with a social-constructionist philosophy. The basic idea is that there is no reality and that constructs and measures are social creations that do not require validation. Accordingly, the true score factor measures what a researcher wants to measure. We can simply pick two or three correlated measures and the construct becomes whatever produces variation in these three measures. Other researchers can pick other measures and the factors that produce variation in these measures are the construct. This approach to measurement is called operationalism.
Accordingly, constructs are defined by measures and intelligence is whatever some researchers choose to measure and call intelligence. Operationalism was rejected by Cronbach and Meehl (1955), and this rejection led to the development of measurement models that can be used to examine whether a measure actually measures what it is intended to measure. The bifactor (S-1) model avoids this problem by letting researchers choose measures that define a construct without examining what produces variation in these measures.
“One way to define a G factor in a single-level random experiment is to take one domain as a reference domain. Without loss of generality, we may choose the first domain (k = 1) as reference domain and take the first indicator of this domain (i = 1) as a reference indicator. This choice of the reference domain and indicator indicator depends on a researcher’s theory and goals” (Eid et al., 2017, p. 550).
While the authors are transparent about the arbitrary nature of true scores – what is true variance depends on researchers’ choice of which specific factors to remove – they fail to point out that this model cannot be used to test the validity of measures because there is no longer a claim that factors correspond to real-world objects. Now both the measures and the constructs are constructed and we are just playing around with numbers and models without testing any theoretical claims.
Assuming the bi-factor model fits the data, it is easy to explain what the factors in the bi-factor (S-1) model are and why it fits the data. Because the model removed S1, the true-score factor now represents the g-factor and the S1 factor. The g+S1 factor still predicts variance in the S2 and S3 measures because of the g-variance in the g+S1 factor. However, because the S1-variance in the g+S1 factor is not related to the S2 and S3 measures, the g+S1 factor explains less variance in the S2 and S3 measures than the g-factor in the bi-factor model. The specific factors in the bi-factor (S-1) model with the Greek symbol zeta (ζ) now predict more variance in the S2 and S3 measures because they not only represent the specific variance, but also some of the general factor variance that is not removed by using the contaminated g+S1 factor to account for shared variance among all measures. Finally, because the zeta factors now contain some g-variance that is shared between S2 and S3 measures, the two zeta factors are correlated. Thus, g-variance is split into g-variance in the g+S1 factor and g-variance that is common to the zeta factors.
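This argument can be checked with a few lines of arithmetic. The sketch below assumes that a bi-factor model holds in the population, with standardized measures, uncorrelated factors, and the same hypothetical loadings for every measure (`a` on g, `b` on its specific factor). The function name and the numbers are mine and purely illustrative; this is not Eid et al.'s estimation procedure, just the population algebra of re-expressing one model in terms of the other.

```python
import numpy as np

def s_minus_1_from_bifactor(a, b):
    """Re-express a bi-factor population in bi-factor (S-1) terms.

    a: loading of every measure on g (hypothetical)
    b: loading of every measure on its specific factor (hypothetical)
    The reference factor is what the S1 measures have in common,
    i.e. the standardized composite (a*g + b*S1) / sqrt(a^2 + b^2).
    """
    ref_var = a**2 + b**2
    # Loading of the S1 (reference) measures on the reference factor
    load_ref = np.sqrt(ref_var)
    # Loading of S2/S3 measures on the reference factor:
    # only the g part of g+S1 is shared with them
    load_other = a**2 / load_ref
    # Shared variance within S2 (or S3) left after removing the reference factor
    zeta_var = ref_var - load_other**2
    # Shared variance between S2 and S3 measures left after removing it
    zeta_cov = a**2 - load_other**2
    return load_ref, load_other, zeta_cov / zeta_var

a, b = 0.6, 0.5
load_ref, load_other, zeta_corr = s_minus_1_from_bifactor(a, b)
print(f"S2/S3 loading on g+S1: {load_other:.2f} (vs. {a:.2f} on g)")
print(f"correlation between the zeta factors: {zeta_corr:.2f}")
```

With these illustrative loadings, the S2 and S3 measures load about .46 on the g+S1 factor instead of .60 on g, and the two zeta factors correlate about .37 even though the specific factors in the underlying bi-factor model are uncorrelated. The leftover correlation is the g-variance that the contaminated reference factor failed to remove.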
Eid et al. might object that I assume the g-factor is real and that this may not be the case. However, this is a substantive question and the choice between the bi-factor model and the bi-factor (S-1) model has to be based on broader theoretical considerations and eventually empirical tests of the competing models. To do so, Eid et al. would have to explain why the two zeta-factors are correlated, which implies an additional common cause for S2 and S3 measures. Thus, the empirical question is whether it is plausible to assume that in addition to a general factor that is common to all measures, S2 and S3 have another common cause that is not shared by S1 measures. The key problem is that Eid et al. are not even proposing a substantive alternative theory. Instead, they argue that there are no substantive questions and that researchers can pick any model they want if it serves their goals. “This choice of the reference domain and indicator indicator depends on a researcher’s theory and goals” (p. 550).
If researchers can just pick and choose models, it is not clear why they could not just pick the standard bi-factor model. After all, the bi-factor (S-1) model is just an arbitrary choice to define the general factor in terms of items without a specific factor. What is wrong with choosing to allow all measures to be influenced by specific factors, as in the standard bi-factor model? Eid et al. (2014) claim that this model has several problems. The first claim is that the bi-factor model often produces anomalous results that are not consistent with the a priori theory. However, this is a feature of modeling, not a bug. What are the chances that a priori theories always fit the data? The whole point of science is to discover new things and new things often contradict our a priori notions. However, psychologists seem to be averse to discovery and have created the illusion that they are clairvoyant and never make mistakes. This narcissistic delusion has impeded progress in psychology. Rather than recognizing that anomalies reveal problems with the a priori theory, they blame the method for these results. This is a stupid criticism of models because it is always possible to modify a model and find a model that fits the data. The real challenge in modeling is that often several models fit the data. Bad fit is never a problem of the method. It is a problem of model misspecification. As I showed, proper exploration of data can produce well-fitting and meaningful models with a g-factor (Schimmack, 2022). This does not mean that the g-factor corresponds to anything real, nor does it mean that it should be called intelligence. However, it is silly to argue that we should prefer models without a general factor and simply pick some measures to create constructs that do not even aim to measure anything real.
Another criticism of standard bi-factor models is that the loadings (i.e., the effect sizes of the general factor on measures) are unstable. “That means, for example, that the G-factor of intelligence should stay the same (i.e., “general”) when one takes out four of 10 domains of intelligence” (p. 546). Eid et al. point out that this is not always the case.
“Reise (2012), however, found that the G factor loadings can change when domains are removed. This causes some conceptual problems, as it means that G factors as measured in the bifactor and related models are not generally invariant across different sets of domains used to measure them. This can cause problems, for example, in literature reviews or meta-analyses that summarize data from different studies or in so-called conceptual replications in which different domains were used to measure a given G factor, because the G factors may not be comparable across studies.” (p. 546).
This is nonsense. First of all, the problem that results are not comparable across studies is much greater when researchers just start arbitrarily selecting sets of measures as indicators of the general+S factor because the g+S1, g+S2, g+S3 factors are conceptually different. All real sciences have benefited from unification and standardization of measurement by selecting the best measures. In contrast, only psychologists think we are making progress by developing more and more measures. The use of bi-factor (S-1) models makes it impossible to compare measures because they are all valid measures of researchers’ pet constructs. Thus, use of this model will further impede progress in psychological measurement.
Eid et al. (2014) also exaggerate the extent to which results depend on the choice of measures in the bi-factor model. The more measures are highly correlated and reflect the full range of measures, the more results will be stable and comparable. Moreover, the only reason for notable changes in loadings would be mismeasurement of the general factor because some specific factors were not properly modeled. To support my claim, I used the data from Brunner et al. (2012) who fitted a bi-factor model to 14 measures of g. I randomly split the 14 measures into two sets of 7 and fitted a model with two g-factors and let the two factors correlate. The magnitude of this correlation shows how much inferences about g would depend on the arbitrary selection of measures. The correlation was r = .96 with a 95%CI ranging from .94 to .98. While number-nerds might get a hard-on because they can now claim that results are unstable, p < .05, applied researchers might shrug and think that this correlation is good enough to think they measured the same thing and it is ok to combine results in a meta-analysis.
In sum, the criticism of bi-factor models is all smoke and mirrors to advertise another way of modeling data and to grab market share from the popular bi-factor model that took away market share from hierarchical models. All of this is just a competition among psychometricians to get attention that doesn’t advance actual psychological research. The real psychometric advances are made by psychometricians who created statistical tools for applied researchers like Joreskog, Bentler, and Muthen and Muthen. These tools and substantive theory are all that applied researchers need. The idea that statistical considerations can constrain the choice of models is misleading and often leads to suboptimal and wrong models.
Readers might be a bit skeptical that somebody who doesn’t know the Greek alphabet and doesn’t understand some of the statistical arguments is able to criticize trained psychometricians. After all, they are experts and surely must know better what they are doing. This argument ignores the systemic factors that make them do things that are not in the best interest of science. Making a truly novel and useful contribution to psychometrics is hard and many well-meaning attempts will fail. To make my point, I present Eid et al.’s illustration of their model with a study of emotions. Now, I may not be a master psychometrician, but nobody can say that I lack expertise in the substantive area of emotion research and in attempts to measure emotions. My dissertation in 1997 was about this topic. So, what did Eid et al. (2017) find when they created a bi-factor (S-1) measurement model of emotions?
Eid et al. (2017) examined the correlations among self-reports of 9 specific negative emotions. To fit their model to the data, they used the Anger domain as the reference domain. Not surprisingly, anger, fury and rage had high loadings on the true score factor (falsely called the g-factor) and the other negative emotions had low loadings on this factor. This result makes no sense and is inconsistent with all established models of negative affect. All we really learn from this model is that a factor that is mostly defined by anger also explains a small amount of variance in sadness and self-conscious negative emotions. Moreover, this result is arbitrary and any one of the other emotions could have been used to model the misnamed g-factor. As a result, there is nothing general about the g-factor. It is a specific factor by definition. “The G factor in this model represents anger intensity” (p. 553). But why would we call a specific emotion factor a general factor? This makes no theoretical sense. As a result, this model does not specify any meaningful theory of emotions.
A proper bi-factor or hierarchical model would test the substantive theory that some emotions covary because they share a common feature. The most basic feature of emotions is assumed to be valence. Based on this theory, emotions with the same valence are more likely to co-occur, which results in positive correlations among emotions of the same valence. Hundreds of studies have confirmed this prediction. In addition, emotions also share specific features such as appraisals and action tendencies. Emotions that also share these components are more likely to co-occur than emotions with different or even opposing appraisals. For example, pride and gratitude are based on opposing appraisals of attribution to self or others. A measurement model of emotions might represent these assumptions in a model with one or two general factors for valence (the dimensionality of valence is still debated) and several specific factors. In this model, the general factor has a clear meaning and represents the valence of an emotion. Fitting such a model to the data is useful to test the theory. Maybe the results confirm the model, maybe they don’t. Either way, we learn something about human emotions. But if we fit a model that does not include a factor that represents valence and misleadingly label an anger-factor a general factor, we learn nothing, except that we should not trust psychometricians who build models without substantive expertise. Sadly, Eid actually made good contributions to emotion research in the 1990s that identified broad general factors of affect, which he appears to have forgotten. Accordingly, he would have modeled affect along three general dimensions (Steyer, Schwenkmezger, Notz, & Eid, 1997).
In conclusion, the main point of this blog post is that psychometricians benefit from developing ready-to-use, plug-and-play models that applied researchers can use without thinking about the model. The problem is that measurement requires understanding of the object that is being measured. Thermometers do not measure time and clocks are not good measures of weight. As a result, good measurement requires substantive knowledge and custom models that are fitted to the measurement problem at hand. Moreover, measurement models have to be embedded in a broader model that specifies theoretical assumptions that can be empirically tested (i.e., Cronbach & Meehl’s, 1955, nomological network). The bi-factor (S-1) model is unhelpful because it avoids falsification by letting researchers define constructs in terms of an arbitrary set of items. This may be useful for scientists who want to publish in a culture that values confirmation (bias), but it is not useful for scientists who want to explore the human mind and need valid measures to do so. For these researchers, I recommend learning structural equation modeling from some of the greatest psychometricians who helped researchers like me to test substantive theories, such as Joreskog, Bentler, and now Muthen and Muthen. They provide the tools; you need to provide the theory and the data and be willing to listen to the data when your model does not fit. I learned a lot.