The Misguided Attack of a Meta-Psychometrician

Webster’s online dictionary defines a psychometrician as (a) a person who is skilled in the administration and interpretation of objective psychological tests or (b) a psychologist who devises, constructs, and standardizes psychometric tests.

Neither definition describes Denny Borsboom, who is better described as a theoretical psychometrician, a philosopher of psychometrics, or a meta-psychometrician. The reason is that Borsboom never developed a psychological test. His main contributions to discussions about psychological measurement have been meta-psychological articles that reflect on the methods that psychometricians use to interpret and validate psychological measures and test scores.

Thus, one problem with Borsboom’s (2006) article “The attack of the psychometricians” is that Borsboom is not a psychometrician who is concerned with developing psychological measures. The second problem with the title is the claim that a whole group of psychometricians is ready to attack, while he was mainly speaking for himself. Social psychologists call this the “false consensus” bias, where individuals overestimate how much their attitudes are shared with others.

When I first read the article, I became an instant fan of Borsboom. He shared my frustration with common practices in psychological measurement that persisted despite criticism of these practices by eminent psychologists like Cronbach, Meehl, Campbell, and Fiske in the 1950s. However, it later became apparent that I misunderstood Borsboom’s article. What seemed like a call for improving psychological assessment turned out to be a criticism of the entire enterprise of measuring individuals’ traits. Apparently, psychometricians weren’t just using the wrong methods; they were actually misguided in their beliefs that there are traits that can be measured.

In 2006, Borsboom criticized leading personality psychologists for dismissing results that contradicted their a priori assumptions. When McCrae, Zonderman, Costa, Bond, and Paunonen (1996) used Confirmatory Factor Analysis (CFA) to explore personality structure and their Big Five model didn’t fit the data, they questioned CFA and did not consider the possibility that their measurement model of the Big Five was wrong.

“In actual analyses of personality data [. . .] structures that are known to be reliable [from principal components analyses] showed poor fits when evaluated by CFA techniques. We believe this points to serious problems with CFA itself when used to examine personality structure” (McCrae et al., 1996, p. 563).

I fully agree with Borsboom that it is wrong to dismiss a method simply on the grounds that it does not support a preexisting theory. However, six years later Borsboom made the same mistake and used misfit of a measurement model to jump to the conclusion that the Big Five do not exist (Cramer, van der Sluis, Noordhof, Wichers, Geschwind, Aggen, Kendler, & Borsboom, 2012).

The central tenet of this paper is to consider the misfit of the untweaked model an indication that the latent variable hypothesis fails as an explanation of the emergence of normal personality dimensions, and to move on towards alternative model (p. 417).

This conclusion is as ridiculous as McCrae et al.’s conclusion. After all, how would it be possible that personality items that were created to measure a personality attribute and that were selected to show internal consistency and convergent validity with informant ratings do not reflect a personality trait? It seems more likely that the specified measurement model was wrong and that a different measurement model is needed to fit the data.

The key problem with the measurement model is the ridiculous assumption that all items load only on the intended factor. However, exploratory factor analyses and principal component analyses typically show secondary loadings. Thus, it is not surprising that omitting these secondary loadings from a CFA model produces bad fit. In other words, the key problem in fitting CFA models is that it is difficult to create content-pure items. The problem is not that the Big Five cannot be identified or do not exist.

Confirmatory Factor Analysis of Big Five Data

Idealistic vs. Realistic CFA

The key problems in fitting simple CFA models to data are psychometric assumptions that are neither theory-driven nor plausible. The worst assumption of standard CFA models is that each personality item loads on a single factor. As a result, all loadings on other factors are fixed at zero. To the extent that actual data have secondary loadings, these CFA models will show poor fit.
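The effect of fixing a real secondary loading at zero can be illustrated with a toy calculation. This is only a sketch under stated assumptions: two orthogonal factors, four standardized items, and made-up loadings that are not taken from any dataset. The primary loadings are held fixed rather than re-estimated, so the residuals show the raw discrepancy that a simple-structure model has to absorb somewhere.

```python
# Toy example with hypothetical loadings (not estimates from real data).

def implied_correlations(loadings):
    """Item correlations implied by orthogonal factors: sum of loading products."""
    n_items = len(loadings)
    return [
        [
            1.0 if i == j else sum(a * b for a, b in zip(loadings[i], loadings[j]))
            for j in range(n_items)
        ]
        for i in range(n_items)
    ]

# "Real" structure: the second item has a secondary loading of .30 on factor 2.
true_loadings = [[0.7, 0.0], [0.6, 0.3], [0.0, 0.7], [0.0, 0.6]]
# Simple-structure model: the same secondary loading is fixed at zero.
simple_loadings = [[0.7, 0.0], [0.6, 0.0], [0.0, 0.7], [0.0, 0.6]]

observed = implied_correlations(true_loadings)
predicted = implied_correlations(simple_loadings)

# Residual correlations that the simple-structure model cannot reproduce.
residuals = [
    [observed[i][j] - predicted[i][j] for j in range(4)] for i in range(4)
]
largest_residual = max(abs(value) for row in residuals for value in row)
print(round(largest_residual, 2))
```

In a very large sample, even residual correlations of this size are enough for a chi-square test to reject the model, although the intended factors are clearly present.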

From a theoretical point of view, constraining all secondary loadings to zero makes no sense. Doing so implies that psychometricians are able to create perfect items that reflect only a single factor. In this idealistic scenario, which exists only in the world of meta-psychometricians with simulated data, the null-hypothesis that there are no secondary loadings is true. However, psychometricians who work with real data know that the null-hypothesis is always false (Cohen, 1994). Meehl called this the crud factor. All items will be correlated with all other items, even if these correlations are small and meaningless.

McCrae et al. (1996) made the mistake of interpreting bad fit of the standard CFA model as evidence that CFA cannot be used to study the Big Five. Borsboom and colleagues made the mistake of claiming that bad fit of the standard CFA model implies that the Big Five do not exist. The right conclusion is that the standard CFA model without secondary loadings and without method factors is an idealistic and therefore unrealistic model that will not fit real data. It can only serve as a starting point for exploration to find better models that actually fit the data.

CFA models of the Big Five

Another problem of Borsboom et al.’s (2012) article is that the authors ignored studies that used CFA to model Big Five questionnaires with good fit to actual data. They do not cite Biesanz and West (2004), DeYoung (2006), or Anusic et al. (2009).

All three articles used CFA to model agreement and disagreement in Big Five ratings for self-ratings and informant ratings. The use of multi-method data is particularly useful to demonstrate that Big Five factors are more than mere self-perceptions. The general finding of these studies is that self and informant ratings show convergent validity and can be modeled with five latent factors that reflect the shared variance among raters. In addition, these models showed that unique variances in ratings by a single rater are systematically correlated. The pattern of these correlations suggests an evaluative bias in self-ratings. CFA makes it possible to model this bias as a separate method factor, which is not possible with exploratory factor analysis. Thus, these articles demonstrate the usefulness of examining personality measurement models with CFA and they show that the Big Five are real personality traits that are not mere artifacts of self-ratings.

Anusic et al. (2009) also developed a measurement model for self-ratings. The model assumes that variance in each Big Five item has at least four components: (a) the intended construct variance (valid variance), (b) evaluative bias variance, (c) acquiescence bias variance, and (d) item-specific unique variance.
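The four components can be made concrete with a small calculation. This is a minimal sketch, assuming a standardized item and uncorrelated components, in which case each share of variance is simply the squared loading. The loadings below are hypothetical values chosen for illustration, not results from the model.

```python
def variance_components(trait_loading, halo_loading, acquiescence_loading):
    """Variance shares of one standardized item, assuming uncorrelated components."""
    shares = {
        "construct": trait_loading ** 2,
        "evaluative bias": halo_loading ** 2,
        "acquiescence": acquiescence_loading ** 2,
    }
    # Whatever is left over is item-specific unique variance.
    shares["unique"] = 1.0 - sum(shares.values())
    return shares

# Hypothetical loadings for a single item.
components = variance_components(0.55, 0.25, 0.15)
for name, share in components.items():
    print(f"{name}: {share:.3f}")
```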

Thus, even in 2012 it was wrong to claim that CFA models do not fit Big Five data and to suggest that the Big Five do not exist. This misguided claim could only arise from a meta-psychometric perspective that ignores the substantive literature on personality traits.

Advances in CFA Modeling of Big Five Data

The internet has made it easier to collect and share data. Thus, there are now large datasets with Big Five data. In addition, computing power has increased exponentially, which makes it possible to analyze larger sets of items with CFA.

Beck, Condon, and Jackson published a preprint that examined the structure of personality with a network model and made their data openly available ( ). The dataset contains responses to 100 Big Five items (the IPIP-100) from a total of 369,151 participants who voluntarily provided data on an online website.

As the questionnaire was administered in English, I focused on English-speaking countries for my analyses. I used the Canadian sample for exploratory analyses. This way, the much larger US sample and the samples from Great Britain and Australia can be used for cross-validation.

Out of the 100 items, 78 items were retained for the final model. Two items were excluded because they had low coverage (that is, they were infrequently administered together with other items). The remaining 20 items were excluded because they had low loadings on the primary factor. The 78 retained items were fitted to a model with the Big Five, an acquiescence factor, and an evaluative bias (halo) factor. Secondary loadings and correlated residuals were added by exploring modification indices. Loadings on the halo factor were initially fixed at 1. However, loadings for some items were freed if modification indices suggested that this would improve fit. This made it possible to identify items with high or low evaluative bias.

The precise specification of the model and the full results can be found in the OSF project MPLUS input file ( ). The model had excellent fit using Root Mean Square Error of Approximation as a criterion, RMSEA = .017, 90% CI [.017, .017]. The Comparative Fit Index (CFI) was acceptable, CFI = .940, considering the use of single-item indicators (Anusic et al., 2009). Table 1 shows the factor loadings on the Big Five factors and the two method factors for individual items and for the Big Five scales.

Item | No. | Loadings (left to right)
easily disturbed | 3 | 0.44, -0.25
not easily bothered | 10 | -0.58, -0.12, -0.11, 0.25
relaxed most of the time | 17 | -0.61, 0.19, -0.17, 0.27
change my mood a lot | 25 | 0.55, -0.15, -0.24
feel easily threatened | 37 | 0.50, -0.25
get angry easily | 41 | 0.50, -0.13
get caught up in my problems | 42 | 0.56, 0.13
get irritated easily | 44 | 0.53, -0.13
get overwhelmed by emotions | 45 | 0.62, 0.30
stress out easily | 46 | 0.69, 0.11
frequent mood swings | 56 | 0.59, -0.10
often feel blue | 77 | 0.54, -0.27, -0.12
panic easily | 80 | 0.56, 0.14
rarely get irritated | 82 | -0.52
seldom feel blue | 83 | -0.41, 0.12
take offense easily | 91 | 0.53
worry about things | 100 | 0.57, 0.21, 0.09
hard to get to know | 7 | -0.45, -0.23, 0.13
quiet around strangers | 16 | -0.65, -0.24, 0.14
skilled handling social situations | 18 | 0.65, 0.13, 0.39, 0.15
am life of the party | 19 | 0.64, 0.16, 0.14
don’t like drawing attention to self | 30 | -0.54, 0.13, -0.14, 0.15
don’t mind being center of attention | 31 | 0.56, 0.23, 0.13
don’t talk a lot | 32 | -0.68, 0.23, 0.13
feel at ease with people | 33 | -0.20, 0.64, 0.16, 0.35, 0.16
feel comfortable around others | 34 | -0.23, 0.65, 0.15, 0.27, 0.16
find it difficult to approach others | 38 | -0.60, -0.40, 0.16
have little to say | 57 | -0.14, -0.52, -0.25, 0.14
keep in the background | 60 | -0.69, -0.25, 0.15
know how to captivate people | 61 | 0.49, 0.29, 0.28, 0.16
make friends easily | 73 | -0.10, 0.66, 0.14, 0.25, 0.15
feel uncomfortable around others | 78 | 0.22, -0.64, -0.24, 0.14
start conversations | 88 | 0.70, 0.12, 0.27, 0.16
talk to different people at parties | 93 | 0.72, 0.22, 0.13
full of ideas | 5 | 0.65, 0.32, 0.19
not interested in abstract ideas | 11 | -0.46, -0.27, 0.16
do not have good imagination | 27 | -0.45, -0.19, 0.16
have rich vocabulary | 50 | 0.52, 0.11, 0.18
have a vivid imagination | 52 | 0.41, -
have difficulty imagining things | 53 | -0.48, -0.31, 0.18
difficulty understanding abstract ideas | 54 | 0.11, -0.48, -0.28, 0.16
have excellent ideas | 55 | 0.53, -0.09, 0.37, 0.22
love to read challenging materials | 70 | -0.18, 0.40, 0.23, 0.14
love to think up new ways | 71 | 0.51, 0.30, 0.18
indifferent to feelings of others | 8 | -0.58, -0.27, 0.16
not interested in others’ problems | 12 | -0.58, -0.26, 0.15
feel little concern for others | 35 | -0.58, -0.27, 0.18
feel others’ emotions | 36 | 0.60, 0.22, 0.17
have a good word for everybody | 49 | 0.59, 0.10, 0.17
have a soft heart | 51 | 0.42, 0.29, 0.17
inquire about others’ well-being | 58 | 0.62, 0.32, 0.19
insult people | 59 | 0.19, 0.12, -0.32, -0.18, -0.25, 0.15
know how to comfort others | 62 | 0.26, 0.48, 0.28, 0.17
love to help others | 69 | 0.14, 0.64, 0.33, 0.19
sympathize with others’ feelings | 89 | 0.74, 0.30, 0.18
take time out for others | 92 | 0.53, 0.32, 0.19
think of others first | 94 | 0.61, 0.29, 0.17
always prepared | 2 | 0.62, 0.28, 0.17
exacting in my work | 4 | -0.09, 0.38, 0.29, 0.17
continue until everything is perfect | 26 | 0.14, 0.49, 0.13, 0.16
do things according to a plan | 28 | 0.65, -0.45, 0.17
do things in a half-way manner | 29 | -0.49, -0.40, 0.16
find it difficult to get down to work | 39 | 0.09, -0.48, -0.40, 0.14
follow a schedule | 40 | 0.65, 0.07, 0.14
get chores done right away | 43 | 0.54, 0.24, 0.14
leave a mess in my room | 63 | -0.49, -0.21, 0.12
leave my belongings around | 64 | -0.50, -0.08, 0.13
like order | 65 | 0.64, -0.07, 0.16
like to tidy up | 66 | 0.19, 0.52, 0.12, 0.14
love order and regularity | 68 | 0.15, 0.68, -0.19, 0.15
make a mess of things | 72 | 0.21, -0.50, -0.26, 0.15
make plans and stick to them | 75 | 0.52, 0.28, 0.17
neglect my duties | 76 | -0.55, -0.45, 0.16
forget to put things back | 79 | -0.52, -0.22, 0.13
shirk my duties | 85 | -0.45, -0.40, 0.16
waste my time | 98 | -0.49, -0.46, 0.14

The results show that the selected items have their highest loading on the intended factor and all but two loadings exceed |.4|. Secondary loadings are always lower than the primary loadings and most secondary loadings are below |.2|. Consistent with previous studies, loadings on the acquiescence factor are weak. As acquiescence bias is reduced by including reverse-scored items, the effect of acquiescence on scales is trivial. However, halo bias accumulates, and about 15% of the variance in scale scores is evaluative bias. Secondary loadings produce only negligible correlations between scales. Thus, scales are a mixture of the intended construct and evaluative bias.
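Why acquiescence cancels out while halo accumulates follows from simple arithmetic on sum scores. The sketch below assumes standardized items, uncorrelated factors, a unit-weighted sum after reverse scoring, and hypothetical loadings for a balanced six-item scale. It also assumes that, after reverse scoring, all halo loadings point in the same direction. None of the numbers are taken from Table 1.

```python
def scale_variance_shares(trait, halo, acquiescence):
    """Variance shares of a unit-weighted sum of standardized items.

    With uncorrelated factors, each factor contributes the squared sum of its
    loadings, and unique variances simply add up across items.
    """
    unique = sum(
        1 - t ** 2 - h ** 2 - a ** 2 for t, h, a in zip(trait, halo, acquiescence)
    )
    parts = {
        "construct": sum(trait) ** 2,
        "evaluative bias": sum(halo) ** 2,
        "acquiescence": sum(acquiescence) ** 2,
        "unique": unique,
    }
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}

# Hypothetical loadings after reverse scoring three of the six items.
trait_loadings = [0.6] * 6
halo_loadings = [0.25] * 6
# Reverse scoring flips the sign of the acquiescence loading for three items.
acquiescence_loadings = [0.15, 0.15, 0.15, -0.15, -0.15, -0.15]

scale_shares = scale_variance_shares(
    trait_loadings, halo_loadings, acquiescence_loadings
)
item_halo_share = 0.25 ** 2
print({name: round(share, 3) for name, share in scale_shares.items()})
```

In this toy example the halo share of the scale score is roughly twice the halo share of a single item, while the acquiescence share drops to zero because positive and negative loadings cancel.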

These results show that it is possible to fit a CFA model to a large set of Big Five items and to recover the intended structure. The results also show that sum scores can be used as reasonable proxies of the latent constructs. The main caveat is that scales are contaminated with evaluative bias.

Creating Short Scales

Even on a powerful computer, a model with 78 items takes a lot of time to converge. Thus, it is not very useful for further analyses such as cross-validation across samples or exploring age or gender differences. Moreover, it is unnecessary to measure a latent variable with 18 items, as even 3 or 4 indicators are sufficient to identify a latent construct. Thus, I created short scales with high-loading items and an equal balance of positive and negative items. The goal was to have six items per scale, but for two scales only two negative items were available, so these scales have only five items.
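The claim that three or four indicators suffice can be checked by counting. This is a minimal sketch for a single-factor model in isolation, assuming the factor variance is fixed at 1 so that each indicator contributes one loading and one residual variance as free parameters. The function name is made up for illustration.

```python
def one_factor_degrees_of_freedom(n_indicators):
    """Unique variances and covariances minus free parameters.

    Assumes one factor with variance fixed at 1, so the free parameters are
    one loading and one residual variance per indicator.
    """
    observed_moments = n_indicators * (n_indicators + 1) // 2
    free_parameters = 2 * n_indicators
    return observed_moments - free_parameters

for n in (2, 3, 4, 6):
    print(n, one_factor_degrees_of_freedom(n))
```

A negative value means the factor cannot be identified on its own, zero means it is just identified, and positive values mean the model is testable.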

The results are presented in Table 2. Interestingly, even the results for scales (sum scores) are very similar, suggesting that administering only 28 items provides the same information as the full set of 78 items.

Item | No. | Loadings (left to right)
easily disturbed | 3 | 0.46, -0.21, 0.14
not easily bothered | 10 | -0.66, 0.20, 0.13
relaxed most of the time | 17 | -0.63, 0.22, 0.15
stress out easily | 46 | 0.74, 0.11, -0.19, 0.13
frequent mood swings | 56 | 0.58, -0.11, -0.20, 0.13
quiet around strangers | 16 | -0.69, -0.20, 0.15
at ease with people | 33 | -0.27, 0.65, 0.20, 0.23, 0.13
difficult to approach others | 38 | 0.16, -0.69, -0.19, 0.14
keep in the background | 60 | -0.68, -0.21, 0.14
start conversations | 88 | 0.74, 0.15, 0.22, 0.15
talk to a lot of different people | 93 | 0.72, 0.18, 0.12
full of ideas | 5 | 0.09, 0.54, 0.27, 0.18
don’t have good imagination | 27 | -0.69, -0.22, 0.15
have vivid imagination | 52 | 0.11, 0.72, -
difficulty imagining things | 53 | -0.70, -0.26, 0.18
love to think up new ways | 71 | 0.41, 0.24, 0.16
indifferent to others’ feelings | 8 | -0.61, -0.22, 0.15
not interested in others’ problems | 12 | -0.60, -0.22, 0.15
feel little concern for others | 35 | -0.61, -0.22, 0.15
have a soft heart | 51 | 0.56, 0.23, 0.16
sympathize with others’ feelings | 89 | 0.76, 0.25, 0.17
think of others first | 94 | 0.56, 0.24, 0.16
follow a schedule | 40 | 0.45, 0.20, 0.13
get chores done right away | 43 | 0.64, 0.20, 0.13
leave a mess in my room | 63 | -0.74, -0.17, 0.12
leave my belongings around | 64 | -0.70, -0.18, 0.12
love order and regularity | 68 | 0.20, -0.13, -0.21, 0.41, 0.22, 0.15
forget to put things back | 79 | -0.72, -0.18, 0.12

These results also show that Borsboom’s criticism of scale scores as arbitrary sum scores overstated the problem with commonly used personality measures. While scale scores are not perfect indicators of constructs, sum scores of carefully selected items can be used as proxies of latent variables.

The main problem of sum scores is that they are contaminated with evaluative bias variance. This is not a problem if latent variable models are used because evaluative bias can be separated from Big Five variance. In order to control for evaluative bias with manifest scale scores, it is necessary to regress outcomes on all Big Five traits. As evaluative bias is shared across the Big Five, it is removed from regression coefficients that reflect only the unique contribution of each Big Five trait.
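The logic of controlling for evaluative bias by entering all five scales at once can be sketched with population covariances. Everything here is a hypothetical setup: five independent traits, a single bias factor with the same loading on every scale and on a self-reported outcome, and a true effect of only the first trait. The small `solve` helper is just a stand-in for any regression routine.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

HALO = 0.4         # hypothetical bias loading shared by all scales and the outcome
TRUE_EFFECT = 0.5  # hypothetical effect of trait 1; traits 2 to 5 have no effect

# scale_k = trait_k + HALO * bias, with unit-variance independent traits and bias.
scale_cov = [
    [(1.0 if i == j else 0.0) + HALO ** 2 for j in range(5)] for i in range(5)
]
# outcome = TRUE_EFFECT * trait_1 + HALO * bias + error.
outcome_cov = [TRUE_EFFECT + HALO ** 2] + [HALO ** 2] * 4

# One predictor at a time versus all five scales entered together.
simple_slopes = [outcome_cov[k] / scale_cov[k][k] for k in range(5)]
partial_slopes = solve(scale_cov, outcome_cov)

print([round(b, 3) for b in simple_slopes])
print([round(b, 3) for b in partial_slopes])
```

In this toy setup the spurious coefficient of an irrelevant trait shrinks to about a third of its size when all five scales are entered together. It does not reach exactly zero, because five scales are only an imperfect proxy for the bias factor.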

In conclusion, the results show that McCrae et al. were wrong to dismiss CFA as a method for personality psychologists and Borsboom et al. were wrong to claim that traits do not exist. CFA is ideally suited to create measurement models of personality traits and to validate personality scales. Ideally, these measurement models would use multiple methods such as personality ratings by multiple raters as well as measures of actual behaviors as indicators of personality traits.

Borsboom Comes to His Senses

In 2017, Borsboom and colleagues published another article on the Big Five (Epskamp, Rhemtulla, & Borsboom, 2017). The main focus of the article is to introduce a psychometric model that combines CFA to model the influence of unobserved common causes and network models that describe direct relationships between indicators. The model is illustrated with the Big Five.

As the figure shows, the model is essentially a CFA model (on the left) and a model of correlated residuals that are presented as a network (on the right). The authors note that the CFA model alone does not fit the data and that adding the residual network improves model fit. However, the authors do not explore alternative models with secondary loadings and their model ignores the presence of method factors such as the acquiescence bias and evaluative bias factors identified in the model above (Anusic et al., 2009). Even though the authors consider their model a new psychometric model, adding residual covariances to a structural equation model is not really novel. Most important, the article shows a reversal in Borsboom’s thinking. Rather than claiming that latent factors are illusory, he seems to acknowledge that personality items are best modeled with latent factors.

It is interesting to contrast the 2017 article with the 2012 article. In 2012, Borsboom critiqued personality psychologists for “tweaking the model ‘on the basis of the data’ so that the basic latent variable hypothesis is preserved (e.g. by allowing cross-loadings, exploratory factor analysis with procrustes rotation; see also Borsboom, 2006 for an elaborate critique).” However, in 2017, Borsboom added a purely data-driven network of residual correlations to produce model fit. This shows a major shift in Borsboom’s thinking about personality measurement.

I think the 2017 article is a move in the right direction. All we need to do is to add method factors and secondary loadings to the model and Borsboom’s model in 2017 converges with my measurement model of the Big Five.

Where are we now?

A decade has passed since Borsboom marshaled his attack, but most of the problems that triggered Borsboom’s article remain the same. In part, Borsboom is to blame for this lack of progress because his attack was directed at the existence of traits as opposed to bad practices in measuring them.

The most pressing concern remains the deep-rooted tradition in psychology to work with operational definitions of constructs. That is, constructs are mere labels or vague statements that refer to a particular measure. Subjective well-being is the sum score on the Satisfaction with Life Scale; self-esteem is the sum score of Rosenberg’s 10 self-esteem items; and implicit bias is the difference score between reaction times on the Implicit Association Test. At best, it is recognized that observed scores are unreliable, but the actual correspondence between constructs and measures is never challenged.

This is what distinguishes psychology from natural sciences. Even though valid measurement is a fundamental requirement for empirical data to be useful, psychologists pay little attention to the validity of their measures. Implicitly, psychological research is conducted as if psychological measures are as valid as measures of height, weight, or temperature. However, in reality psychological measures have much lower validity. To make progress as a science, psychologists need to pay more attention to the validity of their measures. Borsboom (2006) was right about the obstacles in the way towards this goal.

Operationalism Rules

There is no training in theory construction or formalizing measurement models. Measures are considered valid if they have face validity and reliability. Many psychology programs have no psychometricians and do not offer courses in psychological measurement. Measures are used because others used them before and somebody said at some point that the measure has been validated. This has to stop. Every valid measure requires a measurement theory, and the assumptions made by a measurement theory need to be tested. The measurement theory has to be specified as a formal model that can be tested. No measure without a formal measurement model should be considered a valid measure. Most important, it is not sufficient to rely on mono-method data because mono-method data always have method-specific variance. A multi-method approach is needed to separate construct variance from systematic method variance (Campbell & Fiske, 1959).

Classical Test Theory

Classical test theory may be sufficient for very specific applications such as performance or knowledge in a well-defined domain (e.g., multiplication, knowing facts about Canada). However, classical test theory does not work for abstract concepts like valuing power, achievement motivation, prejudice, or well-being. Students need to learn about latent variable models that can relate theoretical constructs to observed measures.

The Catch-All of Construct Validity

Borsboom correctly observes that psychologists lack a clear understanding of validation research.

“Construct validity functions as a black hole from which nothing can escape: Once a question gets labeled as a problem of construct validity, its difficulty is considered superhuman and its solution beyond a mortal’s ken.”

However, he doesn’t provide a solution to the problem, and blames the inventors of construct validity for it. I vehemently disagree. Cronbach and Meehl (1955) not only coined the term construct validity; they also outlined a clear program of research that is required to validate measures and to probe the meaning of constructs. The problem is that psychologists never followed their recommendations. To improve psychological science, psychologists must learn to create formal measurement models (nomological networks) and test them with empirical data. No matter how poor and simplistic these models are in the beginning, they are needed to initiate the process of construct validation. As data become available, measures and constructs need to be revised to accommodate new evidence. In this sense, construct validation is a never-ending process that is still ongoing even in the natural sciences (the definition of a meter was just changed); but just because the process is never-ending doesn’t mean it should never be started.

Psychometrics is Risky

Borsboom correctly notes that it is not clear who should do actual psychometric work. Psychologists do not get rewarded for validation research because developing valid measures is not sexy, while showing unconscious bias with invalid measures is. Thus, measurement articles are difficult to publish in psychology journals. Actual psychometric work is also difficult to publish in method journals that focus on mathematical and statistical developments and do not care about applications to specific content areas. Finally, assessment journals focus on clinical populations and are not interested in measures of individual differences in normal populations. Thus, it is difficult to publish validation studies. To address this problem, it is important to demonstrate that psychology has a validation problem. Only when researchers realize that they are using measures with unknown or low validity will journals have an incentive to publish validation studies. As long as psychologists believe that any reliable sum score is valid, there is no market for validation studies.

It Shouldn’t Be Too Difficult

Psychologists would like to have the same status as the natural sciences or economics. However, students in these areas often have to learn complex techniques and math. In comparison, psychology is easy, which is partly the appeal. However, making scientific progress in psychology is a lot harder than it seems. Many students lack the training to do the hard work that would be required to move psychology forward. Structural equation modeling, for example, is not taught and many students would not know how to develop a measurement model and how to test it. They may learn how to fit a cookie-cutter model to a dataset, but if the data do not fit the model, they would not know what to do. To make progress, training has to take measurement more seriously and prepare students to evaluate and modify measurement models.

But It’s Not in SPSS!

At least on this front, progress has been made. Many psychologists ran a principal components analysis because this was the default option in SPSS. There were always other options, but users didn’t know the differences and stuck with the default option. Now a young generation is trained to use R, and structural equation modeling is freely available with the lavaan package. Thus, students have access to statistical tools that were not as easily available a decade ago.

Thou Shalt Not. . .

Theoretical psychometricians are a special type of personality. Many of them would like to be as far away as possible from any applications to real data that never meet the strict assumptions of their models. This makes it difficult for them to teach applied researchers how to use psychometric models effectively. Ironically, Borsboom himself imposed the criterion of simple structure and local independence on data to argue that CFA is not appropriate for personality data. But if two items do not have local independence (e.g., because they are antonyms), it doesn’t imply that a measurement model is fundamentally flawed. The main advantage of structural equation modeling is that it is possible to test the assumption of local independence and to modify the model if it is violated. This is exactly what Borsboom did in the 2017 article. The idea that you shall not have correlated residuals in your models is unrealistic and not useful in applied settings. Thus, we need real psychometricians who are interested in the application of psychometric models to actual data. They care deeply about the substance area and want to apply models to actual messy data to improve psychological measurement. Armchair criticism from meta-psychometricians is not going to move psychology forward.

Sample Size Issues

Even in 2006, sample size was an issue for the use of psychometric models. However, sample sizes have increased tremendously thanks to online surveys. There are now dozens of Big Five datasets with thousands of respondents. The IAT has been administered to millions of volunteers. Thus, sample size is no longer an issue and it is possible to fit complex measurement models to real data.

Substantive Factors

Borsboom again picks personality psychology to make his point.

“For instance, personality traits are usually taken to be continuously structured and conceived of as reflective latent variables (even though the techniques used do not sit well with this interpretation). The point, however, is that there is nothing in personality theory that motivates such a choice, and the same holds for the majority of the subdisciplines in psychology.”

This quote illustrates the problem of meta-psychometricians. They are not experts in a substantive area and are often unaware of substantive facts that may motivate a specific measurement model. Borsboom seems to be unaware that psychologists have tried to find personality types, but that dimensional models won because it was impossible to find clearly defined types. Moreover, people have no problem rating their personality along quantitative scales and indicating that they are slightly or strongly interested in art or that they are worried sometimes or often. Not to mention that personality traits show evidence of heritability and that we would expect an approximately normal distribution for traits that are influenced by multiple randomly combined genes (e.g., height).
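The expectation of an approximately normal distribution can be illustrated with a small simulation. This is a deliberately crude sketch: 100 hypothetical genes, each contributing 0 or 1 with equal probability and combining additively and independently. Real genetic architectures are more complicated, so this only shows the central-limit logic behind the argument.

```python
import random
from statistics import mean, pstdev

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

N_GENES = 100      # hypothetical number of additive genes
N_PEOPLE = 20000   # simulated individuals

# Each bit of a random 100-bit integer stands for one gene variant (0 or 1),
# so the number of 1-bits is the sum of 100 independent coin flips.
trait_scores = [bin(rng.getrandbits(N_GENES)).count("1") for _ in range(N_PEOPLE)]

average = mean(trait_scores)
spread = pstdev(trait_scores)
# Skewness is zero for a symmetric, bell-shaped distribution.
skewness = mean(((score - average) / spread) ** 3 for score in trait_scores)

print(round(average, 1), round(spread, 1), round(skewness, 2))
```

The simulated scores vary continuously around the mean without forming distinct clusters, which is the pattern a dimensional model assumes and a type model does not.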

Thus, to make progress, we need psychologists who have substantive knowledge and statistical knowledge in order to develop and improve measurement models of personality or other constructs. What we do not need are meta-psychologists without substantive knowledge who comment on substantive issues.

Read and Publish Widely

Borsboom also gives some good advice for psychometricians.

The founding fathers of the Psychometric Society—scholars such as Thurstone, Thorndike, Guilford, and Kelley—were substantive psychologists as much as they were psychometricians. Contemporary psychometricians do not always display a comparable interest with respect to the substantive field that lends them their credibility. It is perhaps worthwhile to emphasize that, even though psychometrics has benefited greatly from the input of mathematicians, psychometrics is not a pure mathematical discipline but an applied one. If one strips the application from an applied science one is not left with very much that is interesting; and psychometrics without the “psycho” is not, in my view, an overly exciting discipline. It is therefore essential that a psychometrician keeps up to date with the developments in one or more subdisciplines of psychology.

I couldn’t agree more and I invite Denny to learn more about personality psychology, if he wants to make some contribution to the measurement of personality. The 2017 paper is a step in the right direction. Finding the Big Five in a questionnaire that was developed to measure the Big Five is a first step. Developing a measurement model of personality and assessing validity with multi-method data is a task that is worthwhile attacking in the next decade.

Well-Being Science

Happiness has become a big topic in the social sciences. Many universities offer happiness courses that teach how to be happier. Many of the exercises that are being taught in these courses are not based on evidence of effectiveness. I am teaching a different course. The course is an introduction to the science of well-being. The aim of this course is to provide an overview of the empirical research on well-being.

Textbooks that cover well-being science are often written by textbook writers who are not experts on the topic. They are often pretty bad. A better alternative is the free textbook published by well-being expert Ed Diener on his free textbook site Noba publishing (link).

For my students, I wrote my own textbook. It is still a work in progress, but given the costly alternatives, I decided to make it public. I am always looking for ways to improve it and to correct it. Feel free to provide comments in the comment section or by email.

Wellbeing Science: In Search of the Good Life (Ulrich Schimmack)

Empowering the Underpowered Study

A recent article with the title “Is the Power Threshold of 0.8 Applicable to Surgical Science? Empowering the Underpowered Study” is being discussed on social media (e.g., Gelman blog).

Neither the authors nor the critics appear to be familiar with the statistical concept of power that is being discussed.

The article mentions Jacob Cohen as the pioneer of power analysis only to argue that his recommendation that studies should have 80% power is not applicable to surgical science.

They apparently didn’t read the rest of Cohen’s book on power analysis or any other textbook about statistical power.

Let’s first define statistical power. Statistical power is the long-run proportion of studies with a statistically significant result that one can expect given the sample size and population effect size of a study and the criterion for statistical significance.

Given this definition of power, we can ask whether an 80% success rate is too high and what success rate would be more applicable in studies with small sample sizes. Assuming that sample sizes are fixed by low frequency of events and effect sizes are not under the control of a researcher, we might simply have to accept that power is only 50% or only 20%. There is nothing we can do about it.

What are the implications of conducting significance tests with 20% power? 80% of the studies will produce a type-II error; that is, the test cannot reject the null-hypothesis (e.g., two surgical treatments are equally effective), when the null-hypothesis is actually false (one surgical procedure is better than another). Is it desirable to have an error rate of 80% in surgery studies? This is what the article seems to imply, but it is unlikely that the authors would actually agree with this, unless they are insane.
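To make the 20% figure concrete, here is a minimal sketch of a power calculation for a two-group comparison. It relies on a normal approximation and assumes that the sampling error of Cohen's d is 2/sqrt(N). The function name and the example values (d = 0.5, N = 20) are illustrative assumptions and are not taken from the target article.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_two_sided(d, n_total, alpha=0.05):
    """Approximate power of a two-sided test for a two-group design.

    Assumes a normal approximation with sampling error 2 / sqrt(n_total)
    for the standardized mean difference d.
    """
    noncentrality = d / (2 / sqrt(n_total))
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region.
    return (STD_NORMAL.cdf(noncentrality - z_crit)
            + STD_NORMAL.cdf(-noncentrality - z_crit))


# Hypothetical small study: medium effect, 20 patients in total.
power = power_two_sided(d=0.5, n_total=20)
type_2_error = 1 - power
print(f"power = {power:.2f}, type-II error rate = {type_2_error:.2f}")
```

Under these assumptions, a study with 20 patients and a medium effect has roughly 20% power at alpha = .05, so about four out of five such studies would miss a real difference.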

So, what the authors are really trying to say is probably something like “some data are better than no data and we should be able to report results even if they are based on small samples.” The authors might be surprised that many online trolls would agree with them, while they vehemently disagree with the claim that we can empower studies with small samples by increasing the type-II error rate.

What Cohen really said was that researchers should balance the type-I error risk (concluding that one surgical procedure is better than the other when both procedures are actually approximately equally effective) and the type-II error risk (the reverse error).

To balance error probabilities, researchers should set the criterion for statistical significance according to the risk of drawing false conclusions (Lakens et al., 2018). In small samples with modest effect sizes, a reasonable balance of type-I and type-II errors is achieved by increasing the type-I risk from the standard criterion of alpha = .05 to, say, alpha = .20, or if necessary even to alpha = .50.

Changing alpha is the only way to empower small studies to produce significant results. Somehow the eight authors, the reviewers, and the editor of the target article missed this basic fact about statistical power.
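The trade-off between alpha and power can be sketched in a few lines. The example below uses the same normal approximation for a two-group design (sampling error of d assumed to be 2/sqrt(N)) and searches for the alpha at which the type-I and type-II error risks are equal. Treating both errors as equally costly is my simplifying assumption for the illustration, and the study values (d = 0.5, N = 20) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(d, n_total, alpha):
    """Approximate two-sided power under a normal approximation."""
    noncentrality = d / (2 / sqrt(n_total))
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (STD_NORMAL.cdf(noncentrality - z_crit)
            + STD_NORMAL.cdf(-noncentrality - z_crit))


def balanced_alpha(d, n_total, tol=1e-9):
    """Find the alpha that equals the type-II error rate, by bisection."""
    low, high = 1e-6, 1 - 1e-6
    while high - low > tol:
        mid = (low + high) / 2
        beta = 1 - power_two_sided(d, n_total, mid)
        if mid < beta:
            low = mid    # alpha is still smaller than beta: raise alpha
        else:
            high = mid   # alpha already exceeds beta: lower alpha
    return (low + high) / 2


# Power of the hypothetical study at three significance criteria.
for alpha in (0.05, 0.20, 0.50):
    print(f"alpha = {alpha:.2f}: power = {power_two_sided(0.5, 20, alpha):.2f}")

alpha_star = balanced_alpha(0.5, 20)
print(f"balanced alpha = {alpha_star:.2f}")
```

For this hypothetical study the balanced criterion lands between .35 and .40, which is in the range of relaxed alpha levels discussed above.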

In conclusion, the article is another example that applied researchers receive poor training in statistics and that the concept of statistical power is poorly understood. Jacob Cohen made an invaluable contribution to statistics by popularizing Neyman-Pearson’s extension of null-hypothesis testing by considering type-II error probabilities. However, his work is not finished and it is time for statistics textbooks and introductory statistics courses to teach statistical power so that mistakes like this article will not happen again. Nobody should think that it is desirable to run studies with less than 50% power (Tversky & Kahneman, 1971). Setting alpha to 5% even if this implies that a study has a high chance of producing a type-II error is insane and may even be considered unethical, especially in surgery where a better procedure may save lives.

Where Do Non-Significant Results in Meta-Analysis Come From?

It is well known that focal hypothesis tests in psychology journals nearly always reject the null-hypothesis (Sterling, 1959; Sterling et al., 1995). However, meta-analyses often contain a fairly large number of non-significant results. To my knowledge, the emergence of non-significant results in meta-analysis has not been examined systematically (happy to be proven wrong). Here I used the extremely well-done meta-analysis of money priming studies to explore this issue (Lodder, Ong, Grasman, & Wicherts, 2019).

I downloaded their data and computed z-scores by (1) dividing Cohen’s d by sampling error (2/sqrt(N)) to compute t-values, (2) converting the absolute t-values into two-sided p-values, and (3) converting the p-values into absolute z-scores. The z-scores were submitted to a z-curve analysis (Brunner & Schimmack, 2019).
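The three conversion steps can be sketched as follows. This is an illustration rather than the script used for the analysis: it assumes df = N - 2 for a two-group design, and it obtains the t-distribution p-value by simple numerical integration so that only the Python standard library is needed. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value for t, using Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    # The floor avoids p = 0; it caps the resulting z-scores at about 8.
    return min(1.0, max(1.0 - central_mass, 1e-15))


def d_to_z(d, n_total):
    """Convert Cohen's d and total sample size into an absolute z-score."""
    t = d / (2 / math.sqrt(n_total))          # step 1: d -> t
    p = two_sided_p(t, df=n_total - 2)         # step 2: |t| -> two-sided p
    return abs(NormalDist().inv_cdf(p / 2))    # step 3: p -> absolute z


# Hypothetical study: d = 0.5 with N = 100 gives t = 2.5.
print(round(d_to_z(0.5, 100), 2))
```

In small samples the z-score is somewhat smaller than the t-value because the t-distribution has heavier tails. With large samples the two are practically identical.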

The first figure shows the z-curve for all test-statistics. Out of 282 tests, only 116 (41%) are significant. This finding is surprising, given the typical discovery rates over 90% in psychology journals. The figure also shows that the observed discovery rate of 41% is higher than the expected discovery rate of 29%, although the difference is relatively small and the confidence intervals overlap. This might suggest that publication bias in the money priming literature is not a serious problem. On the other hand, meta-analysis may mask the presence of publication bias in the published literature for a number of reasons.

Published vs. Unpublished Studies

Publication bias implies that studies with non-significant results end up in the proverbial file-drawer. Meta-analysts try to correct for publication bias by soliciting unpublished studies. The money-priming meta-analysis included 113 unpublished studies.

Figure 2 shows the z-curve for these studies. The observed discovery rate is slightly lower than for the full set of studies, 29%, and more consistent with the expected discovery rate, 25%. Thus, this set of studies appears to be unbiased.

The complementary finding for published studies (Figure 3) is that the observed discovery rate increases, 49%, while the expected discovery rate remains low, 31%. Thus, published articles report a higher percentage of significant results without more statistical power to produce significant results.

A New Type of Publications: Independent Replication Studies

In response to concerns about publication bias and questionable research practices, psychology journals have become more willing to publish null-results. An emerging format is the pre-registered replication study with the explicit aim of probing the credibility of published results. The money priming meta-analysis included 47 independent replication studies.

Figure 4 shows that independent replication studies had a very low observed discovery rate, 4%, that is matched by a very low expected discovery rate, 5%. It is remarkable that the discovery rate for replication studies is lower than the discovery rate for unpublished studies. One reason for this discrepancy is that significance alone is not sufficient to get published and authors may be selective in the sharing of unpublished results.

Removing independent replication studies from the set of published studies further increases the observed discovery rate, 66%. Given the low power of replication studies, the expected discovery rate also increases somewhat, but it is notably lower than the observed discovery rate, 35%. The difference is now large enough to be statistically significant, despite the rather wide confidence interval around the expected discovery rate estimate.

Coding of Interaction Effects

After a (true or false) effect has been established in the literature, follow-up studies often examine boundary conditions and moderators of an effect. Evidence for moderation is typically demonstrated with interaction effects that are sometimes followed by contrast analysis for different groups. One way to code these studies would be to focus on the main effect and to ignore the moderator analysis. However, meta-analysts often split the sample and treat different subgroups as independent samples. This can produce a large number of non-significant results because a moderator analysis allows for the fact that the effect emerged only in one group. The resulting non-significant results may provide false evidence of honest reporting of results because bias tests rely on the focal moderator effect to examine publication bias.

The next figure is based on studies that involved an interaction hypothesis. The observed discovery rate, 42%, is slightly higher than the expected discovery rate, 25%, but bias is relatively mild and interaction effects contribute 34 non-significant results to the meta-analysis.

The analysis of the published main effect shows a dramatically different pattern. The observed discovery rate increased to 56/67 = 84%, while the expected discovery rate remained low with 27%. The 95%CI do not overlap, demonstrating that the large file-drawer of missing studies is not just a chance finding.

I also examined more closely the 7 non-significant results in this set of studies.

  1. Gino and Mogilner (2014) reported results of a money priming study with cheating as the dependent variable. There were 98 participants in 3 conditions. Results were analyzed with percentage of cheating participants and extent of cheating. The percentage of cheating participants produced a significant contrast of the money priming and control condition, chi2(1, N = 65) = 3.97. However, the meta-analysis used the extent of cheating dependent variable, which showed only a marginally significant effect with a one-tailed p-value of .07. “Simple contrasts revealed that participants cheated more in the money condition (M = 4.41, SD = 4.25) than in both the control condition (M = 2.76, SD = 3.96; p = .07) and the time condition (M = 1.55, SD = 2.41; p = .002).” Thus, this non-significant result was presented as supporting evidence in the original article.
  2. Jin, Z., Shiomura, K., & Jiang, L. (2015) conducted a priming study with reaction times as dependent variables. This design is different from social priming studies in the meta-analysis. Moreover, money priming effects were examined within-participants, and the study produced several significant complex interaction effects. Thus, this study also does not count as a published failure to replicate money priming effects.
  3. Mukherjee, S., Nargundkar, M., & Manjaly, J. A. (2014) examined the influence of money primes on various satisfaction judgments. Study 1 used a small sample of N = 48 participants with three dependent variables. Two achieved significance, but the meta-analysis aggregated across DVs, which resulted in a non-significant outcome. Study 2 used a larger sample and replicated significance for two outcomes. It was not included in the meta-analysis. In this case, aggregation of DVs explains a non-significant result in the meta-analysis, while the original article reported significant results.
  4. I was unable to retrieve this article, but the abstract suggests that the article reports a significant interaction. “We found that although money-primed reactance in control trials in which the majority provided correct responses, this effect vanished in critical trials in which the majority provided incorrect answers.”
  5. Wierzbicki, J., & Zawadzka, A. (2014) published two studies. Study 1 reported a significant result. Study 2 added a non-significant result to the meta-analysis. Although the effect for money priming was not significant, this study reported a significant effect for credit-card priming and a money priming x morality interaction effect. Thus, the article also did not report a money-priming failure as the key finding.
  6. Gasiorowska, A. (2013) is an article in Polish.
  7. This entry is a duplicate of article 5.

In conclusion, none of the 7 studies with non-significant results in the meta-analysis that were published in a journal reported that money priming had no effect on a dependent variable. All articles reported some significant results as the key finding. This further confirms how dramatically publication bias distorts the evidence reported in psychology journals.


In this blog post, I examined the discrepancy between null-results in journal articles and in meta-analysis, using a meta-analysis of money priming. While the meta-analysis suggested that publication bias is relatively modest, published articles showed clear evidence of publication bias with an observed discovery rate of 89%, while the expected discovery rate was only 27%.

Three factors contributed to this discrepancy: (a) the inclusion of unpublished studies, (b) independent replication studies, and (c) the coding of interaction effects as separate effects for subgroups rather than coding the main effect.

After correcting for publication bias, expected discovery rates are consistently low with estimates around 30%. The main exception is the set of independent replication studies that found no evidence at all. Overall, these results confirm that published money priming studies and other social priming studies cannot be trusted because the published studies overestimate replicability and effect sizes.

It is not the aim of this blog post to examine whether some money priming paradigms can produce replicable effects. The main goal was to explain why publication bias in meta-analysis is often small, when publication bias in the published literature is large. The results show that several factors contribute to this discrepancy and that the inclusion of unpublished studies, independent replication studies, and coding of effects explain most of these discrepancies.

The Implicit Association Test: A Measure in Search of a Construct (in press, PoPS)

Here is a link to the manuscript, data, and MPLUS scripts for reproducibility.


Greenwald et al. (1998) proposed that the IAT measures individual differences in implicit social cognition.  This claim requires evidence of construct validity. I review the evidence and show that there is insufficient evidence for this claim.  Most important, I show that few studies were able to test discriminant validity of the IAT as a measure of implicit constructs. I examine discriminant validity in several multi-method studies and find no or weak evidence for discriminant validity. I also show that validity of the IAT as a measure of attitudes varies across constructs. Validity of the self-esteem IAT is low, but estimates vary across studies.  About 20% of the variance in the race IAT reflects racial preferences. The highest validity is obtained for measuring political orientation with the IAT (64% valid variance).  Most of this valid variance stems from a distinction between individuals with opposing attitudes, while reaction times contribute less than 10% of variance in the prediction of explicit attitude measures.  In all domains, explicit measures are more valid than the IAT, but the IAT can be used as a measure of sensitive attitudes to reduce measurement error by using a multi-method measurement model.

Keywords:  Personality, Individual Differences, Social Cognition, Measurement, Construct Validity, Convergent Validity, Discriminant Validity, Structural Equation Modeling


Despite its popularity, relatively little is known about the construct validity of the IAT.

As Cronbach (1989) pointed out, construct validation is better examined by independent experts than by authors of a test because “colleagues are especially able to refine the interpretation, as they compensate for blind spots and capitalize on their own distinctive experience” (p. 163).

It is of utmost importance to determine how much of the variance in IAT scores is valid variance and how much of the variance is due to measurement error, especially when IAT scores are used to provide individualized feedback.

There is also no consensus in the literature whether the IAT measures something different from explicit measures.

In conclusion, while there is general consensus to make a distinction between explicit measures and implicit measures, it is not clear what the IAT measures.

To complicate matters further, the validity of the IAT may vary across attitude objects. After all the IAT is a method, just like Likert scales are a method, and it is impossible to say that a method is valid (Cronbach, 1971).

At present, relatively little is known about the contribution of these three parameters to observed correlations in hundreds of mono-method studies.

A Critical Review of Greenwald et al.’s (1998) Original Article

In conclusion, the seminal IAT article introduced the IAT as a measure of implicit constructs that cannot be measured with explicit measures, but it did not really test this dual-attitude model.

Construct Validity in 2007

In conclusion, the 2007 review of construct validity revealed major psychometric challenges for the construct validity of the IAT, which explains why some researchers have concluded that the IAT cannot be used to measure individual differences (Payne et al., 2017).  It also revealed that most studies were mono-method studies that could not examine convergent and discriminant validity.

Cunningham, Preacher and Banaji (2001)

Another noteworthy finding is that a single factor accounted for correlations among all measures on the same occasion and across measurement occasions. This finding shows that there were no true changes in racial attitudes over the course of this two-month study.  This finding is important because Cunningham et al.’s (2001) study is often cited as evidence that implicit attitudes are highly unstable and malleable (e.g., Payne et al., 2017). This interpretation is based on the failure to distinguish random measurement error and true change in the construct that is being measured (Anusic & Schimmack, 2016).  While Cunningham et al.’s (2001) results suggest that the IAT is a highly unreliable measure, the results also suggest that the racial attitudes that are measured with the race IAT are highly stable over periods of weeks or months. 

Bar-Anan & Vianello, 2018

This large study of construct validity also provides little evidence for the original claim that the IAT measures a new construct that cannot be measured with explicit measures, and confirms the estimate from Cunningham et al. (2001) that about 20% of the variance in IAT scores reflects variance in racial attitudes.

Greenwald et al. (2009)

“When entered after the self-report measures, the two implicit measures incrementally explained 2.1% of vote intention variance, p=.001, and when political conservativism was also included in the model, the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05”  (Greenwald et al., 2009, p. 247).

I tried to reproduce these results with the published correlation matrix and failed to do so. I contacted Anthony Greenwald, who provided the raw data, but I was unable to recreate the sample size of N = 1,057. Instead I obtained a similar sample size of N = 1,035.  Performing the analysis on this sample also produced non-significant results (IAT: b = -.003, se = .044, t = .070, p = .944; AMP: b = -.014, se = .042, t = 0.344, p = .731).  Thus, there is no evidence for incremental predictive validity in this study.

Axt (2018)

With N = 540,723 respondents, sampling error is very small, σ = .002, and parameter estimates can be interpreted as true scores in the population of Project Implicit visitors.  A comparison of the factor loadings shows that explicit ratings are more valid than IAT scores. The factor loading of the race IAT on the attitude factor once more suggests that about 20% of the variance in IAT scores reflects racial attitudes.

Falk, Heine, Zhang, and Hsu (2015)

Most important, the self-esteem IAT and the other implicit measures have low and non-significant loadings on the self-esteem factor. 

Bar-Anan & Vianello (2018)

Thus, low validity contributes considerably to low observed correlations between IAT scores and explicit self-esteem measures.

Bar-Anan & Vianello (2018) – Political Orientation

More important, the factor loading of the IAT on the implicit factor is much higher than for self-esteem or racial attitudes, suggesting over 50% of the variance in political orientation IAT scores is valid variance, π = .79, σ = .016.  The loading of the self-report on the explicit ratings was also higher, π = .90, σ = .010.

Variation of Implicit – Explicit Correlations Across Domains

This suggests that the IAT is good in classifying individuals into opposing groups, but it has low validity of individual differences in the strength of attitudes.

What Do IATs Measure?

The present results suggest that measurement error alone is often sufficient to explain these low correlations.  Thus, there is little empirical support for the claim that the IAT measures implicit attitudes that are not accessible to introspection and that cannot be measured with self-report measures. 

For 21 years the lack of discriminant validity has been overlooked because psychologists often fail to take measurement error into account and do not clearly distinguish between measures and constructs.

In the future, researchers need to be more careful when they make claims about constructs based on a single measure like the IAT because measurement error can produce misleading results.

Researchers should avoid terms like implicit attitude or implicit preferences that make claims about constructs simply because attitudes were measured with an implicit measure.

Recently, Greenwald and Banaji (2017) also expressed concerns about their earlier assumption that IAT scores reflect unconscious processes.  “Even though the present authors find themselves occasionally lapsing to use implicit and explicit as if they had conceptual meaning, they strongly endorse the empirical understanding of the implicit– explicit distinction” (p. 862).

How Well Does the IAT Measure What it Measures?

Studies with the IAT can be divided into applied studies (A-studies) and basic studies (B-studies).  B-studies employ the IAT to study basic psychological processes.  In contrast, A-studies use the IAT as a measure of individual differences. Whereas B-studies contribute to the understanding of the IAT, A-studies require that IAT scores have construct validity.  Thus, B-studies should provide quantitative information about the psychometric properties for researchers who are conducting A-studies. Unfortunately, 21 years of B-studies have failed to do so. For example, after an exhaustive review of the IAT literature, de Houwer et al. (2009) conclude that “IAT effects are reliable enough to be used as a measure of individual differences” (p. 363).  This conclusion is not helpful for the use of the IAT in A-studies because (a) no quantitative information about reliability is given, and (b) reliability is necessary but not sufficient for validity.  Height can be measured reliably, but it is not a valid measure of happiness. 

This article provides the first quantitative information about validity of three IATs.  The evidence suggests that the self-esteem IAT has no clear evidence of construct validity (Falk et al., 2015).  The race-IAT has about 20% valid variance and even less valid variance in studies that focus on attitudes of members from a single group.  The political orientation IAT has over 40% valid variance, but most of this variance is explained by group-differences and overlaps with explicit measures of political orientation.  Although validity of the IAT needs to be examined on a case by case basis, the results suggest that the IAT has limited utility as a measurement method in A-studies.  It is either invalid or the construct can be measured more easily with direct ratings.

Implications for the Use of IAT scores in Personality Assessment

I suggest replacing the reliability coefficient with the validity coefficient.  For example, if we assume that 20% of the variance in scores on the race IAT is valid variance, the 95%CI for IAT scores from Project Implicit (Axt, 2018), using the D-scoring method, with a mean of .30 and a standard deviation of .46 ranges from -.51 to 1.11. Thus, participants who score at the mean level could have an extreme pro-White bias (Cohen’s d = 1.11/.46 = 2.41), but also an extreme pro-Black bias (Cohen’s d = -.51/.46 = -1.10).  Thus, it seems problematic to provide individuals with feedback that their IAT score may reveal something about their attitudes that is more valid than their beliefs. 
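The arithmetic behind this interval can be reproduced with a short sketch. It assumes that the standard error of an individual score is the standard deviation times the square root of the invalid variance (1 minus the validity coefficient), and that a 95% interval spans 1.96 standard errors. The function name is hypothetical.

```python
import math


def score_interval(observed, sd, valid_variance, z=1.96):
    """95% interval around an observed score, given the share of valid variance."""
    standard_error = sd * math.sqrt(1 - valid_variance)
    return observed - z * standard_error, observed + z * standard_error


# Values from the text: mean D-score .30, SD .46, 20% valid variance.
lower, upper = score_interval(observed=0.30, sd=0.46, valid_variance=0.20)
print(f"95%CI: {lower:.2f} to {upper:.2f}")

# Express the interval limits in standard deviation units (Cohen's d).
print(f"d at lower limit: {lower / 0.46:.2f}")
print(f"d at upper limit: {upper / 0.46:.2f}")
```

With only 20% valid variance, the standard error (.41) is almost as large as the standard deviation of the scores (.46), which is why the interval covers most of the observed distribution.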


Social psychologists have always distrusted self-report, especially for the measurement of sensitive topics like prejudice.  Many attempts were made to measure attitudes and other constructs with indirect methods.  The IAT was a major breakthrough because it has relatively high reliability compared to other methods.  Thus, creating the IAT was a major achievement that should not be underestimated because the IAT lacks construct validity as a measure of implicit constructs. Even creating an indirect measure of attitudes is a formidable feat. However, in the early 1990s, social psychologists were enthralled by work in cognitive psychology that demonstrated unconscious or uncontrollable processes (Greenwald & Banaji, 1995). Implicit measures were based on this work and it seemed reasonable to assume that they might provide a window into the unconscious (Banaji & Greenwald, 2013). However, the processes that are involved in the measurement of attitudes with implicit measures are not the personality characteristics that are being measured.  There is nothing implicit about being a Republican or Democrat, gay or straight, or having low self-esteem.  Conflating implicit processes in the measurement of attitudes with implicit personality constructs has created a lot of confusion. It is time to end this confusion. The IAT is an implicit measure of attitudes with varying validity.  It is not a window into people’s unconscious feelings, cognitions, or attitudes.

Social psychology textbook audit: Something smells fishy

Social psychology textbooks like colorful laboratory experiments that illustrate a theoretical point. As famous social psychologist Daryl Bem stated, he considered his experiments more illustrations of what could happen than empirical tests of what actually happens. Unfortunately, social psychology textbooks make it less obvious that the results of highlighted studies should not be generalized to real life.

Myers and Twenge (2019) tell the story of fishy smells.

In a laboratory experiment, exposure to a fishy smell caused people to be suspicious of each other and cooperate less—priming notions of a shady deal as “fishy” (Lee & Schwarz, 2012). All these effects occurred without the participants’ conscious awareness of the scent and its influence.

They don’t even mention some other fun facts about this study. To make sure that the effect is not just a mood effect induced by bad odors in general, fishy smells were contrasted with fart smells, and the effect seemed to be limited to fishy smells.

The article was published in the top journal for experimental social psychology (JPSP:ASC) and is relatively highly cited.

However, the studies reported in this article smell a bit fishy and should be consumed with a grain of salt and a lot of lemon. The problem is that all of the results are significant, which is highly unlikely unless studies have very high statistical power (Schimmack, 2012).

And it even works the other way around.

And making people think about suspicion, also makes them think about fish, in theory.

Suspicion also makes you more sensitive to fishy smells.

Undergraduate students may not realize what the problem with these studies is. After all, they all worked out; that is, they produced a p-value less than .05, which is supposed to ensure that no more than 1 out of 20 studies is a false positive result. As all of these studies are significant, it is extremely unlikely that all of them are false positives. So, we would have to infer that suspicion is related to fishy smells in our minds.

However, since 2012 it is clear that we have to draw another conclusion. The reason is that results in social psychology articles like this one smell fishy and suggest that the authors are telling us a fun story, but they are not telling us what really happened in their lab. It is extremely unlikely that the authors reported all of their studies and data analyses that they conducted. Instead they may have used a variety of so-called questionable research practices that increase the chances of reporting a significant result. Questionable research practices are also known as fishing for significance. These questionable research practices have the undesirable effect that they increase the type-I error rate. Thus, while the reported p-values are below .05, the risk of a false positive result is not and could be as high as 100%.

To demonstrate that researchers used questionable research practices, we can conduct a bias test. The most powerful bias test for small sets of studies is the Test of Insufficient Variance. When most p-values are just significant (p < .05 and p > .005), but always significant, the results are not trustworthy because sampling error should produce more variability than we see.

The table lists the test statistics, converts the two-tailed p-values into z-scores and computes the variance of the z-scores. The variance is expected to be 1, but the actual variance is only 0.14. A chi-square test shows that this deviation is significant with p = .01. Thus, we have scientific evidence to claim that these results smell a bit fishy.

Study   Test   Value   df   p   z
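The logic of the test can be sketched as follows: compute the variance of the z-scores, multiply it by the number of studies minus one, and compare the result to the lower tail of a chi-square distribution, because the expected variance is 1. The z-scores in this sketch are made up for illustration and are not the values from the article. The chi-square probability is computed with a standard series expansion so that no external library is needed.

```python
import math
from statistics import variance


def chi2_cdf(x, df):
    """Lower-tail chi-square probability via the incomplete-gamma series."""
    a, y = df / 2, x / 2
    if y <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= y / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))


def insufficient_variance_test(z_scores):
    """Return the variance of the z-scores and the left-tailed p-value."""
    k = len(z_scores)
    var_z = variance(z_scores)          # sample variance, expected to be 1
    chi2 = var_z * (k - 1)              # test statistic with k - 1 df
    return var_z, chi2_cdf(chi2, k - 1)


# Hypothetical set of seven just-significant results (all z between 2.0 and 2.6).
hypothetical_z = [2.1, 2.3, 2.0, 2.5, 2.2, 2.4, 2.6]
var_z, p_value = insufficient_variance_test(hypothetical_z)
print(f"variance = {var_z:.2f}, p = {p_value:.4f}")
```

A small p-value means that the z-scores vary less than sampling error alone would produce, which is the signature of selection for significance.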

Unfortunately, these results are not the only fishy results in social psychology textbooks. Thus, students of social psychology should read textbook claims with a healthy dose of skepticism. They should also ask their professors to provide information about the replicability of textbook findings. Has this study been replicated in a preregistered replication attempt? Would you think you could replicate this result in your own lab? It is time to get rid of the fishy smell and let the fresh wind of open science clean up social psychology.

We can only hope that sooner rather than later, articles like this will sleep with the fishes.

Social-Psychology Textbook Audit: External Validity

Every social psychology textbook emphasizes the main problem of naturalistic studies (correlational research): it is difficult to demonstrate cause-effect relationships in these studies.

Social psychology has a proud tradition of addressing this problem with laboratory experiments. The advantage of laboratory experiments is that they make it easy to demonstrate causality. The disadvantage is that laboratory experiments have low ecological validity. It is therefore important to demonstrate that findings from laboratory experiments generalize to real world behavior.

Myers and Twenge’s (2019) textbook (13th edition) addresses this issue in a section called “Generalizing from Laboratory to Life.”

What people saw in everyday life suggested correlational research, which led to experimental research. Network and government policymakers, those with the power to make changes, are now aware of the results. In many areas, including studies of helping, leadership style, depression, and self-efficacy, effects found in the lab have been mirrored by effects in the field, especially when the laboratory effects have been large (Mitchell, 2012).

Mitchell, G. (2012). Revisiting truth or triviality: The external validity of research in the psychological laboratory. Perspectives on Psychological Science, 7, 109–117.

Curious about the evidence, I examined Mitchell’s article. I didn’t need to read beyond the abstract to see that the textbook misrepresented Mitchell’s findings.

Using 217 lab-field comparisons from 82 meta-analyses found that the external validity of laboratory research differed considerably by psychological subfield, research topic, and effect size. Laboratory results from industrial-organizational psychology most reliably predicted field results, effects found in social psychology laboratories most frequently changed signs in the field (from positive to negative or vice versa), and large laboratory effects were more reliably replicated in the field than medium and small laboratory effects.

Mitchell, G. (2012). Revisiting truth or triviality: The external validity of research in the psychological laboratory. Perspectives on Psychological Science, 7(2), 109–117.

So, 80% of the results covered in a social psychology course are based on laboratory experiments that may not generalize to the real world. In addition, students are given the false information that these results do generalize to the real world, when evidence of ecological validity is often missing. On top of this, many articles based on laboratory experiments report inflated effect sizes due to selection for significance and the results may not even replicate in other laboratory contexts.