The Misguided Attack of a Meta-Psychometrician

Webster’s online dictionary defines a psychometrician as (a) a person who is skilled in the administration and interpretation of objective psychological tests or (b) a psychologist who devises, constructs, and standardizes psychometric tests.

Neither definition describes Denny Borsboom, who is better described as a theoretical psychometrician, a philosopher of psychometrics, or a meta-psychometrician. The reason is that Borsboom has never developed a psychological test. His main contributions to discussions about psychological measurement have been meta-psychological articles that reflect on the methods psychometricians use to interpret and validate psychological measures and test scores.

Thus, one problem with Borsboom’s (2006) article “The attack of the psychometricians” is that Borsboom is not a psychometrician who is concerned with developing psychological measures. The second problem with the title is the claim that a whole group of psychometricians was ready to attack, when he was mainly speaking for himself. Social psychologists call this the “false consensus” bias: individuals overestimate how much their attitudes are shared by others.

When I first read the article, I became an instant fan of Borsboom. He shared my frustration with common practices in psychological measurement that persisted despite criticism of these practices by eminent psychologists like Cronbach, Meehl, Campbell, and Fiske in the 1950s. However, it later became apparent that I misunderstood Borsboom’s article. What seemed like a call for improving psychological assessment turned out to be a criticism of the entire enterprise of measuring individuals’ traits. Apparently, psychometricians weren’t just using the wrong methods; they were actually misguided in their beliefs that there are traits that can be measured.

In 2006, Borsboom criticized leading personality psychologists for dismissing results that contradicted their a priori assumptions. When McCrae, Zonderman, Costa, Bond, and Paunonen (1996) used confirmatory factor analysis (CFA) to explore personality structure and their Big Five model did not fit the data, they questioned CFA rather than considering the possibility that their measurement model of the Big Five was wrong.

“In actual analyses of personality data [. . .] structures that are known to be reliable [from principal components analyses] showed poor fits when evaluated by CFA techniques. We believe this points to serious problems with CFA itself when used to examine personality structure” (McCrae et al., 1996, p. 563).

I fully agree with Borsboom that it is wrong to dismiss a method simply on the grounds that it does not support a preexisting theory. However, six years later Borsboom made the same mistake and used misfit of a measurement model to jump to the conclusion that the Big Five do not exist (Cramer, van der Sluis, Noordhof, Wichers, Geschwind, Aggen, Kendler, & Borsboom, 2012).

“The central tenet of this paper is to consider the misfit of the untweaked model an indication that the latent variable hypothesis fails as an explanation of the emergence of normal personality dimensions, and to move on towards alternative models” (p. 417).

This conclusion is as ridiculous as McCrae et al.’s conclusion. After all, how would it be possible that personality items that were created to measure a personality attribute and that were selected to show internal consistency and convergent validity with informant ratings do not reflect a personality trait? It seems more likely that the specified measurement model was wrong and that a different measurement model is needed to fit the data.

The key problem with the measurement model is the ridiculous assumption that all items load only on the intended factor. Exploratory factor analyses and principal component analyses typically show secondary loadings, so it is not surprising that omitting these secondary loadings from a CFA model produces bad fit. The real difficulty in fitting CFA models is that it is hard to create content-pure items; the problem is not that the Big Five cannot be identified or do not exist.

Confirmatory Factor Analysis of Big Five Data

Idealistic vs. Realistic CFA

The key problems in fitting simple CFA models to data are psychometric assumptions that are neither theory-driven nor plausible. The worst assumption of standard CFA models is that each personality item loads on a single factor. As a result, all loadings on other factors are fixed at zero. To the extent that actual data have secondary loadings, these CFA models will show poor fit.

From a theoretical point of view, constraining all secondary loadings to zero makes no sense. Doing so implies that psychometricians are able to create perfect items that reflect only a single factor. In this idealistic scenario, which exists only in the world of meta-psychometricians with simulated data, the null hypothesis that there are no secondary loadings is true. However, psychometricians who work with real data know that the null hypothesis is always false (Cohen, 1994). Meehl called this the crud factor: all items are correlated with all other items, even if these correlations are small and meaningless.

McCrae et al. (1996) made the mistake of interpreting bad fit of the standard CFA model as evidence that CFA cannot be used to study the Big Five. Borsboom and colleagues made the mistake of claiming that bad fit of the standard CFA model implies that the Big Five do not exist. The right conclusion is that the standard CFA model without secondary loadings and without method factors is an idealistic and therefore unrealistic model that will not fit real data. It can only serve as a starting point for exploration to find better-fitting models.

CFA models of the Big Five

Another problem with Borsboom et al.’s (2012) article is that the authors ignored studies that used CFA to model Big Five questionnaires with good fit to actual data. They do not cite Biesanz and West (2004), DeYoung (2006), or Anusic et al. (2009).

All three articles used CFA to model agreement and disagreement between self-ratings and informant ratings of the Big Five. Multi-method data are particularly useful for demonstrating that Big Five factors are more than mere self-perceptions. The general finding of these studies is that self-ratings and informant ratings show convergent validity and can be modeled with five latent factors that reflect the shared variance among raters. In addition, these models showed that the unique variances in ratings by a single rater are systematically correlated. The pattern of these correlations suggests an evaluative bias in self-ratings. CFA makes it possible to model this bias as a separate method factor, which is not possible with exploratory factor analysis. Thus, these articles demonstrate the usefulness of examining personality measurement models with CFA, and they show that the Big Five are real personality traits rather than mere artifacts of self-ratings.

Anusic et al. (2009) also developed a measurement model for self-ratings. The model assumes that the variance in each Big Five item has at least four components: (a) the intended construct variance (valid variance), (b) evaluative bias variance, (c) acquiescence bias variance, and (d) item-specific unique variance.
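This four-component decomposition can be illustrated with a small simulation. The loadings below are made-up illustrative values, not estimates from Anusic et al.; the point is only that independent components add up in the variances:

```python
import math
import random
import statistics

random.seed(42)
N = 100_000

# Independent score components for one hypothetical Big Five item.
trait = [random.gauss(0, 1) for _ in range(N)]   # (a) intended construct
halo = [random.gauss(0, 1) for _ in range(N)]    # (b) evaluative bias
acq = [random.gauss(0, 1) for _ in range(N)]     # (c) acquiescence bias

# Illustrative standardized loadings; the residual weight is chosen so that
# the squared weights sum to 1 (item-specific uniqueness is what is left).
w_t, w_h, w_a = 0.6, 0.4, 0.2
w_u = math.sqrt(1 - w_t**2 - w_h**2 - w_a**2)    # (d) item-specific variance

item = [w_t * t + w_h * h + w_a * a + w_u * random.gauss(0, 1)
        for t, h, a in zip(trait, halo, acq)]

# With independent components, the item variance is the sum of the squared
# weights, i.e. approximately 1 in this standardized sketch.
print(round(statistics.variance(item), 2))
```

In a standardized metric, each squared loading is that component’s share of the item variance, which is why content-impure items inevitably carry bias variance into sum scores.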

Thus, even in 2012 it was wrong to claim that CFA models do not fit Big Five data and to suggest that the Big Five do not exist. This misguided claim could only arise from a meta-psychometric perspective that ignores the substantive literature on personality traits.

Advances in CFA Modeling of Big Five Data

The internet has made it easier to collect and share data. Thus, there are now large datasets with Big Five data. In addition, computing power has increased exponentially, which makes it possible to analyze larger sets of items with CFA.

Beck, Condon, and Jackson published a preprint that examined the structure of personality with a network model and made their data openly available ( ). The dataset contains responses to 100 Big Five items (the IPIP-100) from a total of 369,151 participants who voluntarily provided data on an online website.

As the questionnaire was administered in English, I focused on English-speaking countries for my analyses. I used the Canadian sample for exploratory analyses. This way, the much larger US sample and the samples from Great Britain and Australia can be used for cross-validation.

Out of the 100 items, 78 were retained for the final model. Two items were excluded because of low coverage (that is, they were infrequently administered together with other items), and another 20 were excluded because of low loadings on their primary factor. The remaining 78 items were fitted with a model comprising the Big Five, an acquiescence factor, and an evaluative bias (halo) factor. Secondary loadings and correlated residuals were added by exploring modification indices. Loadings on the halo factor were initially fixed at 1, but were freed for items where modification indices suggested that this would improve fit. This made it possible to identify items with particularly high or low evaluative bias.

The precise specification of the model and the full results can be found in the MPLUS input file in the OSF project ( ). The model had excellent fit using the Root Mean Square Error of Approximation as a criterion, RMSEA = .017, 90% CI [.017, .017]. The Comparative Fit Index was acceptable, CFI = .940, considering the use of single-item indicators (Anusic et al., 2009). Table 1 shows the factor loadings on the Big Five factors and the two method factors for individual items and for the Big Five scales.

[Table 1. Standardized loadings of the 78 retained IPIP-100 items (with their item numbers) on the Big Five factors and the two method factors. The multi-column layout of the table did not survive conversion; the full loading matrix is available in the OSF project.]

The results show that the selected items have their highest loading on the intended factor, and all but two loadings exceed |.4|. Secondary loadings are always lower than the primary loadings, and most are below |.2|. Consistent with previous studies, loadings on the acquiescence factor are weak. Because acquiescence bias is reduced by including reverse-scored items, its effect on scales is trivial. Halo bias, in contrast, accumulates: about 15% of the variance in scales is evaluative bias. Secondary loadings produce only negligible correlations between scales. Thus, scales are a mixture of the intended construct and evaluative bias.
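The asymmetry between acquiescence and halo can be seen with back-of-the-envelope variance algebra. The loadings below are made-up values chosen to be in the ballpark of the reported results, not exact estimates:

```python
# Sketch: why acquiescence cancels in a balanced scale while evaluative
# bias (halo) accumulates. All loadings are illustrative only.
k = 16                                   # items in the scale
trait_load, halo_load, acq_load = 0.55, 0.25, 0.14

# After reverse scoring, trait and halo loadings point in the same direction
# for every item, but acquiescence loadings alternate sign in a balanced scale.
acq_signs = [1, -1] * (k // 2)

trait_var = (k * trait_load) ** 2
halo_var = (k * halo_load) ** 2
acq_var = sum(s * acq_load for s in acq_signs) ** 2     # cancels to zero
unique_var = k * (1 - trait_load**2 - halo_load**2 - acq_load**2)

total = trait_var + halo_var + acq_var + unique_var
print(f"halo share: {halo_var / total:.0%}")            # roughly 15%
print(f"acquiescence share: {acq_var / total:.0%}")     # 0%
```

Correlated bias components grow with the square of the number of items, while item-specific uniqueness grows only linearly, which is why even modest halo loadings end up as a noticeable share of scale variance.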

These results show that it is possible to fit a CFA model to a large set of Big Five items and to recover the intended structure. The results also show that sum scores can be used as reasonable proxies of the latent constructs. The main caveat is that scales are contaminated with evaluative bias.
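The claim that sum scores are reasonable proxies of the latent constructs can be checked with a quick simulation. The loading of .6 is a made-up but typical value, not one taken from the model above:

```python
import math
import random

random.seed(1)
N = 50_000
K = 6  # items per scale

def pearson(x, y):
    """Plain Pearson correlation, to keep the sketch dependency-free."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

trait = [random.gauss(0, 1) for _ in range(N)]
# K congeneric items, each loading .6 on the latent trait (illustrative).
items = [[0.6 * t + 0.8 * random.gauss(0, 1) for t in trait]
         for _ in range(K)]
sum_score = [sum(vals) for vals in zip(*items)]

# Expected correlation: K*.6 / sqrt((K*.6)**2 + K*.8**2), close to .88
print(round(pearson(sum_score, trait), 2))
```

Even a handful of moderately loading items yields a sum score that tracks the latent variable closely, which is the sense in which sum scores are proxies rather than perfect indicators.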

Creating Short Scales

Even on a powerful computer, a model with 78 items takes a long time to converge. Thus, it is not very useful for further analyses such as cross-validation across samples or exploring age or gender differences. Moreover, it is unnecessary to measure a latent variable with 18 items; even 3 or 4 indicators suffice to identify a latent construct. Thus, I created short scales with high-loading items and an equal balance of positively and negatively keyed items. The goal was six items per scale, but for two scales only two negative items were available, so those scales had only five items.

The results are presented in Table 2. Interestingly, even the results for scales (sum scores) are very similar, suggesting that administering only 28 items provides nearly the same information as all 78 items.

[Table 2. Standardized loadings of the 28 short-scale items on the Big Five factors and the two method factors; the table layout did not survive conversion.]

These results also show that Borsboom’s criticism of scale scores as arbitrary sum scores overstated the problem with commonly used personality measures. While scale scores are not perfect indicators of constructs, sum scores of carefully selected items can be used as proxies of latent variables.

The main problem of sum scores is that they are contaminated with evaluative bias variance. This is not a problem if latent variable models are used, because evaluative bias can be separated from Big Five variance. To control for evaluative bias with manifest scale scores, it is necessary to regress outcomes on all Big Five traits. Because evaluative bias is shared across the Big Five, it is largely removed from regression coefficients, which reflect only the unique contribution of each Big Five trait.
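A bit of covariance algebra, under made-up simplifying assumptions (each standardized scale equals its trait plus a shared bias component, traits mutually independent, and an outcome driven purely by the shared bias), illustrates how entering all five scales shrinks the bias-driven association:

```python
import math

# Toy covariance algebra (illustrative numbers, not estimates from the data).
# Each standardized scale is scale_i = trait_i + bias; the outcome y = bias.
vb = 0.16   # variance of the shared evaluative bias component
k = 5       # number of Big Five predictors

# Zero-order correlation of any single scale with the bias-driven outcome:
# cov(scale_i, y) = vb, var(scale_i) = 1 + vb, var(y) = vb.
r_zero = vb / math.sqrt((1 + vb) * vb)

# Standardized coefficient when all five scales are entered simultaneously.
# By symmetry, the normal equations reduce to beta * (1 + k * vb) = vb.
beta = vb / (1 + k * vb)

print(round(r_zero, 2), round(beta, 2))   # 0.37 0.09
```

In this toy setup the spurious association drops from about .37 at the zero-order level to about .09 in the multiple regression, showing how the shared bias is squeezed out of the unique coefficients.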

In conclusion, the results show that McCrae et al. were wrong to dismiss CFA as a method for personality psychologists and Borsboom et al. were wrong to claim that traits do not exist. CFA is ideally suited to create measurement models of personality traits and to validate personality scales. Ideally, these measurement models would use multiple methods such as personality ratings by multiple raters as well as measures of actual behaviors as indicators of personality traits.

Borsboom Comes to His Senses

In 2017, Borsboom and colleagues published another article on the Big Five (Epskamp, Rhemtulla, & Borsboom, 2017). The main focus of the article is to introduce a psychometric model that combines CFA, to model the influence of unobserved common causes, with network models that describe direct relationships between indicators. The model is illustrated with Big Five data.

The figure in their article shows that the model is essentially a CFA model (on the left) combined with a model of correlated residuals that is presented as a network (on the right). The authors note that the CFA model alone does not fit the data and that adding the residual network improves model fit. However, the authors do not explore alternative models with secondary loadings, and their model ignores the presence of method factors such as the acquiescence bias and evaluative bias factors identified in the model above (Anusic et al., 2009). Even though the authors consider their model a new psychometric model, adding residual covariances to a structural equation model is not really novel. Most important, the article shows a reversal in Borsboom’s thinking. Rather than claiming that latent factors are illusory, he seems to acknowledge that personality items are best modeled with latent factors.

It is interesting to contrast the 2017 article with the 2012 article. In 2012, Borsboom critiqued personality psychologists for “tweaking the model ‘on the basis of the data’ so that the basic latent variable hypothesis is preserved (e.g. by allowing cross-loadings, exploratory factor analysis with procrustes rotation; see also Borsboom, 2006 for an elaborate critique).” However, in 2017, Borsboom added a purely data-driven network of residual correlations to produce model fit. This shows a major shift in Borsboom’s thinking about personality measurement.

I think the 2017 article is a move in the right direction. All we need to do is to add method factors and secondary loadings to the model and Borsboom’s model in 2017 converges with my measurement model of the Big Five.

Where are we now?

A decade has passed since Borsboom marshaled his attack, but most of the problems that triggered his article remain. In part, Borsboom is to blame for this lack of progress, because his attack was directed at the existence of traits rather than at bad practices in measuring them.

The most pressing concern remains the deep-rooted tradition in psychology of working with operational definitions of constructs. That is, constructs are mere labels or vague statements that refer to a particular measure. Subjective well-being is the sum score on the Satisfaction with Life Scale; self-esteem is the sum score of Rosenberg’s 10 self-esteem items; and implicit bias is the difference score between reaction times on the Implicit Association Test. At best, it is recognized that observed scores are unreliable, but the actual correspondence between constructs and measures is never challenged.

This is what distinguishes psychology from natural sciences. Even though valid measurement is a fundamental requirement for empirical data to be useful, psychologists pay little attention to the validity of their measures. Implicitly, psychological research is conducted as if psychological measures are as valid as measures of height, weight, or temperature. However, in reality psychological measures have much lower validity. To make progress as a science, psychologists need to pay more attention to the validity of their measures. Borsboom (2006) was right about the obstacles in the way towards this goal.

Operationalism Rules

There is no training in theory construction or in formalizing measurement models. Measures are considered valid if they have face validity and reliability. Many psychology programs have no psychometricians and do not offer courses in psychological measurement. Measures are used because others used them before and somebody said at some point that the measure has been validated. This has to stop. Every valid measure requires a measurement theory, and the assumptions made by that theory need to be tested. The measurement theory has to be specified as a formal model that can be tested; no measure without a formal measurement model should be considered valid. Most important, it is not sufficient to rely on mono-method data, because mono-method data always contain method-specific variance. A multi-method approach is needed to separate construct variance from systematic method variance (Campbell & Fiske, 1959).

Classical Test Theory

Classical test theory may be sufficient for very specific applications such as performance or knowledge in a well-defined domain (e.g., multiplication, knowing facts about Canada). However, classical test theory does not work for abstract concepts like valuing power, achievement motivation, prejudice, or well-being. Students need to learn about latent variable models that can relate theoretical constructs to observed measures.

The Catch-All of Construct Validity

Borsboom correctly observes that psychologists lack a clear understanding of validation research.

“Construct validity functions as a black hole from which nothing can escape: Once a question gets labeled as a problem of construct validity, its difficulty is considered superhuman and its solution beyond a mortal’s ken.”

However, he doesn’t provide a solution to the problem and instead blames the inventors of construct validity for it. I vehemently disagree. Cronbach and Meehl (1955) not only coined the term construct validity; they also outlined a clear program of research for validating measures and probing the meaning of constructs. The problem is that psychologists never followed their recommendations. To improve psychological science, psychologists must learn to create formal measurement models (nomological networks) and test them with empirical data. No matter how poor and simplistic these models are in the beginning, they are needed to initiate the process of construct validation. As data become available, measures and constructs need to be revised to accommodate new evidence. In this sense, construct validation is a never-ending process that is still ongoing even in the natural sciences (the definition of a meter was just changed); but just because the process is never-ending doesn’t mean it should never be started.

Psychometrics is Risky

Borsboom correctly notes that it is not clear who should do actual psychometric work. Psychologists do not get rewarded for validation research: developing valid measures is not sexy, while showing unconscious bias with invalid measures is. Thus, measurement articles are difficult to publish in psychology journals. Actual psychometric work is also difficult to publish in methods journals, which focus on mathematical and statistical developments and do not care about applications to specific content areas. Finally, assessment journals focus on clinical populations and are not interested in measures of individual differences in normal populations. Thus, it is difficult to publish validation studies. To address this problem, it is important to demonstrate that psychology has a validation problem. Only when researchers realize that they are using measures with unknown or low validity will journals have an incentive to publish validation studies. As long as psychologists believe that any reliable sum score is valid, there is no market for validation studies.

It Shouldn’t Be Too Difficult

Psychologists would like to have the same status as the natural sciences or economics. However, students in those areas often have to learn complex techniques and math. In comparison, psychology is easy, which is part of its appeal. However, making scientific progress in psychology is a lot harder than it seems, and many students lack the training to do the hard work that would be required to move psychology forward. Structural equation modeling, for example, is not taught, and many students would not know how to develop a measurement model and how to test it. They may learn how to fit a cookie-cutter model to a dataset, but if the data do not fit the model, they would not know what to do. To make progress, training has to take measurement more seriously and prepare students to evaluate and modify measurement models.

But It’s Not in SPSS!

At least on this front, progress has been made. Many psychologists ran principal components analyses because that was the default option in SPSS. There were always other options, but users didn’t know the differences and stuck with the default. Now a young generation is trained to use R, and structural equation modeling is freely available with the lavaan package. Thus, students have access to statistical tools that were not as easily available a decade ago.

Thou Shalt Not. . .

Theoretical psychometricians are a special personality type. Many of them would like to be as far away as possible from any application to real data, which never meet the strict assumptions of their models. This makes it difficult for them to teach applied researchers how to use psychometric models effectively. Ironically, Borsboom himself imposed the criteria of simple structure and local independence on data to argue that CFA is not appropriate for personality data. But if two items lack local independence (e.g., because they are antonyms), it does not imply that a measurement model is fundamentally flawed. The main advantage of structural equation modeling is that it is possible to test the assumption of local independence and to modify the model if it is violated. This is exactly what Borsboom did in the 2017 article. The idea that you shall not have correlated residuals in your models is unrealistic and not useful in applied settings. Thus, we need real psychometricians who are interested in the application of psychometric models to actual data, who care deeply about the substantive area, and who want to apply models to actual messy data to improve psychological measurement. Armchair criticism from meta-psychometricians is not going to move psychology forward.

Sample Size Issues

Even in 2006, sample size was an issue for the use of psychometric models. However, sample sizes have increased tremendously thanks to online surveys. There are now dozens of Big Five datasets with thousands of respondents. The IAT has been administered to millions of volunteers. Thus, sample size is no longer an issue and it is possible to fit complex measurement models to real data.

Substantive Factors

Borsboom again picks personality psychology to make his point.

“For instance, personality traits are usually taken to be continuously structured and conceived of as reflective latent variables (even though the techniques used do not sit well with this interpretation). The point, however, is that there is nothing in personality theory that motivates such a choice, and the same holds for the majority of the subdisciplines in psychology.”

This quote illustrates the problem with meta-psychometricians. They are not experts in a substantive area and are often unaware of substantive facts that may motivate a specific measurement model. Borsboom seems to be unaware that psychologists have tried to find personality types, and that dimensional models won because it was impossible to find clearly defined types. Moreover, people have no problem rating their personality along quantitative scales and indicating that they are slightly or strongly interested in art, or that they worry sometimes or often. Not to mention that personality traits show evidence of heritability, and that we would expect an approximately normal distribution for traits that are influenced by multiple randomly combined genes (as for height).

Thus, to make progress, we need psychologists who have both substantive and statistical knowledge, so they can develop and improve measurement models of personality and other constructs. What we do not need are meta-psychologists without substantive knowledge who comment on substantive issues.

Read and Publish Widely

Borsboom also gives some good advice for psychometricians.

The founding fathers of the Psychometric Society—scholars such as Thurstone, Thorndike, Guilford, and Kelley—were substantive psychologists as much as they were psychometricians. Contemporary psychometricians do not always display a comparable interest with respect to the substantive field that lends them their credibility. It is perhaps worthwhile to emphasize that, even though psychometrics has benefited greatly from the input of mathematicians, psychometrics is not a pure mathematical discipline but an applied one. If one strips the application from an applied science one is not left with very much that is interesting; and psychometrics without the “psycho” is not, in my view, an overly exciting discipline. It is therefore essential that a psychometrician keeps up to date with the developments in one or more subdisciplines of psychology.

I couldn’t agree more and I invite Denny to learn more about personality psychology, if he wants to make some contribution to the measurement of personality. The 2017 paper is a step in the right direction. Finding the Big Five in a questionnaire that was developed to measure the Big Five is a first step. Developing a measurement model of personality and assessing validity with multi-method data is a task that is worthwhile attacking in the next decade.


    1. Hi Henrik,
the reason I didn’t cite it is because I didn’t know about it. I read it, and I think there is general agreement that method factors need to be included, but there are some differences in modeling method factors. A key difference that explains the better fit of my models is that I include secondary loadings. A second difference, which also has practical implications, is that I constrain loadings on the method factor. Letting all items load freely on method factors can create major problems in separating method and content variance.

      It definitely seems worthwhile to do a direct comparison of these models on the same dataset.

      Best, Uli
