Review of “Fit Indices Are Insensitive to Multiple Minor Violations of Perfect Simple Structure in Confirmatory Factor Analysis”

To reduce bias, I created this review with the help of ChatGPT. ChatGPT evaluates the manuscript within the limited framework of the question examined in the article. My own review focuses on broader issues and on the practical relevance of these simulations.


ChatGPT

Overview

The manuscript offers a thorough examination of how confirmatory factor analysis (CFA) models with perfect simple structure perform when fitted to data generated by exploratory factor analysis (EFA) structures with various numbers of cross-loadings. The authors combine clear theoretical derivations with extensive simulation studies—implemented via a novel ShinyApp—to demonstrate that, under certain proportionality constraints, EFA models can be covariance-equivalent to their CFA counterparts. This work challenges the conventional wisdom that acceptable EFA solutions are invariably rejected by CFA models, and it raises important questions about the sensitivity of standard SEM fit indices (RMSEA, CFI, and SRMR) to omitted cross-loadings.


Strengths

  1. Theoretical Rigor and Novel Insights:
    The manuscript provides a detailed algebraic treatment showing that when cross-loadings in an EFA model are proportional to the main loadings, the covariance structure is invariant under rotation—yielding a CFA model with perfect simple structure. This theoretical contribution clarifies the mathematical basis for covariance equivalence and underlines why commonly used fit indices may fail to detect misspecification in many practical scenarios.
  2. Comprehensive Empirical Investigation:
    By systematically varying the number and magnitude of cross-loadings—and by considering both constant and random loading scenarios—the simulations offer a nuanced picture of when misfit is detectable by SEM fit indices. The exploration of both positive-only and mixed-sign cross-loadings adds depth to the analysis, illustrating that the sensitivity of fit indices depends not just on the number of cross-loadings but also on their sign and variability.
  3. Practical Tool (ShinyApp):
    The introduction of a ShinyApp for exploring the sensitivity of fit indices provides a valuable resource for researchers. It allows users to dynamically assess how different parameter settings affect model fit, making the theoretical results more accessible and practically useful.
  4. Integration with Existing Literature:
    The review of prior simulation studies (e.g., Chen et al., 2008; Ferrando & Lorenzo-Seva, 2000; Marsh et al., 2013, 2020) situates the work within the broader discourse on CFA versus EFA. The authors do a commendable job reconciling conflicting findings and explaining why commonly accepted “folk wisdom” about CFA model misfit may need to be revisited.

Weaknesses and Areas for Improvement

  1. Limited Scope of Models:
While the manuscript’s focus on 2- and 3-factor models makes the simulation study more manageable, the generalizability of the findings to models with a higher number of factors is less clear. Although some exploratory results for larger k are mentioned, a systematic investigation into higher-dimensional models would strengthen the paper’s implications for applied research.
  2. Reliance on Population-Level Analyses:
    The study emphasizes population values of fit indices to eliminate sampling fluctuations. Although this approach clarifies the asymptotic behavior of the indices, it may understate the impact of sampling variability. Future work could benefit from incorporating finite-sample analyses or confidence intervals to better bridge the gap between theory and practice.
  3. Exclusivity of Fit Indices Considered:
    The focus on RMSEA, CFI, and SRMR is justified for clarity, but many practitioners use a broader range of fit indices. Expanding the discussion to include alternative measures or discussing how other indices might behave under similar conditions would add practical value.
  4. Complexity and Accessibility:
    The manuscript is densely technical and might be challenging for readers who are not already familiar with the nuances of factor analysis and SEM. Simplifying some of the algebraic derivations or providing additional intuitive explanations could enhance accessibility without sacrificing rigor.
  5. Implications for Practice:
    While the paper offers clear recommendations for simulation studies and theoretical model selection, it could further discuss how practitioners should balance statistical fit with theoretical justification. Emphasizing strategies for validating the proportionality constraints in real-world data would be beneficial.

Conclusion

Overall, the manuscript makes a significant contribution to our understanding of model fit in factor analysis by revealing that CFA models with perfect simple structure can exhibit excellent fit—even when the true data-generating process is far more complex—due to covariance equivalence. The blend of theoretical and empirical work is commendable, as is the development of a practical tool for researchers. Addressing the limitations related to model scope, sample variability, and accessibility would further enhance the impact of the work. This study is a valuable resource for methodologists and applied researchers alike, prompting a re-examination of standard practices in model evaluation.


ChatGPT rating: 8 out of 10

Ulrich Schimmack

The main problem with this article is that it has limited (to be generous) relevance for applied researchers who want to use factor analysis to examine the pattern of correlations among items. The key problem is the confusion of the measurement and structural parts of a confirmatory/structural equation model. A measurement model relates theoretical constructs to measures of these constructs. For example, the item “I am sociable” is used to measure the personality disposition to be more or less sociable across different social situations. The purpose of examining correlations among items that measure different constructs is to evaluate discriminant validity (Campbell & Fiske, 1959).

Take the constructs sociability, assertiveness, and cheerfulness as an example. To measure the three constructs, researchers would write items that reflect differences in these three personality traits. A good item has high loadings on the intended factor that represents the theoretical construct and low loadings on the other factors. For example, the item “I am sociable” might have a loading (correlation) of .7 on the sociability factor and loadings of .1 and .2 on the assertiveness and cheerfulness factors. A common rule of thumb in factor analysis is that good items have loadings of .6 or higher on the intended factor and loadings of .3 or lower on the other factors.

Importantly, researchers have full control over the loading pattern by selecting good items and removing bad items. The goal is to end up with items that have high primary loadings (on the intended factor) and low secondary loadings (on the other factors). If the analysis does not yield enough items that show the expected relationships, researchers could and should write new items and validate them in a new set of data. Ultimately, a good measurement instrument would show the expected simple structure, a term used for the desired pattern of primary and secondary loadings.
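The item-selection logic just described (primary loadings of .6 or higher, secondary loadings of .3 or lower) is easy to mechanize. A minimal sketch with invented loading values, purely for illustration:

```python
import numpy as np

# Hypothetical loading matrix for nine items measuring sociability,
# assertiveness, and cheerfulness (all values invented for illustration).
# Columns: [sociability, assertiveness, cheerfulness]
loadings = np.array([
    [0.70, 0.10, 0.20],   # "I am sociable" -> good item
    [0.65, 0.25, 0.15],   # good item
    [0.45, 0.35, 0.10],   # bad: primary loading too low
    [0.15, 0.72, 0.05],   # good item
    [0.20, 0.68, 0.40],   # bad: secondary loading too high
    [0.10, 0.61, 0.22],   # good item
    [0.25, 0.05, 0.75],   # good item
    [0.30, 0.12, 0.66],   # good item
    [0.55, 0.10, 0.35],   # bad: primary loading on intended factor too low
])
intended = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # intended factor per item

# Primary loading = loading on the intended factor; secondary = all others.
primary = loadings[np.arange(9), intended]
mask = np.ones_like(loadings, dtype=bool)
mask[np.arange(9), intended] = False
max_secondary = np.where(mask, np.abs(loadings), 0).max(axis=1)

# Rule of thumb: keep items with primary >= .6 and all secondaries <= .3.
keep = (primary >= 0.6) & (max_secondary <= 0.3)
print(keep)
```

Items that fail the check would be dropped or rewritten and revalidated in new data, which is how researchers engineer the simple structure rather than discover it.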

The problem with Savalei and Huang’s simulations is that they treat the pattern of correlations as a natural given that needs to be explored and described. However, measurement models are not a product of nature. They are a product of researchers’ ability to write good items, and good items will show a simple structure.

It is also a common mistake to use EFA to examine the structure of item-item correlations, because EFA cannot reveal all of the problems that reduce the validity of items. Most importantly, EFA fails to reveal that some items are more strongly correlated with each other than the EFA model predicts. These additional relationships are called correlated residuals. Only CFA can reveal that two or more items have correlated residuals (Schimmack, 2025, Padova Winter School). Thus, the focus on secondary loadings in this article ignores a major problem in item validation, namely demonstrating that the residual variances of items are only weakly related.

Savalei and Huang might argue that EFA is sometimes used to examine theoretical questions. For example, before researchers used EFA to validate items to measure the Big Five traits, EFA was used to explore the structure of personality, which led to the creation of the Big Five model. The problem here is that it is unreasonable to expect nature to produce simple structures. For example, anger is related to high Neuroticism and low Agreeableness, and cheerfulness is related to high Extraversion and low Neuroticism. Researchers cannot control these theoretical relationships by selecting good measures, because the relationship between specific and broad traits is a theoretical one. This is called the structural model, and a structural model needs theoretical justification. This is also the reason why EFA cannot be used to validate structural models. To do so, it is necessary to specify a theoretical model and use SEM to fit the model to the data. It is totally irrelevant whether data generated with an EFA model can fit a CFA model or not. The real question is whether a theory fits the data, and to examine this, the theory needs to be specified as a set of structural equations and fitted to the data.

The final problem is that EFA is limited in the exploration of structural models for a number of reasons. First, it assumes that each item represents a distinct construct and that the correlations between items can be explained by their relationships to factors. This is a problem because single items have a lot of measurement error. Multi-rater studies suggest that only 30% to 50% of the variance in a personality item is valid variance. The large amount of measurement error makes it difficult to study structural relationships between constructs. A solution to this problem is to build hierarchical models that represent constructs with multiple items, but EFA cannot do this. Thus, CFA is the superior method for studying structural models free of measurement error.

Another problem with EFA is that it is a dogmatic causal model. That is, it assumes that constructs are related to each other only because they are influenced by unobserved third variables (factors) and cannot influence each other. This dogmatic assumption will sometimes be false (e.g., height and weight are not correlated due to the shared influence of an unobserved third variable), but factor analysis cannot represent the direct effect of height on weight. In contrast, SEM forces researchers to justify their causal assumptions when they build a model. Some correlations may be explained by unobserved third variables and others by direct causal effects. Researchers cannot hide behind a dogma to solve the third-variable problem (Schimmack, 2025, Padova Winter School).

In short, the relationship between EFA and CFA models has limited practical importance for researchers interested in measurement or in the exploration of substantive structural models. Measurement models relate items to constructs, and CFA is needed to demonstrate that items have good psychometric properties (i.e., high primary loadings, low secondary loadings, and low correlated residuals). Structural studies require validated measurement models to examine the relationships among constructs without dogmatic assumptions that hide the third-variable problem. To make progress in 2025, we need to introduce a new generation of students to SEM rather than examining the relationship between EFA and CFA.

Rating: 2 out of 10
