Exploratory Factor Analysis - Notes
1 Introduction
Make sure you read this first (What is Factor Analysis?)
Exploratory Factor Analysis (EFA) is a statistical method that helps us uncover hidden patterns in data by identifying groups of related variables.
Think of it like sorting a messy drawer full of items - EFA helps you organise items into meaningful categories based on how they’re related to each other.
For example, if you have survey responses about different aspects of job satisfaction, EFA might reveal that questions about salary, benefits, and bonuses naturally group together into a “compensation” factor, while questions about coworkers, team spirit, and workplace communication might group into a “social environment” factor.
The main goals of EFA are to:
- Reduce many variables into a smaller set of meaningful factors;
- Discover underlying patterns in data that might not be obvious at first glance; and
- Help us understand complex relationships between different variables.
Unlike its cousin Confirmatory Factor Analysis, EFA is exploratory in nature, meaning we let the data ‘tell us’ what patterns exist, rather than testing a pre-existing theory about how variables should be grouped.
2 Dimensionality
2.1 Introduction
Exploratory Factor Analysis is used to identify the underlying structure of a dataset by reducing a large number of observed variables into fewer latent dimensions or factors.
In the figure above, a number of observed variables (in yellow) reduce to a smaller number of factors (in blue).
This process, known as “dimensionality reduction”, helps simplify complex data by uncovering patterns and relationships that aren’t immediately apparent. By grouping correlated variables into common factors, EFA provides a clearer understanding of how different measures (variables) contribute to the overall structure of the data.
2.2 Why is it useful?
Dimensionality is particularly important in areas like sport, where multiple performance metrics, psychological measures, or physiological data points are often collected and ‘bundled together’.
Managing this high-dimensional data can be challenging, as redundancy among variables can obscure meaningful analysis. EFA helps address this issue by identifying the minimum number of factors that explain the majority of the variance in the dataset, reducing noise and improving interpretability.
This is especially valuable when working with smaller sample sizes or when resources for data collection are limited.
2.3 What is a ‘latent variable’?
A latent variable is an unobserved or hidden variable that cannot be directly measured but is inferred from observed data.
Latent variables represent underlying concepts or constructs that are believed to give rise to the patterns seen in measurable variables.
For instance, while we can directly measure an athlete’s speed, strength, or reaction time, these observed measures might collectively indicate a broader, unobservable construct like ‘physical fitness’. In this case, ‘physical fitness’ is an example of a latent variable.
In exploratory factor analysis (EFA), latent variables are usually called “factors”. They explain the correlations among observed variables, simplifying the data into meaningful dimensions.
2.3.1 Examples
In sport analytics, latent variables (factors) are particularly useful in helping us understand complex phenomena where multiple measurable variables contribute to a larger concept.
For example:
“Athlete readiness” could be considered as a latent variable that reflects an athlete’s preparedness to perform at their peak. It can’t be directly measured but can be inferred from observed data such as sleep quality, resting heart rate, muscle soreness, mood state, and training load.
“Team cohesion” could be considered to represent the degree of unity and collaboration within a sports club or team which influences performance but is difficult to measure directly. Observable indicators of this latent variable might include communication frequency, mutual support during matches, shared goal commitment, and qualitative measures from team surveys.
2.4 Factor structure in EFA
“Factor structure” refers to the way observed variables group together to form these underlying latent variables, or factors.
In the figure above, we can see that some observed items (questions ISELFB03-ISELFB10) appear to be associated with Factor 1 (we call this ‘loading’ on Factor 1), and other items appear to be associated (load) with Factor 2. This is an example of ‘factor structure’.
2.4.1 What does factor structure tell us?
Factor structure shows which variables are associated with specific factors and to what degree, based on statistical relationships.
In the figure above, we can see that some variables are associated with Factor 1, and some with Factor 2.
In simpler terms, the factor structure reveals the patterns of correlations within the data and helps to organise complex datasets into meaningful, interpretable dimensions. A clear factor structure allows us to identify how variables cluster together, simplifying the interpretation of large datasets.
In sport data analytics, understanding factor structure is helpful for uncovering relationships among different metrics.
For example:
Physical Performance Dimensions
When analysing ‘physical performance’, metrics such as sprint speed, jump height, reaction time, and endurance levels may group into distinct factors. For example, explosive power could emerge as a latent factor linking sprint speed and jump height, while aerobic capacity might group endurance-related measures. The factor structure simplifies the data, allowing us to identify key performance dimensions rather than focusing on individual metrics.
Psychological Resilience
In studies of mental toughness or psychological resilience, observed variables like stress tolerance, ability to recover from failure, confidence levels, and focus under pressure may cluster into broader latent factors. For instance, mental endurance could encompass stress tolerance and recovery, while performance confidence could group measures of self-belief and focus. Understanding the factor structure could help us target interventions more effectively.
2.5 Eigenvalues
Eigenvalues are a measure of how much of the variance of the observed variables each factor explains.
In the figure above, you can see that four factors have been identified, each with an associated eigenvalue. Factor 1 has the highest (~2.0), Factor 2 is quite high (~1.6), and then the eigenvalues drop quite significantly.
This suggests that, underlying all the different responses/variables in our data, there are two latent variables (Factors 1 and 2) that explain the majority of the variance in the data.
2.5.1 Example
Imagine you have a lot of data about people’s preferences, like in a survey. Each question might relate to different underlying factors, like personality traits. Factor analysis helps you find these hidden factors.
Now, think of eigenvalues as a way to measure how much of the variation in your data can be explained by each of these hidden factors. When you perform factor analysis, you’re essentially trying to reduce the complexity of your data by finding a few key factors that explain most of the variation in your responses.
The process involves creating a correlation matrix, which is a mathematical representation of how all the different survey questions relate to each other. Eigenvalues are then calculated from this matrix. Each eigenvalue corresponds to a factor, and the size of the eigenvalue indicates how much of the total variation in your data that factor explains.
A larger eigenvalue means that the factor is more significant in explaining the variability in your data. In practice, we often look for factors with the largest eigenvalues, as these are the ones that give you the most information.
This is why, in factor analysis, we often focus on factors with eigenvalues greater than 1, as they are considered to contribute significantly to explaining the variation in the data set.
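To make this concrete, here is a minimal sketch in Python (using numpy and pandas) that computes eigenvalues from a correlation matrix and applies Kaiser’s criterion. The dataset is simulated, and the item names and two-factor structure are purely illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Simulated survey data: six items driven by two hypothetical latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
pattern = np.array([[0.8, 0.0], [0.7, 0.0], [0.75, 0.0],   # items 1-3 follow factor 1
                    [0.0, 0.8], [0.0, 0.7], [0.0, 0.75]])  # items 4-6 follow factor 2
df = pd.DataFrame(latent @ pattern.T + rng.normal(scale=0.5, size=(300, 6)),
                  columns=[f"item{i}" for i in range(1, 7)])

# Eigenvalues of the correlation matrix, sorted largest first
eigenvalues = np.linalg.eigvalsh(df.corr().to_numpy())[::-1]
print(np.round(eigenvalues, 2))

# Kaiser's criterion: retain factors with eigenvalue > 1
print("Factors to retain:", int((eigenvalues > 1).sum()))
```

With this simulated structure, the first two eigenvalues typically sit well above 1 and the remainder well below it, mirroring the two-factor pattern described above.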
3 Factor Extraction in EFA
3.1 Introduction
So, we’ve observed that Factor Analysis is about identifying underlying factors or ‘latent variables’ that explain the patterns of relationships among a set of observed variables.
We’ve also noted that the ultimate goal of factor extraction is to reduce the complexity of the data by summarising it into a smaller number of meaningful factors. These factors represent common influences shared by groups of variables, helping to simplify large datasets.
3.2 Process of factor extraction
graph TD A["Raw Variables"] --> B["Correlation Analysis"] B --> C["Factor Extraction"] C --> D["Factor 1<br/>(Most Variance)"] C --> E["Factor 2<br/>(Second Most Variance)"] C --> F["Factor 3<br/>(Third Most Variance)"] D --> G["Simplified Dataset"] E --> G F --> G %% Add styling style A fill:#f9f,stroke:#333 style B fill:#bbf,stroke:#333 style C fill:#dfd,stroke:#333 style D fill:#fdd,stroke:#333 style E fill:#fdd,stroke:#333 style F fill:#fdd,stroke:#333 style G fill:#ddf,stroke:#333
The process begins by analysing the correlations between variables to identify which ones group together.
Factors are then “extracted” based on how much variance in the data they explain, with the first factor accounting for the largest portion, followed by subsequent factors explaining progressively smaller amounts.
This step is crucial for uncovering the most important dimensions (factors) within a dataset, and making it easier to interpret.
For example, in sport data analytics, factor extraction could reveal that multiple financial measures, such as transfer value and current market value, share a common underlying factor such as overall player ‘value’. By extracting these factors, we can focus on the broader patterns in the data rather than individual metrics.
There are a number of different ways in which factors are extracted, and we’ll review some of the most common in the following section.
3.3 Principal Axis factoring
Principal Axis Factoring (PAF) is a commonly used method to identify latent factors that explain the shared variance among variables.
Unlike Principal Component Analysis (PCA), which includes all variance (shared and unique), PAF focuses solely on the shared variance, making it more appropriate for uncovering latent constructs.
The extraction process begins by estimating the communalities, which represent the proportion of variance each variable shares with others, and iteratively refining these estimates to optimise the factor solution.
One of the key strengths of PAF is its robustness when the data deviate from multivariate normality. It doesn’t require normality assumptions, making it a good choice when analysing real-world datasets that often contain non-normal distributions.
PAF is particularly useful when the primary goal is to understand the underlying structure of the data rather than maximise variance explained.
PAF produces factor loadings, which are used to interpret the relationship between variables and extracted factors. We often use rotation techniques (e.g., varimax or oblimin) to enhance the clarity of the factor structure after extraction.
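As a rough illustration, the sketch below runs principal axis factoring with the Python factor_analyzer package (which would need to be installed separately), continuing with the simulated df from the eigenvalue example in Section 2.5.1. The number of factors and the rotation choice are illustrative assumptions, not a prescription.

```python
from factor_analyzer import FactorAnalyzer
import pandas as pd

# Principal axis factoring: extract two factors, then apply a varimax rotation
fa = FactorAnalyzer(n_factors=2, method="principal", rotation="varimax")
fa.fit(df)   # df: the simulated item-level DataFrame from the earlier sketch

# Communalities: the proportion of each item's variance shared with the factors
print(pd.Series(fa.get_communalities(), index=df.columns).round(2))

# Factor loadings: how strongly each item relates to each extracted factor
print(pd.DataFrame(fa.loadings_, index=df.columns,
                   columns=["Factor 1", "Factor 2"]).round(2))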
3.4 Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is another widely used factor extraction method in EFA.
MLE seeks to identify the factor solution that maximises the likelihood of reproducing the observed correlation matrix, assuming that the data follow a multivariate normal distribution.
This approach provides parameter estimates, such as factor loadings and unique variances, that are statistically optimal under the normality assumption.
An advantage of MLE is its ability to generate goodness-of-fit statistics (see below), such as the chi-square test, to assess how well the model fits the observed data.
These metrics allow us to compare different factor solutions and make good decisions about the number of factors to retain.
Note, however, MLE’s reliance on the assumption of multivariate normality makes it sensitive to deviations from normality, which can impact the accuracy of the solution.
MLE is particularly effective in hypothesis-driven EFA or when confirmatory analysis is planned. Since it provides both statistical tests and flexible parameter estimation, MLE offers a more rigorous framework for factor extraction compared to other methods, such as PAF.
Nevertheless, its effectiveness depends on the quality of our data and the appropriateness of the underlying assumptions.
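If maximum likelihood extraction is preferred, the same package accepts method="ml". This is again an illustrative sketch reusing the simulated df from earlier; formal chi-square and related fit statistics are usually obtained from dedicated modelling software rather than from this snippet.

```python
from factor_analyzer import FactorAnalyzer
import pandas as pd

# Maximum likelihood extraction; note the implicit multivariate normality assumption
fa_ml = FactorAnalyzer(n_factors=2, method="ml", rotation="oblimin")
fa_ml.fit(df)   # df: the simulated item-level DataFrame from the earlier sketch
print(pd.DataFrame(fa_ml.loadings_, index=df.columns,
                   columns=["Factor 1", "Factor 2"]).round(2))
```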
3.5 Variance explained by the factors
“Variance explained” is a critical element in factor extraction. It helps us measure how well the extracted factors account for the variability in the observed variables.
In the following figure, you can see that Factor 1 accounts for ~40% of the variance, and Factor 2 accounts for ~39% of the variance.
This means that, in total, these two factors account for ~79% of the total variance in the dataset.
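In code, the variance explained by each factor can be inspected after extraction. The sketch below continues from the PAF example above and assumes the factor_analyzer package.

```python
import pandas as pd

# Sum of squared loadings, proportion of variance, and cumulative variance per factor
ss_loadings, proportion, cumulative = fa.get_factor_variance()
print(pd.DataFrame([ss_loadings, proportion, cumulative],
                   index=["SS loadings", "Proportion of variance", "Cumulative variance"],
                   columns=["Factor 1", "Factor 2"]).round(2))
```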
3.6 More on variance
The total variance in the data can be divided into three components:
- shared variance (captured by the factors)
- unique variance (specific to individual variables)
- error variance.
A well-fitting factor model should explain a substantial proportion of the shared variance, with thresholds like 60% often used as a benchmark.
The proportion of variance explained by each factor can also be represented by its eigenvalue, which reflects the factor’s contribution to the total variance.
Factors with eigenvalues greater than 1 are typically retained, based on Kaiser’s criterion, although this rule is sometimes supplemented with visual inspection of a scree plot (see above) or parallel analysis to identify the optimal number of factors.
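A scree plot is straightforward to produce once the eigenvalues are available; the minimal sketch below uses matplotlib and the fitted fa object from the extraction example, with Kaiser’s cut-off drawn at 1.

```python
import matplotlib.pyplot as plt

# Scree plot: eigenvalues of the correlation matrix in descending order
original_eigenvalues, _ = fa.get_eigenvalues()
plt.plot(range(1, len(original_eigenvalues) + 1), original_eigenvalues, marker="o")
plt.axhline(1, linestyle="--")   # Kaiser's criterion (eigenvalue = 1)
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```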
Explaining variance is not just a statistical exercise but also a theoretical one. The extracted factors must not only account for variance but also align with theoretical constructs that make sense in the research context.
For that reason, we’re often looking for reasons why our factors ‘make sense’, based on the variables that load most heavily on each of them.
Imagine you have extracted two important factors, with eigenvalues > 1. When you look at which variables load onto each factor, you see that the three variables associated with player motivation load onto Factor 1, and the four variables associated with player performance load onto Factor 2. This is an ‘easy’ model to explain, suggesting there are two latent variables underpinning responses: motivation and performance.
4 Factor Rotation in EFA
4.1 Introduction
Thus far, we’ve discussed how factors are ‘extracted’ from the data. We’ve also noted that we are really interested in which variables load most strongly onto which factor.
Sometimes we’re lucky, and the results are easy to interpret. Often, however, the initial solution is difficult to make sense of.
“Factor rotation” is a mathematical technique used in exploratory factor analysis to try to make the results more interpretable.
After the initial extraction, the factors are “rotated” in multidimensional space to achieve a simpler and more meaningful pattern of factor loadings.
4.2 Purposes of factor rotation
The main purposes of factor rotation are:
- To maximise high loadings and minimise low loadings
- To help us better interpret what each factor represents
- To achieve what’s called “simple structure” - where each variable loads strongly on one factor and weakly on others
There are two main types of factor rotation:
- Orthogonal rotation: Keeps factors uncorrelated (perpendicular to each other)
- Oblique rotation: Allows factors to be correlated
4.3 Orthogonal rotation
Orthogonal rotation is a method of factor rotation in EFA that maintains the independence of factors. This means that the rotated factors remain uncorrelated, simplifying the interpretation of the factor structure.
Common orthogonal rotation techniques include Varimax, Quartimax, and Equamax, with Varimax being the most widely used due to its ability to maximise the variance of squared loadings within factors. This creates a clearer distinction between variables that strongly associate with a factor and those that do not.
The key advantage of orthogonal rotation lies in its simplicity, as uncorrelated factors are easier to interpret and report. However, it assumes that the underlying constructs are truly independent, which may not always be realistic in practice.
For example, in psychological or social science research, latent variables such as anxiety and depression are often expected to overlap conceptually. When such relationships exist, orthogonal rotation may oversimplify the factor structure, prompting us to consider oblique rotation as an alternative.
4.4 Oblique rotation
Oblique rotation, in contrast, allows for the factors to be correlated, making it a more flexible option when the latent constructs are believed to overlap.
Techniques such as Oblimin and Promax are commonly used for oblique rotation, providing a factor solution where the relationships between factors are explicitly modelled. This is particularly valuable in fields where latent variables are conceptually or theoretically linked, as oblique rotation offers a more realistic representation of the data structure.
One consequence of oblique rotation is the generation of two matrices: the pattern matrix and the structure matrix. The pattern matrix shows the unique contribution of each variable to a factor, while the structure matrix includes both the direct contributions and the shared variance across factors.
While this complexity requires more effort to interpret, it gives us deeper insights into the relationships within the data.
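To see the effect of the two rotation families side by side, the sketch below refits the simulated data from earlier with a varimax (orthogonal) and an oblimin (oblique) rotation using factor_analyzer; after an oblique rotation, the printed loadings are usually interpreted as the pattern matrix.

```python
from factor_analyzer import FactorAnalyzer
import pandas as pd

# Fit the same two-factor model with an orthogonal and an oblique rotation
for rotation in ["varimax", "oblimin"]:
    fa_rot = FactorAnalyzer(n_factors=2, method="principal", rotation=rotation)
    fa_rot.fit(df)   # df: the simulated item-level DataFrame from earlier
    print(f"\n{rotation} loadings:")
    print(pd.DataFrame(fa_rot.loadings_, index=df.columns,
                       columns=["Factor 1", "Factor 2"]).round(2))
```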
4.5 Simple structure
4.5.1 Introduction
The goal of any factor rotation, whether orthogonal or oblique, is to achieve a simple structure, as originally proposed by Thurstone.
A “simple structure” is characterised by factors with high loadings on a few variables and low loadings on others, making the factor solution more interpretable. Simple structure clarifies which variables are strongly associated with each factor, reducing ambiguity and aiding in theoretical interpretation.
4.5.2 Rotation methods
As noted above, rotation methods can be used to rearrange the factor loading matrix to better approximate simple structure.
- Orthogonal methods like Varimax simplify interpretation by maximising the variance of the squared loadings, so that each variable tends to load strongly on only one factor.
- Oblique methods achieve the same goal while accommodating correlations between factors.
The pursuit of simple structure is crucial for generating meaningful insights, as it ensures that each factor represents a distinct and interpretable construct.
5 Model Fit in EFA
5.1 Introduction
In Exploratory Factor Analysis (EFA), ‘model fit’ refers to how well the factor model represents the relationships between variables in the observed data.
A good model fit indicates that the extracted factors adequately explain the patterns of correlations among the measured variables.
Model fit in EFA is assessed through several key indicators, such as:
- Explained Variance - the percentage of total variance in the variables that is accounted for by the extracted factors;
- Residual Analysis - examining the differences between observed correlations and those predicted by the factor model; and
- Goodness-of-fit Statistics - statistical measures that evaluate how well the factor solution reproduces the observed correlation matrix.
A well-fitting model should:
- Account for a substantial portion of variance in the data (typically >60%)
- Show small residuals between observed and reproduced correlations
- Demonstrate meaningful factor patterns that can be interpreted theoretically
5.2 Kaiser-Meyer-Olkin (KMO) Test
The KMO measure evaluates the proportion of variance in the data that can be attributed to underlying factors, as opposed to random or idiosyncratic noise.
- It provides a single value, ranging from 0 to 1, where higher values indicate that the data are more suitable for factor analysis.
- A KMO value above 0.70 is generally considered acceptable, while values below 0.50 suggest that the data may not be suitable for EFA.
This measure is particularly useful in identifying multicollinearity or redundancy among variables. High KMO values signal that the observed correlations are largely driven by shared variance, supporting the extraction of meaningful factors.
By addressing sampling adequacy, the KMO test provides a foundation for evaluating model fit and complements other diagnostic tests, such as Bartlett’s test of sphericity.
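In Python, the factor_analyzer package provides a KMO calculation; the sketch below applies it to the simulated df used throughout these notes.

```python
from factor_analyzer.factor_analyzer import calculate_kmo

# KMO per item and overall; an overall value above 0.70 is generally acceptable
kmo_per_item, kmo_overall = calculate_kmo(df)
print("Overall KMO:", round(kmo_overall, 2))
```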
5.3 Bartlett’s Test of Sphericity
While the KMO assesses sampling adequacy, Bartlett’s test of sphericity examines whether the observed correlation matrix is significantly different from an identity matrix, where variables are uncorrelated.
A significant Bartlett’s test (p < 0.05) indicates that sufficient correlations exist among the variables to justify factor analysis. This test is crucial because EFA assumes that the data exhibit some level of structure or clustering; without correlations between variables, factor extraction would not yield interpretable results.
Together with the KMO, Bartlett’s test ensures that the dataset meets the necessary assumptions for EFA, forming a solid basis for further evaluation of model fit.
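A matching sketch for Bartlett’s test, again using factor_analyzer on the simulated df:

```python
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

# Tests whether the correlation matrix differs significantly from an identity matrix
chi_square, p_value = calculate_bartlett_sphericity(df)
print(f"chi-square = {chi_square:.1f}, p = {p_value:.4f}")   # p < 0.05 supports EFA
```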
After establishing the suitability of the data, residual analysis becomes the next step in assessing how well the extracted factors reproduce the observed correlations.
5.4 Residual correlation
Residual analysis in EFA involves examining the differences between the observed correlations and the correlations reproduced by the factor model.
These residuals provide insight into the degree to which the model captures the relationships among variables. Ideally, residuals should be small, indicating that the model explains most of the observed correlations.
In practice, a root mean square residual (RMSR) value below 0.05 is often used as a benchmark for a good fit. Large residuals, on the other hand, suggest model misspecification, potentially due to an incorrect number of factors or poorly performing variables.
Residual analysis thus serves as a diagnostic tool to refine the model and ensure its theoretical and statistical validity. Once residuals are minimised, attention can shift to interpreting goodness-of-fit statistics to validate the overall factor solution.
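The residual check can be sketched directly with numpy, using the loadings from the fitted fa object above; the reproduced off-diagonal correlations are the cross-products of the loadings, and the RMSR summarises what the model leaves unexplained.

```python
import numpy as np

# Residuals: observed correlations minus those reproduced by the factor model
observed = df.corr().to_numpy()
reproduced = fa.loadings_ @ fa.loadings_.T
residuals = observed - reproduced

# Root mean square residual over the off-diagonal elements (< 0.05 suggests good fit)
off_diagonal = residuals[np.triu_indices_from(residuals, k=1)]
rmsr = np.sqrt(np.mean(off_diagonal ** 2))
print("RMSR:", round(rmsr, 3))
```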
5.5 Goodness-of-fit statistics
Goodness-of-fit statistics give us a more holistic assessment of how well the factor solution reproduces the observed data.
Common metrics include the comparative fit index (CFI), Tucker-Lewis index (TLI), and root mean square error of approximation (RMSEA). These measures evaluate the discrepancy between the observed and model-predicted correlation matrices while accounting for model complexity.
For example, an RMSEA value below 0.08 and a CFI above 0.90 indicate an acceptable fit.
Goodness-of-fit statistics allow us to balance model parsimony with explanatory power, ensuring that the solution is both statistically sound and theoretically meaningful.
6 Interpreting Factors in EFA
6.1 Introduction
```mermaid
graph TD
    A[Exploratory Factor Analysis] --> B[Dimensionality]
    A --> C[Factor extraction]
    A --> D[Factor rotation]
    A --> E[Model fit]
    A --> F[Interpreting factors]

    style A fill:#f9c802,stroke:#333,stroke-width:2px
    style B fill:#b3d9ff,stroke:#333,stroke-width:2px
    style C fill:#ffcccc,stroke:#333,stroke-width:2px
    style D fill:#ccffcc,stroke:#333,stroke-width:2px
    style E fill:#e6ccff,stroke:#333,stroke-width:2px
```
Finally, interpreting the results of our Exploratory Factor Analysis is crucial in understanding the underlying structure of the data. We want to develop meaningful insights from these factors.
This process involves examining factor loadings, identifying and resolving cross-loadings, and assigning meaningful names to the extracted factors.
6.2 Factor loadings
6.2.1 Introduction
As we noted above, factor loadings represent the correlation between each observed variable and the underlying factor, providing a measure of how strongly a variable contributes to a given factor.
High loadings (typically above 0.40 or 0.50, depending on the context) indicate that a variable is strongly associated with a factor, while low loadings suggest weaker relationships.
Factor loadings are key to interpretation, as they define the structure of each factor and its contribution to explaining the observed data. We typically look for patterns of high loadings within a factor to determine its conceptual meaning.
Additionally, loadings help evaluate the model’s overall explanatory power, with higher loadings contributing to a clearer and more interpretable factor structure.
6.2.2 Rotation methods
We’ve learned that the interpretation of factor loadings is often aided by rotation methods which adjust the factor axes to enhance clarity and simplify the factor structure.
Orthogonal rotations like varimax assume factors are uncorrelated, while oblique rotations, such as oblimin, allow for correlations between factors. Rotation can sharpen the distinction between factors, reducing ambiguity and highlighting the variables that most strongly define each factor.
Once the factor loadings have been interpreted, we need to look at cross-loadings, which can complicate the interpretation of individual variables.
6.3 Cross-loadings
Cross-loadings occur when a variable loads strongly onto more than one factor, complicating the interpretation of the factor structure.
Ideally, each variable should exhibit a high loading on only one factor and minimal loadings on others, but cross-loadings are common in real-world data. High cross-loadings may indicate that a variable is influenced by multiple underlying dimensions, making it difficult to assign the variable to a single factor. This issue can lead to conceptual ambiguity, as it becomes unclear which factor a variable truly represents.
To address cross-loadings, we may remove or reassign problematic variables, depending on the theoretical context and the size of the loadings. In some cases, slight cross-loadings may be acceptable if the primary loading is significantly higher than the secondary ones (e.g., a difference of 0.20 or more).
Additionally, oblique rotations can sometimes help resolve cross-loadings by accounting for correlations between factors. Managing cross-loadings is essential for achieving a clean and interpretable factor structure, which is a prerequisite for the final step of naming factors.
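One simple way to screen for cross-loadings is to flag any item whose absolute loading exceeds a chosen threshold on more than one factor; the 0.40 threshold below is an illustrative choice rather than a fixed rule, and the fa and df objects come from the earlier sketches.

```python
import pandas as pd

# Flag items that load above the threshold on more than one factor
loadings = pd.DataFrame(fa.loadings_, index=df.columns,
                        columns=["Factor 1", "Factor 2"])
threshold = 0.40
flagged = loadings.index[(loadings.abs() > threshold).sum(axis=1) > 1]
print("Possible cross-loading items:", list(flagged))
```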
6.4 Naming factors
6.4.1 Why name factors?
Naming factors involves assigning meaningful labels that summarise the variables loading onto each factor and reflect the underlying construct it represents. This process is both statistical and theoretical, requiring us to draw on domain knowledge to interpret the variables associated with each factor.
A well-named factor should capture the essence of its constituent variables, providing a concise yet informative description. For instance, a factor with high loadings from variables related to anxiety, stress, and worry might be named “Emotional Distress.”
6.4.2 How do we name factors?
The naming process is inherently subjective and often iterative, particularly when the factor structure is complex. We may need to refine factor names based on subsequent analysis or feedback from experts in the field.
Clear and consistent factor naming enhances the interpretability of the EFA results, allowing others to understand the findings of our analysis without returning to the raw data.