Introduction to Structural Equation Modeling (SEM)
What is it?
A multivariate statistical analysis technique that allows us to specify and estimate models that describe:
relationships among observed (measured) variables
relationships between latent (unobserved) constructs (factors) underlying these observed variables.
Key aspects of SEM
“Measurement Model”
Part of the model that links the observed indicators (e.g., survey items) to their underlying latent variables/factors.
“Structural Model”
Part of the model that specifies how latent variables (and possibly observed variables) relate to each other (i.e., regression paths among latent variables).
A structural model…
Latent Variables: Constructs that cannot be measured directly but are inferred from multiple observed indicators.
Model Fit: We want to see how well the hypothesised model reproduces the observed data (i.e., how well the covariance structure in our data is captured by our model).
Model Fit
Absolute Fit: Measures how well the model reproduces observed data.
Incremental Fit: Compares the model’s fit to a null or baseline model.
Parsimony Fit: Evaluates model complexity, penalizing overfitting.
Why is SEM useful?
It allows researchers to simultaneously test complex relationships.
It accounts for measurement error explicitly by modeling latent constructs.
It provides multiple model fit indices to assess how well the specified model aligns with the data.
Generating Dataset
We’ll simulate a dataset involving 120 athletes with the following hypothetical constructs:
Motivation: Latent variable measured by three items (Motiv1, Motiv2, Motiv3).
PhysicalCondition: Latent variable measured by three items (Phys1, Phys2, Phys3).
Performance: Latent variable measured by three items (Perf1, Perf2, Perf3).
A simple conceptual model might look like this:
Motivation → Performance
PhysicalCondition → Performance
That is, an athlete’s motivation and physical condition predict their performance.
This is a highly simplified scenario for demonstration purposes. Real-world examples often involve more nuanced relationships and additional constructs (e.g., psychological well-being, coaching satisfaction, etc.).
Data Generation Code
library(lavaan)

set.seed(123)

# Number of athletes
N <- 120

# TRUE (population) parameters for the simulation:
# define each latent variable with mean = 0, variance = 1,
# then define factor loadings for each item.

# Create latent variables as standardised normal random variables
Motivation <- rnorm(N, mean = 0, sd = 1)
PhysicalCondition <- rnorm(N, mean = 0, sd = 1)

# Suppose Performance is influenced by both Motivation and PhysicalCondition:
# Performance = 0.6 * Motivation + 0.5 * PhysicalCondition + error,
# where the error is also a normal random variable
Performance <- 0.6 * Motivation + 0.5 * PhysicalCondition + rnorm(N, 0, 1)

# For measurement items, have them reflect their latent constructs.
# Item loadings are set in a plausible range, around 0.7 to 0.9.

# Motivation items
Motiv1 <- 0.8 * Motivation + rnorm(N, 0, 0.5)
Motiv2 <- 0.9 * Motivation + rnorm(N, 0, 0.5)
Motiv3 <- 0.7 * Motivation + rnorm(N, 0, 0.5)

# PhysicalCondition items
Phys1 <- 0.8 * PhysicalCondition + rnorm(N, 0, 0.5)
Phys2 <- 0.7 * PhysicalCondition + rnorm(N, 0, 0.5)
Phys3 <- 0.9 * PhysicalCondition + rnorm(N, 0, 0.5)

# Performance items
Perf1 <- 0.8 * Performance + rnorm(N, 0, 0.5)
Perf2 <- 0.8 * Performance + rnorm(N, 0, 0.5)
Perf3 <- 0.8 * Performance + rnorm(N, 0, 0.5)

# Combine into a data frame
sport_data <- data.frame(
  Motiv1, Motiv2, Motiv3,
  Phys1, Phys2, Phys3,
  Perf1, Perf2, Perf3
)
What we’ve done
Created three latent constructs in the simulation: Motivation, PhysicalCondition, Performance.
Created item-level observed variables (3 items per latent variable).
Imposed factor loadings so that each item loads onto its respective latent variable plus some measurement error.
Linked Performance with Motivation and PhysicalCondition at the structural level.
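Because the simulation fixes the loadings and error variances, the population values that the fitted model should roughly recover can be worked out by hand. The sketch below is an illustration, not part of the original analysis. It assumes the two exogenous latent variables are independent with unit variance, as in the generating code, and the object names (loadings, std_loadings, std_paths and so on) are chosen here for illustration only.

```r
# Population values implied by the simulation settings (base R only).
# Assumption: Motivation and PhysicalCondition are independent with variance 1,
# and every item error has sd = 0.5 (variance 0.25), as in the generating code.

loadings  <- c(Motiv1 = 0.8, Motiv2 = 0.9, Motiv3 = 0.7)
error_var <- 0.5^2

# Item variance = loading^2 * Var(factor) + error variance
item_var <- loadings^2 * 1 + error_var

# Standardised loading = loading * sd(factor) / sd(item)
std_loadings <- loadings / sqrt(item_var)
round(std_loadings, 3)

# Structural part: Performance = 0.6*Motivation + 0.5*PhysicalCondition + e, Var(e) = 1
perf_var  <- 0.6^2 + 0.5^2 + 1            # total variance of Performance
r2_perf   <- (0.6^2 + 0.5^2) / perf_var   # share explained by the two predictors
std_paths <- c(Motivation = 0.6, PhysicalCondition = 0.5) / sqrt(perf_var)

round(c(perf_var = perf_var, r2 = r2_perf), 3)
round(std_paths, 3)
```

With only 120 simulated athletes, sample estimates will scatter around these population values rather than match them exactly.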
Exploring and Preparing the Data
Although we generated the data, in a real-world scenario, we’d import the data (e.g., via read.csv), inspect it (e.g., check summary statistics), and clean/prepare it (handle missing data, outliers, etc.) before analysis.
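The import, inspect and prepare steps could be sketched as follows. This is a minimal illustration in base R: the data frame toy and the temporary CSV file are hypothetical stand-ins for a real data file, and listwise deletion is shown only because it is the simplest way to handle missing values, not because it is recommended.

```r
# A minimal import-and-inspect sketch using base R only.
# A temporary file stands in for a real (hypothetical) CSV of athlete data.

csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  Motiv1 = c(0.2, -1.1, NA, 0.7),
  Motiv2 = c(0.4, -0.9, 0.1, 0.5),
  Motiv3 = c(0.1, -1.3, 0.3, NA)
)
write.csv(toy, csv_path, row.names = FALSE)

# 1. Import
raw <- read.csv(csv_path)

# 2. Inspect: structure, summary statistics, missing values per column
str(raw)
summary(raw)
n_missing <- colSums(is.na(raw))
n_missing

# 3. Prepare: keep only complete rows (listwise deletion)
clean <- raw[complete.cases(raw), ]
nrow(clean)
```

As an alternative to deleting rows, lavaan can estimate the model with full-information maximum likelihood by passing missing = "fiml" to sem().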
Summary statistics
# Summary statistics
summary(sport_data)
Motiv1 Motiv2 Motiv3 Phys1
Min. :-1.95578 Min. :-2.46935 Min. :-1.87478 Min. :-1.6706258
1st Qu.:-0.53845 1st Qu.:-0.54647 1st Qu.:-0.59084 1st Qu.:-0.6588508
Median : 0.02446 Median :-0.05042 Median :-0.02529 Median :-0.1774015
Mean : 0.02414 Mean : 0.01876 Mean :-0.04613 Mean :-0.0008326
3rd Qu.: 0.54019 3rd Qu.: 0.70533 3rd Qu.: 0.53138 3rd Qu.: 0.5912595
Max. : 2.30500 Max. : 2.30523 Max. : 1.94040 Max. : 2.8674545
Phys2 Phys3 Perf1 Perf2
Min. :-1.95334 Min. :-1.714333 Min. :-2.65410 Min. :-2.82536
1st Qu.:-0.47479 1st Qu.:-0.775219 1st Qu.:-0.74166 1st Qu.:-0.69050
Median :-0.01934 Median :-0.181477 Median : 0.03018 Median : 0.04108
Mean : 0.05195 Mean : 0.002534 Mean : 0.06116 Mean : 0.05313
3rd Qu.: 0.54350 3rd Qu.: 0.699804 3rd Qu.: 0.85931 3rd Qu.: 0.92512
Max. : 2.87921 Max. : 2.850848 Max. : 2.56591 Max. : 2.97722
Perf3
Min. :-2.47781
1st Qu.:-0.65611
Median : 0.12665
Mean : 0.05785
3rd Qu.: 0.72227
Max. : 2.69991
Correlations
library(corrplot)

# Compute correlation matrix
M <- cor(sport_data)

# Create a heatmap-style plot
corrplot(
  M,
  method = "color",       # color-coded squares
  type = "upper",         # show upper triangular matrix only
  addCoef.col = "black",  # add correlation coefficients
  tl.col = "black",       # text label color
  tl.srt = 45             # rotate text labels
)
Specifying an SEM model involves defining relationships between latent and observed variables, specifying regression paths, and incorporating error terms and constraints.
It establishes which variables are exogenous (independent, not influenced by other variables) or endogenous (influenced by other variables in the model), ensuring the model aligns with theory and is statistically identifiable.
Endogenous and Exogenous
Exogenous Variables (Independent Variables):
Motivation (not influenced by any other variable)
Physical Condition (not influenced by any other variable)
Endogenous Variable (Dependent Variable):
Performance (influenced by both Motivation and Physical Condition)
Note!
In an SEM model, the observed variables (Motiv1, Motiv2, Motiv3, Phys1, Phys2, Phys3, Perf1, Perf2, Perf3) are indicators rather than exogenous or endogenous variables themselves; they measure the latent constructs.
Measurement Model
At the level of the measurement model, we need to specify how the observed items load on their respective latent factors.
In lavaan syntax, a single latent variable is specified by listing the factor name to the left of =~ and the item names to the right.
For instance:
Motivation =~ Motiv1 + Motiv2 + Motiv3
Structural Model
In the structural model, we specify how the latent factors relate to each other.
Our hypothesis is that Motivation and PhysicalCondition both positively predict Performance.
=~ indicates that items load on their respective latent variable.
Performance ~ Motivation + PhysicalCondition indicates that Performance is regressed on Motivation and PhysicalCondition.
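The fitting call refers to an object named sem_model. A specification consistent with the two pieces of syntax described above might look like the following. It is a reconstruction from the text, so treat the exact layout as an assumption.

```r
# Model specification as a lavaan model string.
# Reconstructed from the description in the text (assumed layout).
sem_model <- '
  # Measurement model: each latent factor is measured by three items
  Motivation        =~ Motiv1 + Motiv2 + Motiv3
  PhysicalCondition =~ Phys1 + Phys2 + Phys3
  Performance       =~ Perf1 + Perf2 + Perf3

  # Structural model: Performance regressed on the two exogenous factors
  Performance ~ Motivation + PhysicalCondition
'
```

By default, sem() fixes the first loading of each factor to 1 to set the factor's scale and lets the exogenous latent variables covary, so neither needs to be written out in the string.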
Fitting the Model in R using lavaan
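We can now estimate the parameters of this model (factor loadings, residual variances, factor variances and covariances, regression coefficients) using our simulated dataset. With lavaan's defaults and complete data, no mean structure is fitted, so item intercepts are not estimated.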
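Before fitting, it can help to count the free parameters and the resulting degrees of freedom by hand. The sketch below assumes lavaan's defaults for sem() (first loading per factor fixed to 1, exogenous latent variables allowed to covary, no mean structure). The object names are illustrative.

```r
# Counting free parameters and degrees of freedom (base R).
# Assumes lavaan's sem() defaults as described in the lead-in.

n_observed <- 9

free_params <- c(
  loadings           = 6,  # 9 items minus 3 marker loadings fixed to 1
  residual_variances = 9,  # one per item
  regression_paths   = 2,  # Performance ~ Motivation + PhysicalCondition
  factor_variances   = 2,  # Motivation, PhysicalCondition
  factor_covariance  = 1,  # Motivation ~~ PhysicalCondition
  disturbance        = 1   # residual variance of Performance
)

n_params  <- sum(free_params)
n_moments <- n_observed * (n_observed + 1) / 2  # unique variances and covariances
df_model  <- n_moments - n_params

c(parameters = n_params, moments = n_moments, df = df_model)
```

A positive number of degrees of freedom means the model is over-identified, which is what makes a test of model fit possible.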
R code
# Fit the model
fit <- sem(model = sem_model, data = sport_data)
Summary of our model
# View a summary of the fitted model
summary(fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-19 ended normally after 31 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 21
Number of observations 120
Model Test User Model:
Test statistic 28.950
Degrees of freedom 24
P-value (Chi-square) 0.222
Model Test Baseline Model:
Test statistic 676.336
Degrees of freedom 36
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.992
Tucker-Lewis Index (TLI) 0.988
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -1155.721
Loglikelihood unrestricted model (H1) -1141.246
Akaike (AIC) 2353.442
Bayesian (BIC) 2411.980
Sample-size adjusted Bayesian (SABIC) 2345.588
Root Mean Square Error of Approximation:
RMSEA 0.041
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.089
P-value H_0: RMSEA <= 0.050 0.568
P-value H_0: RMSEA >= 0.080 0.100
Standardized Root Mean Square Residual:
SRMR 0.034
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Motivation =~
Motiv1 1.000 0.713 0.809
Motiv2 1.131 0.115 9.800 0.000 0.806 0.852
Motiv3 0.934 0.096 9.705 0.000 0.666 0.840
PhysicalCondition =~
Phys1 1.000 0.671 0.720
Phys2 1.049 0.129 8.156 0.000 0.704 0.799
Phys3 1.334 0.159 8.397 0.000 0.896 0.908
Performance =~
Perf1 1.000 0.956 0.895
Perf2 1.056 0.072 14.619 0.000 1.009 0.921
Perf3 0.919 0.071 12.988 0.000 0.878 0.857
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Performance ~
Motivation 0.558 0.125 4.464 0.000 0.416 0.416
PhysicalCondtn 0.525 0.134 3.916 0.000 0.369 0.369
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Motivation ~~
PhysicalCondtn 0.030 0.050 0.597 0.551 0.062 0.062
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.Motiv1 0.268 0.049 5.520 0.000 0.268 0.345
.Motiv2 0.246 0.053 4.604 0.000 0.246 0.274
.Motiv3 0.186 0.038 4.897 0.000 0.186 0.295
.Phys1 0.419 0.065 6.431 0.000 0.419 0.482
.Phys2 0.281 0.053 5.291 0.000 0.281 0.362
.Phys3 0.171 0.066 2.595 0.009 0.171 0.175
.Perf1 0.226 0.046 4.934 0.000 0.226 0.198
.Perf2 0.183 0.046 4.020 0.000 0.183 0.153
.Perf3 0.279 0.047 5.925 0.000 0.279 0.266
Motivation 0.508 0.100 5.082 0.000 1.000 1.000
PhysicalCondtn 0.451 0.105 4.296 0.000 1.000 1.000
.Performance 0.613 0.107 5.738 0.000 0.672 0.672
Explanation of arguments
model = sem_model is our model specification string.
data = sport_data tells lavaan which dataset to use.
The measurement portion will list loadings of each item on its factor.
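Typically, standardised loadings (the Std.all column) above ~0.60 are considered good.
Standard errors and p-values indicate whether the freely estimated loadings differ significantly from zero (they should, if the items indeed measure the construct). The first loading of each factor is fixed to 1 to set the scale, so it has no standard error or p-value.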
Standardised estimates
Make interpretation straightforward in terms of the relative strength of each predictor.
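The incremental and absolute fit indices in the summary can be reproduced from the two chi-square statistics, which shows what each index compares. The numbers below are copied from the lavaan output. The formulas are the standard definitions, and using N rather than N - 1 in the RMSEA denominator is an assumption that happens to match the reported value.

```r
# Reproducing CFI, TLI and RMSEA from the chi-square statistics in the summary.

chisq_m <- 28.950;  df_m <- 24   # user model
chisq_b <- 676.336; df_b <- 36   # baseline (independence) model
n       <- 120

# Chi-square test of exact fit
p_value <- pchisq(chisq_m, df = df_m, lower.tail = FALSE)

# CFI: improvement in non-centrality relative to the baseline model
cfi <- 1 - max(chisq_m - df_m, 0) / max(chisq_m - df_m, chisq_b - df_b, 0)

# TLI: compares chi-square/df ratios, so it penalises model complexity
tli <- ((chisq_b / df_b) - (chisq_m / df_m)) / ((chisq_b / df_b) - 1)

# RMSEA: misfit per degree of freedom, scaled by sample size
rmsea <- sqrt(max(chisq_m - df_m, 0) / (df_m * n))

round(c(p = p_value, CFI = cfi, TLI = tli, RMSEA = rmsea), 3)
```

Commonly cited rules of thumb treat CFI and TLI of about 0.95 or higher, RMSEA of about 0.06 or lower, and SRMR of about 0.08 or lower as signs of good fit, although such cutoffs are guidelines rather than strict tests.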
Refining and Checking Model Fit
If the initial model fit is not satisfactory, researchers often examine:
Modification Indices to see if freeing certain parameters might improve fit. (Always interpret these carefully and theoretically!)
Residual correlation matrix to identify areas where the model does not capture the relationships well.
In lavaan, you can check modification indices using:
modindices(fit, sort = TRUE, minimum.value = 10)
[1] lhs op rhs mi epc sepc.lv sepc.all sepc.nox
<0 rows> (or 0-length row.names)
Warning: Blindly adding modifications can lead to overfitting. You should only add paths that make conceptual sense.
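The residual correlation matrix mentioned above can be extracted with lavaan's residuals() function. To keep the sketch self-contained, it uses lavaan's built-in HolzingerSwineford1939 data and a textbook three-factor model rather than the athlete model. The names hs_model, hs_fit and res_cor are illustrative.

```r
# Inspecting residual correlations with lavaan.
# Uses the built-in HolzingerSwineford1939 data, not the athlete data.
library(lavaan)

hs_model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'
hs_fit <- cfa(hs_model, data = HolzingerSwineford1939)

# Observed minus model-implied correlations; large absolute values flag
# pairs of items whose association the model reproduces poorly
res_cor <- unclass(residuals(hs_fit, type = "cor")$cov)
round(res_cor, 2)

# Largest absolute residual correlation (off-diagonal elements only)
off_diag <- res_cor[lower.tri(res_cor)]
max_abs_resid <- max(abs(off_diag))
max_abs_resid
```

For the model in this tutorial, the equivalent call would be residuals(fit, type = "cor").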
Reporting and Discussion
Once you have a final model with acceptable fit, you would typically report:
Fit indices (CFI, TLI, RMSEA, SRMR, χ², etc.).
Path estimates (standardised/unstandardised) among latent constructs.
Factor loadings and reliability measures for each latent factor (e.g., Cronbach’s alpha, composite reliability).
Any theoretical or practical implications of these relationships in your study context.
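The reliability measures named in the list can be computed with a few lines of base R. The sketch below is a hedged illustration: the loadings are the Std.all values for Motivation from the summary output, the function names are made up for this example, and both formulas assume uncorrelated item errors.

```r
# Two common reliability coefficients, computed with base R.

# Composite reliability from standardised loadings
composite_reliability <- function(std_loadings) {
  true_part  <- sum(std_loadings)^2
  error_part <- sum(1 - std_loadings^2)
  true_part / (true_part + error_part)
}

motivation_loadings <- c(Motiv1 = 0.809, Motiv2 = 0.852, Motiv3 = 0.840)
cr_motivation <- composite_reliability(motivation_loadings)
round(cr_motivation, 3)

# Cronbach's alpha from an item covariance (or correlation) matrix
cronbach_alpha <- function(S) {
  k <- ncol(S)
  (k / (k - 1)) * (1 - sum(diag(S)) / sum(S))
}

# Toy check: three items that all correlate 0.5 with each other
toy_cor <- matrix(0.5, nrow = 3, ncol = 3)
diag(toy_cor) <- 1
alpha_toy <- cronbach_alpha(toy_cor)
alpha_toy
```

For the simulated data, alpha for the Motivation items could be obtained with cronbach_alpha(cov(sport_data[, c("Motiv1", "Motiv2", "Motiv3")])).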
Visualising Structural Equation Models
The semPlot package provides a function called semPaths() that can render path diagrams directly from a fitted lavaan model.
Code
library(semPlot)
Assume we have a fitted lavaan model called fit. We can visualise it via:
library(semPlot)

png("sem_plot.png", width = 800, height = 600)  # open a PNG device
semPaths(
  fit,
  what = "std",
  layout = "tree",
  style = "ram",
  nCharNodes = 0,
  residuals = FALSE,
  intercepts = FALSE
  # etc.
)
dev.off()  # close the PNG device
quartz_off_screen
2
Visualising with lavaanPlot
An alternative package is lavaanPlot, which aims to make generating publication-ready SEM diagrams straightforward.
library(lavaan)
library(lavaanPlot)

lavaanPlot(
  model = fit,
  stand = TRUE,       # use standardized coefficients
  coefs = TRUE,       # display coefficients on paths
  covs = TRUE,        # show covariances (if any)
  stars = "regress",  # add significance stars to regression paths
  node_options = list(shape = "ellipse", fontname = "Helvetica"),
  edge_options = list(color = "gray")
)