Structural Equation Modelling - Demonstration

B1705, Week Six

Introduction to Structural Equation Modeling (SEM)

What is it?

  • A multivariate statistical analysis technique that allows us to specify and estimate models that describe:
    • relationships among observed (measured) variables
    • relationships between latent (unobserved) constructs (factors) underlying these observed variables.

Key aspects of SEM

“Measurement Model”

  • Part of the model that links the observed indicators (e.g., survey items) to their underlying latent variables/factors.

Key aspects of SEM

“Structural Model”

  • Part of the model that specifies how latent variables (and possibly observed variables) relate to each other (i.e., regression paths among latent variables).

A structural model…

Key aspects of SEM

  • Latent Variables: Constructs that cannot be measured directly but are inferred from multiple observed indicators.
  • Model Fit: We want to see how well the hypothesised model reproduces the observed data (i.e., how well the covariance structure in our data is captured by our model).

Model Fit

  • Absolute Fit: Measures how well the model reproduces observed data.

  • Incremental Fit: Compares the model’s fit to a null or baseline model.

  • Parsimony Fit: Evaluates model complexity, penalizing overfitting.


Why is SEM useful?

  • It allows researchers to simultaneously test complex relationships.

  • It accounts for measurement error explicitly by modeling latent constructs.

  • It provides multiple model fit indices to assess how well the specified model aligns with the data.

Generating Dataset

We’ll simulate a dataset involving 120 athletes with the following hypothetical constructs:

  • Motivation: Latent variable measured by three items (Motiv1, Motiv2, Motiv3).

  • PhysicalCondition: Latent variable measured by three items (Phys1, Phys2, Phys3).

  • Performance: Latent variable measured by three items (Perf1, Perf2, Perf3).

A simple conceptual model might look like this:

Motivation → Performance

PhysicalCondition → Performance

That is, an athlete’s motivation and physical condition predict their performance.

This is a highly simplified scenario for demonstration purposes. Real-world examples often involve more nuanced relationships and additional constructs (e.g., psychological well-being, coaching satisfaction, etc.).

Data Generation Code

Code
library(lavaan)

set.seed(123)

# N athletes
N <- 120

# TRUE (population) parameters (for simulation):
# the two exogenous latent variables have mean = 0, variance = 1;
# factor loadings are then defined for each item.

# create latent variables as standardised normal random variables
Motivation         <- rnorm(N, mean = 0, sd = 1)
PhysicalCondition  <- rnorm(N, mean = 0, sd = 1)

# Suppose Performance is influenced by both Motivation and PhysicalCondition
# define Performance = 0.6*Motivation + 0.5*PhysicalCondition + error
# define error also as a normal random variable
Performance <- 0.6 * Motivation + 0.5 * PhysicalCondition + rnorm(N, 0, 1)

# For measurement items, have them reflect their latent constructs
# define item loadings in plausible range, e.g. around 0.7 to 0.9

# Motivation items
Motiv1 <- 0.8 * Motivation + rnorm(N, 0, 0.5)
Motiv2 <- 0.9 * Motivation + rnorm(N, 0, 0.5)
Motiv3 <- 0.7 * Motivation + rnorm(N, 0, 0.5)

# PhysicalCondition items
Phys1 <- 0.8 * PhysicalCondition + rnorm(N, 0, 0.5)
Phys2 <- 0.7 * PhysicalCondition + rnorm(N, 0, 0.5)
Phys3 <- 0.9 * PhysicalCondition + rnorm(N, 0, 0.5)

# Performance items
Perf1 <- 0.8 * Performance + rnorm(N, 0, 0.5)
Perf2 <- 0.8 * Performance + rnorm(N, 0, 0.5)
Perf3 <- 0.8 * Performance + rnorm(N, 0, 0.5)

# Combine into a data frame
sport_data <- data.frame(
  Motiv1, Motiv2, Motiv3,
  Phys1, Phys2, Phys3,
  Perf1, Perf2, Perf3
)

What we’ve done

  • Created three latent constructs in the simulation: Motivation, PhysicalCondition, Performance.
  • Created item-level observed variables (3 items per latent variable).
  • Imposed factor loadings so that each item loads onto its respective latent variable plus some measurement error.
  • Linked Performance with Motivation and PhysicalCondition at the structural level.
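Because the data are simulated, the population values implied by the generating code can be worked out directly, which gives a rough benchmark for the estimates obtained later. The sketch below uses only the coefficients and standard deviations from the simulation code and assumes Motivation and PhysicalCondition are independent (they are drawn separately). The helper name `std_loading` is ours, introduced for illustration.

```r
# Population values implied by the simulation (no random draws involved)

# Variance of latent Performance:
# 0.6^2 * Var(Motivation) + 0.5^2 * Var(PhysicalCondition) + Var(error)
var_perf <- 0.6^2 * 1 + 0.5^2 * 1 + 1^2

# Standardised structural paths = raw path * sd(predictor) / sd(outcome)
std_paths <- c(Motivation = 0.6, PhysicalCondition = 0.5) / sqrt(var_perf)

# Proportion of Performance variance explained by the two predictors
r2_perf <- 1 - 1 / var_perf

# Standardised loading of an item = loading * sd(latent) / sd(item)
# (hypothetical helper; error_sd = 0.5 matches the item noise in the simulation)
std_loading <- function(loading, latent_var, error_sd = 0.5) {
  loading * sqrt(latent_var) / sqrt(loading^2 * latent_var + error_sd^2)
}

round(std_paths, 3)
round(r2_perf, 3)
round(std_loading(c(0.7, 0.8, 0.9), latent_var = 1), 3)  # Motivation / PhysicalCondition items
round(std_loading(0.8, latent_var = var_perf), 3)        # Performance items
```

With only 120 simulated athletes, the sample estimates will differ from these population values by sampling error.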

Exploring and Preparing the Data

Although we generated the data, in a real-world scenario, we’d import the data (e.g., via read.csv), inspect it (e.g., check summary statistics), and clean/prepare it (handle missing data, outliers, etc.) before analysis.
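A minimal sketch of those checks, run on a small made-up data frame rather than the athlete data. The values, the name `raw_data`, the file name in the comment, and the |z| > 3 rule of thumb are all illustrative assumptions.

```r
# Toy data standing in for an imported file,
# e.g. raw_data <- read.csv("athletes.csv")  (hypothetical file name)
raw_data <- data.frame(
  Motiv1 = c(0.2, -0.5, NA, 1.1, 0.3),
  Motiv2 = c(0.1, -0.4, 0.6, 0.9, NA),
  Motiv3 = c(0.0, -0.7, 0.5, 1.3, 0.2)
)

# Count missing values per item
missing_per_item <- colSums(is.na(raw_data))
missing_per_item

# Keep only rows with complete responses (listwise deletion)
clean_data <- raw_data[complete.cases(raw_data), ]
nrow(clean_data)

# Flag potential outliers: standardise each item and look for |z| > 3
z_scores <- scale(clean_data)
outlier_rows <- which(apply(abs(z_scores) > 3, 1, any))
outlier_rows
```

Listwise deletion is only one way to handle missing data; it is used here because it is the simplest to show.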

Summary statistics

Code
# Summary statistics
summary(sport_data)
     Motiv1             Motiv2             Motiv3             Phys1           
 Min.   :-1.95578   Min.   :-2.46935   Min.   :-1.87478   Min.   :-1.6706258  
 1st Qu.:-0.53845   1st Qu.:-0.54647   1st Qu.:-0.59084   1st Qu.:-0.6588508  
 Median : 0.02446   Median :-0.05042   Median :-0.02529   Median :-0.1774015  
 Mean   : 0.02414   Mean   : 0.01876   Mean   :-0.04613   Mean   :-0.0008326  
 3rd Qu.: 0.54019   3rd Qu.: 0.70533   3rd Qu.: 0.53138   3rd Qu.: 0.5912595  
 Max.   : 2.30500   Max.   : 2.30523   Max.   : 1.94040   Max.   : 2.8674545  
     Phys2              Phys3               Perf1              Perf2         
 Min.   :-1.95334   Min.   :-1.714333   Min.   :-2.65410   Min.   :-2.82536  
 1st Qu.:-0.47479   1st Qu.:-0.775219   1st Qu.:-0.74166   1st Qu.:-0.69050  
 Median :-0.01934   Median :-0.181477   Median : 0.03018   Median : 0.04108  
 Mean   : 0.05195   Mean   : 0.002534   Mean   : 0.06116   Mean   : 0.05313  
 3rd Qu.: 0.54350   3rd Qu.: 0.699804   3rd Qu.: 0.85931   3rd Qu.: 0.92512  
 Max.   : 2.87921   Max.   : 2.850848   Max.   : 2.56591   Max.   : 2.97722  
     Perf3         
 Min.   :-2.47781  
 1st Qu.:-0.65611  
 Median : 0.12665  
 Mean   : 0.05785  
 3rd Qu.: 0.72227  
 Max.   : 2.69991  

Correlations

Code
library(corrplot)

# Compute correlation matrix
M <- cor(sport_data)

# Create a heatmap-style plot
corrplot(M, 
         method = "color",   # color-coded squares
         type = "upper",     # show upper triangular matrix only
         addCoef.col = "black",  # add correlation coefficients
         tl.col = "black",   # text label color
         tl.srt = 45         # rotate text labels
)

Correlations

round(cor(sport_data), 2)
       Motiv1 Motiv2 Motiv3 Phys1 Phys2 Phys3 Perf1 Perf2 Perf3
Motiv1   1.00   0.68   0.69  0.01  0.03  0.09  0.30  0.31  0.39
Motiv2   0.68   1.00   0.72  0.09  0.03  0.01  0.36  0.34  0.42
Motiv3   0.69   0.72   1.00  0.06 -0.01  0.07  0.30  0.29  0.30
Phys1    0.01   0.09   0.06  1.00  0.58  0.65  0.30  0.23  0.24
Phys2    0.03   0.03  -0.01  0.58  1.00  0.72  0.30  0.26  0.18
Phys3    0.09   0.01   0.07  0.65  0.72  1.00  0.38  0.35  0.24
Perf1    0.30   0.36   0.30  0.30  0.30  0.38  1.00  0.82  0.76
Perf2    0.31   0.34   0.29  0.23  0.26  0.35  0.82  1.00  0.79
Perf3    0.39   0.42   0.30  0.24  0.18  0.24  0.76  0.79  1.00

Specifying the SEM Model

Introduction

  • Specifying an SEM model involves defining relationships between latent and observed variables, specifying regression paths, and incorporating error terms and constraints.

  • It establishes which variables are exogenous (independent, not influenced by other variables) or endogenous (influenced by other variables in the model), ensuring the model aligns with theory and is statistically identifiable.

Endogenous and Exogenous

  • Exogenous Variables (Independent Variables):

    • Motivation (not influenced by any other variable)

    • Physical Condition (not influenced by any other variable)

  • Endogenous Variable (Dependent Variable):

    • Performance (influenced by both Motivation and Physical Condition)

Note!

In an SEM model, the observed variables (Motiv1, Motiv2, Motiv3, Phys1, Phys2, Phys3, Perf1, Perf2, Perf3) are indicators rather than exogenous or endogenous variables themselves; they measure the latent constructs.

Measurement Model

  • At the level of the measurement model, we need to specify how the observed items load on their respective latent factors.

  • In lavaan syntax, a single latent variable is specified by listing the factor name to the left of =~ and the item names to the right.

For instance:

Motivation =~ Motiv1 + Motiv2 + Motiv3

Structural Model

  • At the level of the structural model, we specify how the latent factors relate to each other.

  • Our hypothesis is that Motivation and PhysicalCondition both positively predict Performance.

Hence, we might write in lavaan syntax:

Performance ~ Motivation + PhysicalCondition

Putting it all together

So the full model becomes:

sem_model <- '
  # Measurement Model
  Motivation =~ Motiv1 + Motiv2 + Motiv3
  PhysicalCondition =~ Phys1 + Phys2 + Phys3
  Performance =~ Perf1 + Perf2 + Perf3

  # Structural Model
  Performance ~ Motivation + PhysicalCondition
'

In the code:

  • =~ indicates that items load on their respective latent variable.

  • Performance ~ Motivation + PhysicalCondition indicates that Performance is regressed on Motivation and PhysicalCondition.
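A quick identification check is to count pieces of information against free parameters. The sketch below assumes lavaan's defaults for `sem()`: the first loading of each factor is fixed to 1 to set the latent scale, and a covariance between the two exogenous latent variables is added automatically. The variable names are ours.

```r
# Number of observed items in the model
n_items <- 9

# Unique variances and covariances among the items: p(p + 1) / 2
n_moments <- n_items * (n_items + 1) / 2

# Free parameters under lavaan's default settings
free_params <- c(
  loadings           = 6,  # 9 loadings minus 3 fixed to 1
  residual_variances = 9,  # one per observed item
  latent_variances   = 2,  # Motivation, PhysicalCondition
  latent_covariance  = 1,  # Motivation ~~ PhysicalCondition
  regression_paths   = 2,  # Performance ~ Motivation + PhysicalCondition
  disturbance        = 1   # residual variance of Performance
)

n_free   <- sum(free_params)
df_model <- n_moments - n_free

c(moments = n_moments, free = n_free, df = df_model)
```

A positive number of degrees of freedom means the model is over-identified by the counting rule, so its fit can be tested. The counts should agree with the "Number of model parameters" and "Degrees of freedom" lines in the lavaan summary.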

Fitting the Model in R using lavaan

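We can now estimate the parameters of this model (factor loadings, regression coefficients, variances and covariances) using our simulated dataset. Item intercepts are not estimated by default; they are only added if a mean structure is requested (meanstructure = TRUE).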

R code

Code
# Fit the model
fit <- sem(model = sem_model, data = sport_data)

Summary of our model

# View a summary of the fitted model
summary(fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-19 ended normally after 31 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        21

  Number of observations                           120

Model Test User Model:
                                                      
  Test statistic                                28.950
  Degrees of freedom                                24
  P-value (Chi-square)                           0.222

Model Test Baseline Model:

  Test statistic                               676.336
  Degrees of freedom                                36
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.992
  Tucker-Lewis Index (TLI)                       0.988

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -1155.721
  Loglikelihood unrestricted model (H1)      -1141.246
                                                      
  Akaike (AIC)                                2353.442
  Bayesian (BIC)                              2411.980
  Sample-size adjusted Bayesian (SABIC)       2345.588

Root Mean Square Error of Approximation:

  RMSEA                                          0.041
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.089
  P-value H_0: RMSEA <= 0.050                    0.568
  P-value H_0: RMSEA >= 0.080                    0.100

Standardized Root Mean Square Residual:

  SRMR                                           0.034

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                       Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Motivation =~                                                             
    Motiv1                1.000                               0.713    0.809
    Motiv2                1.131    0.115    9.800    0.000    0.806    0.852
    Motiv3                0.934    0.096    9.705    0.000    0.666    0.840
  PhysicalCondition =~                                                      
    Phys1                 1.000                               0.671    0.720
    Phys2                 1.049    0.129    8.156    0.000    0.704    0.799
    Phys3                 1.334    0.159    8.397    0.000    0.896    0.908
  Performance =~                                                            
    Perf1                 1.000                               0.956    0.895
    Perf2                 1.056    0.072   14.619    0.000    1.009    0.921
    Perf3                 0.919    0.071   12.988    0.000    0.878    0.857

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Performance ~                                                         
    Motivation        0.558    0.125    4.464    0.000    0.416    0.416
    PhysicalCondtn    0.525    0.134    3.916    0.000    0.369    0.369

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Motivation ~~                                                         
    PhysicalCondtn    0.030    0.050    0.597    0.551    0.062    0.062

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .Motiv1            0.268    0.049    5.520    0.000    0.268    0.345
   .Motiv2            0.246    0.053    4.604    0.000    0.246    0.274
   .Motiv3            0.186    0.038    4.897    0.000    0.186    0.295
   .Phys1             0.419    0.065    6.431    0.000    0.419    0.482
   .Phys2             0.281    0.053    5.291    0.000    0.281    0.362
   .Phys3             0.171    0.066    2.595    0.009    0.171    0.175
   .Perf1             0.226    0.046    4.934    0.000    0.226    0.198
   .Perf2             0.183    0.046    4.020    0.000    0.183    0.153
   .Perf3             0.279    0.047    5.925    0.000    0.279    0.266
    Motivation        0.508    0.100    5.082    0.000    1.000    1.000
    PhysicalCondtn    0.451    0.105    4.296    0.000    1.000    1.000
   .Performance       0.613    0.107    5.738    0.000    0.672    0.672

Explanation of arguments

  • model = sem_model is our model specification string.
  • data = sport_data tells lavaan which dataset to use.
  • summary(..., fit.measures = TRUE, standardized = TRUE) provides additional output:
    • fit.measures = TRUE displays commonly used fit indices (e.g., CFI, TLI, RMSEA, SRMR).
    • standardized = TRUE displays both unstandardized and standardized estimates.
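Beyond `summary()`, lavaan can return the same information as ordinary R objects, which is convenient for tables and reports. The sketch below is self-contained, so it simulates stand-in data with lavaan's `simulateData()` rather than reusing `sport_data`; the names `pop_model`, `demo_data` and `demo_fit` are ours, and the numbers it produces will not match the summary above. The same two extractor calls can be applied to `fit`.

```r
library(lavaan)

# Population model used only to simulate stand-in data (values mirror the demo)
pop_model <- '
  Motivation        =~ 0.8*Motiv1 + 0.9*Motiv2 + 0.7*Motiv3
  PhysicalCondition =~ 0.8*Phys1 + 0.7*Phys2 + 0.9*Phys3
  Performance       =~ 0.8*Perf1 + 0.8*Perf2 + 0.8*Perf3
  Performance ~ 0.6*Motivation + 0.5*PhysicalCondition
'
demo_data <- simulateData(pop_model, sample.nobs = 500, seed = 123)

# Analysis model: same structure as sem_model in the slides
demo_model <- '
  Motivation        =~ Motiv1 + Motiv2 + Motiv3
  PhysicalCondition =~ Phys1 + Phys2 + Phys3
  Performance       =~ Perf1 + Perf2 + Perf3
  Performance ~ Motivation + PhysicalCondition
'
demo_fit <- sem(demo_model, data = demo_data)

# Selected fit indices as a named numeric vector
fitMeasures(demo_fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))

# Parameter table as a data frame; keep only the structural paths
est <- parameterEstimates(demo_fit, standardized = TRUE)
est[est$op == "~", c("lhs", "op", "rhs", "est", "se", "pvalue", "std.all")]
```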

Interpreting the Output

Model fit indices

  • Chi-square (χ2) test statistic: tests exact fit; a non-significant p-value means exact fit is not rejected (often criticised as being too sensitive to sample size).

  • CFI (Comparative Fit Index): values close to .95 or higher are considered indicative of good fit.
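To make these two indices less of a black box, the sketch below recomputes the p-value and the CFI from the chi-square statistics printed in the summary. It uses the standard CFI formula; the variable names are ours, and small differences from lavaan's output can arise from rounding of the printed inputs.

```r
# Values copied from the lavaan summary
chisq_model    <- 28.950
df_model       <- 24
chisq_baseline <- 676.336
df_baseline    <- 36

# Exact-fit test: upper-tail probability of the chi-square distribution
p_value <- pchisq(chisq_model, df = df_model, lower.tail = FALSE)

# CFI compares the model's excess misfit (chi-square minus df)
# with the excess misfit of the baseline model
excess_model    <- max(chisq_model - df_model, 0)
excess_baseline <- max(chisq_baseline - df_baseline, excess_model)
cfi <- 1 - excess_model / excess_baseline

round(c(p_value = p_value, CFI = cfi), 3)
```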

Model fit indices

  • TLI (Tucker-Lewis Index): also close to .95 or higher for good fit.

  • RMSEA (Root Mean Square Error of Approximation): .06 or below is often considered a good fit.
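Both indices can also be recomputed by hand from the chi-square statistics in the summary, as sketched below. The variable names are ours. The RMSEA line uses N in the denominator, which reproduces the value reported here; some textbooks use N - 1 instead, so treat that choice as an assumption.

```r
# Values copied from the lavaan summary
chisq_model    <- 28.950
df_model       <- 24
chisq_baseline <- 676.336
df_baseline    <- 36
n_obs          <- 120

# TLI: improvement in chi-square per degree of freedom relative to the baseline
ratio_model    <- chisq_model / df_model
ratio_baseline <- chisq_baseline / df_baseline
tli <- (ratio_baseline - ratio_model) / (ratio_baseline - 1)

# RMSEA: excess misfit per degree of freedom, scaled by sample size
rmsea <- sqrt(max(chisq_model - df_model, 0) / (df_model * n_obs))

round(c(TLI = tli, RMSEA = rmsea), 3)
```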

Model fit indices

  • SRMR (Standardized Root Mean Square Residual): .08 or below is often considered acceptable.
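The guideline values can be checked against the reported indices in one small table. This is only a convenience sketch: the index values are copied from the summary, the cutoffs are the conventional ones listed in these slides, and the name `fit_check` is ours.

```r
# Reported indices alongside the guideline values used in this lecture
fit_check <- data.frame(
  index     = c("CFI", "TLI", "RMSEA", "SRMR"),
  value     = c(0.992, 0.988, 0.041, 0.034),
  cutoff    = c(0.95, 0.95, 0.06, 0.08),
  direction = c(">=", ">=", "<=", "<=")
)

# CFI and TLI should be at or above the cutoff; RMSEA and SRMR at or below it
fit_check$meets_guideline <- ifelse(
  fit_check$direction == ">=",
  fit_check$value >= fit_check$cutoff,
  fit_check$value <= fit_check$cutoff
)

fit_check
```

Cutoffs like these are rules of thumb, so the table is a summary aid rather than a pass/fail test.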

Regression coefficients

Look for estimates, standard errors, and p-values for the paths:

  • Performance ← Motivation , Performance ← PhysicalCondition


Factor loadings

  • The measurement portion will list loadings of each item on its factor.

  • Typically, standardised loadings (the Std.all column) above ~0.60 are considered good.

  • Standard errors and p-values indicate whether they differ significantly from zero (they should, if the items indeed measure the construct).
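One way to read the standardised loadings is to square them: the result is the share of each item's variance explained by its factor. The sketch below does this for the Std.all values copied from the summary; the object names are ours.

```r
# Standardised loadings (Std.all column) copied from the summary
std_loadings <- c(
  Motiv1 = 0.809, Motiv2 = 0.852, Motiv3 = 0.840,
  Phys1  = 0.720, Phys2  = 0.799, Phys3  = 0.908,
  Perf1  = 0.895, Perf2  = 0.921, Perf3  = 0.857
)

# Squared standardised loading = proportion of item variance explained by the factor
explained <- round(std_loadings^2, 3)

# Items falling below the ~0.60 rule of thumb
weak_items <- names(std_loadings)[std_loadings < 0.60]

explained
weak_items
```

One minus each squared loading should agree, up to rounding, with the Std.all value of that item's residual variance in the Variances table.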

Standardised estimates

Standardised estimates put all paths on a common scale, which makes it straightforward to compare the relative strength of each predictor.
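The standardised solution also gives the variance explained in Performance. As a sketch, the block below computes R-squared in two ways from values copied from the summary (the two standardised paths, the latent correlation, and the standardised residual variance of Performance); the variable names are ours.

```r
# Standardised values (Std.all column) copied from the summary
beta_motiv <- 0.416   # Performance ~ Motivation
beta_phys  <- 0.369   # Performance ~ PhysicalCondition
r_latent   <- 0.062   # Motivation ~~ PhysicalCondition
resid_perf <- 0.672   # standardised residual variance of Performance

# R-squared from the paths: sum of squared paths plus the shared part
r2_from_paths <- beta_motiv^2 + beta_phys^2 + 2 * beta_motiv * beta_phys * r_latent

# R-squared from the residual: one minus the unexplained share
r2_from_residual <- 1 - resid_perf

round(c(from_paths = r2_from_paths, from_residual = r2_from_residual), 3)
```

The two routes should agree up to rounding of the printed estimates.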

Refining and Checking Model Fit

  • If the initial model fit is not satisfactory, researchers often examine:

    • Modification Indices to see if freeing certain parameters might improve fit. (Always interpret these carefully and theoretically!)

    • Residual correlation matrix to identify areas where the model does not capture the relationships well.

In lavaan, you can check modification indices using:

modindices(fit, sort = TRUE, minimum.value = 10)
[1] lhs      op       rhs      mi       epc      sepc.lv  sepc.all sepc.nox
<0 rows> (or 0-length row.names)

Warning: Blindly adding modifications can lead to overfitting. You should only add paths that make conceptual sense.
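Because our simulated model fits well, the modification-index table above is empty. To see what a non-empty table and a residual correlation matrix look like, the sketch below uses a stand-in: the `HolzingerSwineford1939` dataset that ships with lavaan and its usual three-factor CFA. This dataset and the names `hs_model` and `hs_fit` are not part of the athlete example; the same two calls can be applied to `fit`.

```r
library(lavaan)

# Stand-in model: three-factor CFA on lavaan's built-in example data
hs_model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'
hs_fit <- cfa(hs_model, data = HolzingerSwineford1939)

# Modification indices of 10 or more, largest first
mi <- modindices(hs_fit, sort = TRUE, minimum.value = 10)
head(mi[, c("lhs", "op", "rhs", "mi", "epc")])

# Residual correlations: observed minus model-implied correlations
residuals(hs_fit, type = "cor")
```

Large residual correlations and large modification indices usually point to the same item pairs, which is a useful cross-check before deciding whether a change is theoretically defensible.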

Reporting and Discussion

Once you have a final model with acceptable fit, you would typically report:

  • Fit indices (CFI, TLI, RMSEA, SRMR, χ2 , etc.).

  • Path estimates (standardised/unstandardised) among latent constructs.

  • Factor loadings and reliability measures for each latent factor (e.g., Cronbach’s alpha, composite reliability).

  • Any theoretical or practical implications of these relationships in your study context.
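Composite reliability can be approximated from the standardised loadings alone. The sketch below uses the common formula (sum of loadings squared, divided by that quantity plus the summed error variances) and assumes uncorrelated item errors; the helper `composite_reliability` is a hypothetical name, and the loadings are the Std.all values from the summary.

```r
# Composite reliability from standardised loadings
# (hypothetical helper; assumes uncorrelated item errors)
composite_reliability <- function(std_loadings) {
  sum_sq <- sum(std_loadings)^2          # squared sum of loadings
  error  <- sum(1 - std_loadings^2)      # summed standardised error variances
  sum_sq / (sum_sq + error)
}

# Std.all loadings copied from the summary, grouped by factor
loadings_by_factor <- list(
  Motivation        = c(0.809, 0.852, 0.840),
  PhysicalCondition = c(0.720, 0.799, 0.908),
  Performance       = c(0.895, 0.921, 0.857)
)

cr <- sapply(loadings_by_factor, composite_reliability)
round(cr, 3)
```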

Visualising Structural Equation Models

The semPlot package provides a function called semPaths() that can render path diagrams directly from a fitted lavaan model.

Code
library(semPlot)

Assume we have a fitted lavaan model called fit. We can visualise it via:

Code
library(semPlot)

png("sem_plot.png", width = 800, height = 600)  # open a PNG device

semPaths(fit, 
         what = "std",
         layout = "tree",
         style = "ram",
         nCharNodes = 0,
         residuals = FALSE,
         intercepts = FALSE
         # etc. 
)

dev.off()  # close the PNG device
quartz_off_screen 
                2 

Visualising with lavaanPlot

An alternative package is lavaanPlot, which aims to make generating publication-ready SEM diagrams straightforward.

Code
library(lavaan)
library(lavaanPlot)

lavaanPlot(
  model = fit,
  stand = TRUE,        # use standardized coefficients
  coefs = TRUE,        # display coefficients on paths
  covs = TRUE,         # show covariances (if any)
  stars = "regress",   # add significance stars to regression paths
  node_options = list(shape = "ellipse", fontname = "Helvetica"),
  edge_options = list(color = "gray")
)

Plot

Summary of Methods