Canonical Correlation Analysis in R

1 Introduction

  • This section demonstrates Canonical Correlation Analysis (CCA) step by step using R.
  • We’ll use a sport-based dataset to mimic a real-world application.
  • We’ll explore why CCA is useful.

2 Load Libraries

# Libraries for CCA

library(CCA)  # For canonical correlation analysis
library(knitr) # For formatting tables
library(kableExtra) # Really useful for nice tables!
library(MASS)  # For generalised inverse if needed
library(ggplot2) # For visual exploration
library(GGally)  # For correlation matrix

# ggcorrplot is also used later; it is loaded where it is first needed

3 Define the Dataset

  • We have a sport-related dataset with two groups of variables:
    • Physical attributes (e.g., speed, endurance, strength, flexibility)
    • Performance metrics (e.g., goals scored, assists, accuracy, stamina)
  • Our goal is to understand how these two sets of variables relate to each other.
Physical Attributes (first six rows)

Speed  Endurance  Strength  Flexibility
6.2    77.5       71.5      66.3
6.7    74.7       83.1      66.2
9.3    74.6       77.0      62.7
7.1    88.7       75.8      51.9
7.2    72.7       68.6      59.0
9.6    90.2       79.5      57.8

Performance Metrics (first six rows)

Goals_Scored  Assists  Accuracy  Stamina
18.8          6.9      62.8      87.1
15.2          6.3      62.5      60.1
8.9           7.0      60.6      74.2
12.2          8.3      59.5      79.0
8.3           12.8     65.6      70.0
8.1           7.7      73.3      81.0
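The code that creates A and B is not shown. As a minimal sketch, assuming the two tables are stored as data frames named A and B (the names the later code relies on), they could be built as follows. Only the six rows printed above are used here; the full dataset has more rows, so results computed from this sketch will not reproduce the output shown in later sections.

```r
# Hypothetical reconstruction from the six printed rows only.
A <- data.frame(
  Speed       = c(6.2, 6.7, 9.3, 7.1, 7.2, 9.6),
  Endurance   = c(77.5, 74.7, 74.6, 88.7, 72.7, 90.2),
  Strength    = c(71.5, 83.1, 77.0, 75.8, 68.6, 79.5),
  Flexibility = c(66.3, 66.2, 62.7, 51.9, 59.0, 57.8)
)
B <- data.frame(
  Goals_Scored = c(18.8, 15.2, 8.9, 12.2, 8.3, 8.1),
  Assists      = c(6.9, 6.3, 7.0, 8.3, 12.8, 7.7),
  Accuracy     = c(62.8, 62.5, 60.6, 59.5, 65.6, 73.3),
  Stamina      = c(87.1, 60.1, 74.2, 79.0, 70.0, 81.0)
)

# Both sets must describe the same athletes, row by row.
stopifnot(nrow(A) == nrow(B))
```

CCA pairs the rows of A with the rows of B, so the two data frames must have the same number of rows in the same order.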

4 Exploratory Data Analysis

  • Before applying CCA, we explore the relationships within and between datasets.
  • We check correlations between physical attributes and performance metrics.
  • If we see strong correlations, CCA may be useful…
# Correlation matrix for Physical Attributes
GGally::ggpairs(A) 

# Correlation matrix for Performance Metrics
GGally::ggpairs(B) 

# Correlations among all variables in A and B (within and between sets)
cor_matrix <- round(cor(cbind(A, B)),2)
kable(cor_matrix, caption = "Correlation Matrix Between All Variables") %>% kable_styling()
Correlation Matrix Between All Variables

              Speed  Endurance  Strength  Flexibility  Goals_Scored  Assists  Accuracy  Stamina
Speed         1.00   -0.04      0.03      -0.12        -0.21         0.02     -0.03     -0.14
Endurance     -0.04  1.00       -0.16     -0.16        -0.13         -0.06    0.09      -0.05
Strength      0.03   -0.16      1.00      -0.01        -0.04         -0.19    -0.03     0.07
Flexibility   -0.12  -0.16      -0.01     1.00         0.20          0.06     -0.05     0.13
Goals_Scored  -0.21  -0.13      -0.04     0.20         1.00          -0.05    0.05      -0.03
Assists       0.02   -0.06      -0.19     0.06         -0.05         1.00     -0.13     -0.13
Accuracy      -0.03  0.09       -0.03     -0.05        0.05          -0.13    1.00      -0.10
Stamina       -0.14  -0.05      0.07      0.13         -0.03         -0.13    -0.10     1.00
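For CCA the block of most interest is the one relating A's variables to B's. A minimal sketch of extracting just that block with cor(A, B) is given here; it uses simulated stand-in data (an assumption for illustration, not the tutorial's dataset) so that it runs on its own.

```r
set.seed(1)  # simulated stand-in data, not the tutorial's dataset
A <- data.frame(Speed = rnorm(30), Endurance = rnorm(30))
B <- data.frame(Goals_Scored = rnorm(30), Assists = rnorm(30))

# cor(A, B) returns only the between-set block:
# rows are A's variables, columns are B's variables.
cross_cor <- cor(A, B)

# The same numbers sit in the off-diagonal block of the full matrix.
full_cor <- cor(cbind(A, B))
stopifnot(isTRUE(all.equal(cross_cor, full_cor[names(A), names(B)])))

round(cross_cor, 2)
```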

4.1 Visualisation of the correlation matrix between variables

library(ggcorrplot)

# Compute correlation matrix between all variables
cor_matrix <- round(cor(cbind(A, B)), 2)

# Create heatmap-style correlation plot
ggcorrplot(cor_matrix, 
           method = "circle",  
           type = "lower",      
           lab = TRUE,          
           colors = c("#D73027", "white", "#1A9850"),  # Red - Green gradient
           title = "Heatmap of Correlations Between All Variables") +
  theme_minimal()

5 Centering the Data

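  • We subtract the mean from each variable, so every centered column has mean zero.
  • Centering removes each variable's mean offset without changing its variance or its covariance with other variables. Note that cov() centers the data internally, so this step is shown for illustration rather than being required.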
A_centered <- scale(A, center = TRUE, scale = FALSE)
B_centered <- scale(B, center = TRUE, scale = FALSE)

kable(head(A_centered,6), caption = "Phys Attributes Centered") %>% kable_styling()
Phys Attributes Centered

Speed   Endurance  Strength  Flexibility
-0.856  1.036      -5.452    5.992
-0.356  -1.764     6.148     5.892
2.244   -1.864     0.048     2.392
0.044   12.236     -1.152    -8.408
0.144   -3.764     -8.352    -1.308
2.544   13.736     2.548     -2.508

kable(head(B_centered,6), caption = "Perf Metrics Centered") %>% kable_styling()
Perf Metrics Centered

Goals_Scored  Assists  Accuracy  Stamina
8.842         -1.846   -7.226    9.768
5.242         -2.446   -7.526    -17.232
-1.058        -1.746   -9.426    -3.132
2.242         -0.446   -10.526   1.668
-1.658        4.054    -4.426    -7.332
-1.858        -1.046   3.274     3.668
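A quick way to check the centering step is to confirm that the column means are zero and that the covariance matrix is unchanged. The sketch below does this on simulated stand-in data (an assumption for illustration, not the tutorial's dataset).

```r
set.seed(2)  # simulated stand-in data, not the tutorial's dataset
A <- matrix(rnorm(40, mean = 50, sd = 10), nrow = 10,
            dimnames = list(NULL, c("Speed", "Endurance", "Strength", "Flexibility")))
A_centered <- scale(A, center = TRUE, scale = FALSE)

# Column means of the centered data are zero up to rounding error.
col_means <- colMeans(A_centered)

# The covariance matrix is the centered cross-product divided by n - 1,
# and it matches cov() applied to the uncentered data.
S_manual <- crossprod(A_centered) / (nrow(A) - 1)
stopifnot(isTRUE(all.equal(S_manual, cov(A), check.attributes = FALSE)))
```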

6 Compute Covariance Matrices

S_AA <- cov(A_centered)
S_BB <- cov(B_centered)
S_AB <- cov(A_centered, B_centered)
S_BA <- t(S_AB)

kable(S_AA, caption = "Covariance Matrix of A") %>% kable_styling()
Covariance Matrix of A

             Speed       Endurance    Strength     Flexibility
Speed        1.9123102   -0.4634531   0.4674367    -1.2537224
Endurance    -0.4634531  82.0758204   -16.6872735  -10.5227673
Strength     0.4674367   -16.6872735  140.7556082  -0.6830776
Flexibility  -1.2537224  -10.5227673  -0.6830776   55.4868735

6.1 Interpreting the covariance matrices

The covariance matrix of A provides insights into how the physical attributes (Speed, Endurance, Strength, and Flexibility) vary together.

Each value in the matrix represents how two variables change in relation to each other.

6.1.1 Diagonal Values

The diagonal values represent the variance of each variable.

  • A higher value indicates that this variable has more variation across athletes. Here Strength has the largest variance (140.76) and Speed the smallest (1.91).

6.1.2 Off-Diagonal Elements

  • These values show the covariance between pairs of variables.
  • A positive covariance means that when one variable increases, the other tends to increase as well.
  • A negative covariance suggests that as one variable increases, the other tends to decrease.

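For example:

  • If the covariance between Speed and Endurance were large and positive, faster athletes would also tend to have higher endurance. In this dataset it is slightly negative (-0.46).
  • A negative covariance between Strength and Flexibility would indicate that stronger athletes tend to be less flexible. Here it is -0.68, which is negligible relative to the two variances.

Because covariances depend on each variable's units, their raw sizes are hard to compare across pairs; the correlations in Section 4 put them on a common scale. This matrix helps us see which physical attributes move together and can inform training decisions, although an association alone does not show that improving one attribute would improve another.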
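To compare the strength of these relationships on a unit-free scale, a covariance matrix can be rescaled to a correlation matrix. The sketch below applies base R's cov2cor() to the covariance matrix of A as printed above; the matrix is typed in by hand so that the example runs on its own.

```r
vars <- c("Speed", "Endurance", "Strength", "Flexibility")

# Covariance matrix of A, copied from the table above.
S_AA <- matrix(c(
   1.9123102,  -0.4634531,   0.4674367,  -1.2537224,
  -0.4634531,  82.0758204, -16.6872735, -10.5227673,
   0.4674367, -16.6872735, 140.7556082,  -0.6830776,
  -1.2537224, -10.5227673,  -0.6830776,  55.4868735
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Divide each covariance by the product of the two standard deviations.
R_AA <- cov2cor(S_AA)
round(R_AA, 2)
```

The rounded result should agree with the upper-left block of the correlation matrix in Section 4, where all off-diagonal values are small.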

kable(S_BB, caption = "Covariance Matrix of B") %>% kable_styling()
Covariance Matrix of B

              Goals_Scored  Assists     Accuracy    Stamina
Goals_Scored  14.3457510    -0.5608857  1.857237    -1.192914
Assists       -0.5608857    8.0233510   -3.653465   -3.415176
Accuracy      1.8572367     -3.6534653  106.331351  -9.767788
Stamina       -1.1929143    -3.4151755  -9.767788   90.113241

kable(S_AB, caption = "Cross-Covariance Matrix") %>% kable_styling()
Cross-Covariance Matrix

             Goals_Scored  Assists     Accuracy    Stamina
Speed        -1.123314     0.0843102   -0.4653633  -1.872237
Endurance    -4.316645     -1.5548408  8.4095265   -4.541886
Strength     -1.829812     -6.3312163  -3.3772980  7.941771
Flexibility  5.573812      1.3190122   -4.0555184  9.279535

6.2 Interpreting the Cross-Covariance Matrix (S_AB)

The cross-covariance matrix S_AB measures how the variables in dataset A (physical attributes) relate to the variables in dataset B (performance metrics).

6.2.1 What It Tells Us

Strength of Relationships Between A and B

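  • Each value in the matrix is the covariance between one variable in A and one variable in B, that is, how the two vary together across athletes.
  • A positive value means that as a physical attribute increases, the performance metric also tends to increase.
  • A negative value means that as a physical attribute increases, the performance metric tends to decrease.
  • The size of a covariance depends on the units of both variables, so magnitudes are not directly comparable across cells.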

6.2.2 Example Interpretations

  • If Speed and Goals Scored had a high positive covariance, faster athletes would tend to score more goals. In this dataset the value is negative (-1.12).
  • If Strength and Accuracy have a low or negative covariance (here -3.38), stronger athletes do not tend to shoot more accurately.

6.2.3 Why It Matters for CCA

The goal of Canonical Correlation Analysis (CCA) is to find a linear combination of the variables in A and a linear combination of the variables in B that are maximally correlated with each other. The cross-covariance matrix shows which pairs of variables are most strongly linked and, together with S_AA and S_BB, determines these combinations. S_AB is therefore a key ingredient for understanding how physical ability relates to performance outcomes, and it lays the foundation for finding the strongest overall relationships using CCA.
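To make the link between the three covariance matrices and CCA concrete: the canonical correlations are the singular values of the whitened cross-covariance S_AA^(-1/2) S_AB S_BB^(-1/2). The sketch below checks this against cancor() on simulated stand-in data (an assumption for illustration, not the tutorial's dataset); inv_sqrt is a hypothetical helper defined only for this example.

```r
set.seed(3)  # simulated stand-in data, not the tutorial's dataset
n <- 100
A <- matrix(rnorm(n * 3), n, 3)
B <- cbind(A[, 1] + rnorm(n), rnorm(n))  # B's first column depends on A

S_AA <- cov(A)
S_BB <- cov(B)
S_AB <- cov(A, B)

# Hypothetical helper: inverse symmetric square root via eigendecomposition.
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

# Singular values of the whitened cross-covariance matrix.
K <- inv_sqrt(S_AA) %*% S_AB %*% inv_sqrt(S_BB)
rho <- svd(K)$d

# They agree with the canonical correlations from stats::cancor().
stopifnot(isTRUE(all.equal(rho, cancor(A, B)$cor, tolerance = 1e-6)))
rho
```

This route assumes S_AA and S_BB are invertible; cancor() itself works from a QR decomposition of the centered data rather than from explicit inverses.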

7 Visualising the Covariance Matrices

7.1 Using heatmaps

library(ggcorrplot)

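# Note: ggcorrplot() is designed for correlation matrices; covariances outside
# [-1, 1] may fall outside its colour scale, so read these plots with care.
# Plot covariance matrix for Physical Attributes (A)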
ggcorrplot(S_AA, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           colors = c("#D73027", "white", "#1A9850"), 
           title = "Covariance Matrix: Physical Attributes")

# Plot covariance matrix for Performance Metrics (B)
ggcorrplot(S_BB, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           colors = c("#D73027", "white", "#1A9850"), 
           title = "Covariance Matrix: Performance Metrics")

# Plot covariance between A and B (Cross-Covariance Matrix)
ggcorrplot(S_AB, 
           method = "circle", 
           lab = TRUE, 
           colors = c("#D73027", "white", "#1A9850"), 
           title = "Cross-Covariance Matrix: A vs. B")

7.2 Using scatter plots

library(GGally)

# Scatterplot for physical attributes
GGally::ggpairs(A, title = "Relationships Between Physical Attributes")

# Scatterplot for performance metrics
GGally::ggpairs(B, title = "Relationships Between Performance Metrics")

# Scatterplot between A and B
GGally::ggpairs(cbind(A, B), title = "Relationships Between Physical & Performance Metrics")

8 Compute Canonical Correlation

8.1 Computing canonical correlation in R

Now we can calculate the canonical correlations between our two datasets using cancor() from base R's stats package.

cca_result <- cancor(A, B)
canonical_correlations <- cca_result$cor

kable(data.frame(Canonical_Correlations = canonical_correlations),
      caption = "Canonical Correlations") %>% kable_styling()
Canonical Correlations
Canonical_Correlations
0.3695260
0.2164503
0.0975520
0.0241750

8.2 Interpreting the output

Each value represents the strength of the relationship between a pair of linear combinations of A (physical attributes) and B (performance metrics).

First Canonical Correlation (0.37)

  • This is the strongest relationship found between the two datasets.
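  • At 0.37 it is weak to moderate, and it describes an association rather than showing that physical attributes influence performance. It is necessarily at least as large as the largest absolute pairwise correlation between the two sets (0.21, between Speed and Goals_Scored).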

Second Canonical Correlation (0.22)

  • This is weaker than the first and indicates only a weak relationship.

Third Canonical Correlation (0.10)

  • This is quite weak, suggesting minimal correlation between this pair of canonical variables.

Fourth Canonical Correlation (0.02)

  • This is very close to zero, indicating almost no meaningful relationship.

8.3 Number of canonical correlations produced

  • Note that the maximum number of canonical correlations equals the number of variables in the smaller of the two sets.
  • In this case, A has 4 variables, and B has 4 variables.
  • This means we can have at most 4 canonical correlations.
  • However, only the first one or two tend to be meaningful, as the later ones often capture weak or random relationships.
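One way to judge whether a canonical correlation is larger than chance alone would produce is a permutation test: shuffle the rows of B to break any link with A, recompute the first canonical correlation, and repeat. The sketch below is an illustration on simulated stand-in data (an assumption, not the tutorial's dataset), and the number of permutations is an arbitrary choice.

```r
set.seed(4)  # simulated stand-in data, not the tutorial's dataset
n <- 50
A <- matrix(rnorm(n * 4), n, 4)
B <- matrix(rnorm(n * 4), n, 4)  # generated independently of A

observed <- cancor(A, B)$cor[1]

# Null distribution: shuffling the rows of B breaks any pairing with A.
n_perm <- 500
null_rho1 <- replicate(n_perm, cancor(A, B[sample(n), ])$cor[1])

# Share of permutations with a first canonical correlation at least as
# large as the observed one (with the usual +1 correction).
p_value <- (sum(null_rho1 >= observed) + 1) / (n_perm + 1)
p_value
```

In small samples the first canonical correlation can sit well above zero even when the two sets are unrelated, which is why a reference distribution of this kind is helpful before interpreting it.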

9 Compute Canonical Coefficients

canonical_weights_A <- cca_result$xcoef
canonical_weights_B <- cca_result$ycoef

kable(canonical_weights_A, caption = "Canonical Weights for A") %>% kable_styling()
Canonical Weights for A (one column per canonical dimension)

Speed        -0.0627604  0.0388275   -0.0589409  0.0441739
Endurance    -0.0062280  -0.0040666  0.0100981   0.0102400
Strength     -0.0023771  -0.0112968  -0.0038385  0.0008839
Flexibility  0.0107944   -0.0002215  -0.0042680  0.0157773

kable(canonical_weights_B, caption = "Canonical Weights for B") %>% kable_styling()
Canonical Weights for B (one column per canonical dimension)

Goals_Scored  0.0319353   -0.0009409  0.0026402   -0.0200758
Assists       0.0165542   0.0442138   0.0044469   0.0198735
Accuracy      -0.0007733  0.0004309   0.0139553   0.0015648
Stamina       0.0077856   -0.0057752  0.0010788   0.0117718

10 Interpretation

  • The first canonical correlation shows the strongest relationship between physical attributes and performance.
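  • The canonical weights show how each variable enters the linear combinations, but raw weights depend on each variable's scale: Speed has the largest weights for A partly because it has by far the smallest variance. Standardised weights or loadings are easier to compare.
  • If correlations are strong, training based on these insights could improve performance. Here the canonical correlations are weak, so such conclusions would be premature.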
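To compare variable contributions on a common footing, one option is to compute the canonical variates and then look at standardised weights and loadings. The sketch below shows this for the A side on simulated stand-in data (an assumption for illustration, not the tutorial's dataset); the names U, V, std_weights_A and loadings_A are introduced here and do not appear in the original code.

```r
set.seed(5)  # simulated stand-in data, not the tutorial's dataset
n <- 60
A <- cbind(Speed = rnorm(n, 7, 1.4), Endurance = rnorm(n, 76, 9))
B <- cbind(Goals_Scored = 0.5 * A[, "Speed"] + rnorm(n), Assists = rnorm(n, 8, 3))

cca_result <- cancor(A, B)

# Canonical variates: centered data multiplied by the raw weights.
U <- scale(A, center = cca_result$xcenter, scale = FALSE) %*% cca_result$xcoef
V <- scale(B, center = cca_result$ycenter, scale = FALSE) %*% cca_result$ycoef

# The correlation of each variate pair reproduces the canonical correlations.
stopifnot(isTRUE(all.equal(diag(cor(U, V)), cca_result$cor, tolerance = 1e-6)))

# Standardised weights: each row (variable) is scaled by that variable's
# standard deviation.
std_weights_A <- cca_result$xcoef * apply(A, 2, sd)

# Loadings: correlation of each original variable with each canonical variate.
loadings_A <- cor(A, U)
round(loadings_A, 2)
```

Loadings are bounded between -1 and 1 and do not change with the units of the original variables, which makes them easier to read than raw weights when the variables have very different variances.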

11 Conclusion

In this section we:

  • demonstrated Canonical Correlation Analysis (CCA) step-by-step.
  • explored the data to ensure CCA was appropriate.
  • calculated canonical correlations and weights to understand relationships between physical attributes and performance.