Canonical Correlation Analysis in R
1 Introduction
- This demonstrates Canonical Correlation Analysis (CCA) step-by-step using R.
- We’ll use a dataset based on sport to mimic a real-world application.
- We’ll explore why CCA is useful.
2 Load Libraries
# AH libraries for CCA
library(CCA) # For canonical correlation analysis
library(knitr) # For formatting tables
library(kableExtra) # Really useful for nice tables!
library(MASS) # For generalised inverse if needed
library(ggplot2) # For visual exploration
library(GGally) # For correlation matrix
# ggcorrplot is also loaded later
3 Define the Dataset
- We have a sport-related dataset with two groups of variables:
  - Physical attributes (e.g., speed, endurance, strength, flexibility)
  - Performance metrics (e.g., goals scored, assists, accuracy, stamina)
- Our goal is to understand how these two sets of variables relate to each other.
Speed | Endurance | Strength | Flexibility |
---|---|---|---|
6.2 | 77.5 | 71.5 | 66.3 |
6.7 | 74.7 | 83.1 | 66.2 |
9.3 | 74.6 | 77.0 | 62.7 |
7.1 | 88.7 | 75.8 | 51.9 |
7.2 | 72.7 | 68.6 | 59.0 |
9.6 | 90.2 | 79.5 | 57.8 |
Goals_Scored | Assists | Accuracy | Stamina |
---|---|---|---|
18.8 | 6.9 | 62.8 | 87.1 |
15.2 | 6.3 | 62.5 | 60.1 |
8.9 | 7.0 | 60.6 | 74.2 |
12.2 | 8.3 | 59.5 | 79.0 |
8.3 | 12.8 | 65.6 | 70.0 |
8.1 | 7.7 | 73.3 | 81.0 |
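The code that builds the dataset is not shown in this section. As a minimal sketch, the six displayed rows can be entered by hand as two data frames named A and B, the names used by the later code. The full dataset evidently has more rows (the centred values shown later imply a mean Speed of about 7.06, not the mean of these six rows), so results computed from this sketch will not reproduce the later tables.

```r
# First six rows of each variable set, copied from the tables above.
# Assumption: these are only the head of a larger dataset that is not shown.
A <- data.frame(
  Speed       = c(6.2, 6.7, 9.3, 7.1, 7.2, 9.6),
  Endurance   = c(77.5, 74.7, 74.6, 88.7, 72.7, 90.2),
  Strength    = c(71.5, 83.1, 77.0, 75.8, 68.6, 79.5),
  Flexibility = c(66.3, 66.2, 62.7, 51.9, 59.0, 57.8)
)

B <- data.frame(
  Goals_Scored = c(18.8, 15.2, 8.9, 12.2, 8.3, 8.1),
  Assists      = c(6.9, 6.3, 7.0, 8.3, 12.8, 7.7),
  Accuracy     = c(62.8, 62.5, 60.6, 59.5, 65.6, 73.3),
  Stamina      = c(87.1, 60.1, 74.2, 79.0, 70.0, 81.0)
)

# Inspect the structure of both variable sets
str(A)
str(B)
```

Six rows are too few for a meaningful CCA with four variables per set; the sketch is only meant to show the expected shape of A and B.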
4 Exploratory Data Analysis
- Before applying CCA, we explore the relationships within and between datasets.
- We check correlations between physical attributes and performance metrics.
- If we see strong correlations, CCA may be useful…
# Correlation matrix for Physical Attributes
GGally::ggpairs(A)
# Correlation matrix for Performance Metrics
GGally::ggpairs(B)
# Correlation between A and B
cor_matrix <- round(cor(cbind(A, B)), 2)
kable(cor_matrix, caption = "Correlation Matrix Between All Variables") %>% kable_styling()
| | Speed | Endurance | Strength | Flexibility | Goals_Scored | Assists | Accuracy | Stamina |
---|---|---|---|---|---|---|---|---|
Speed | 1.00 | -0.04 | 0.03 | -0.12 | -0.21 | 0.02 | -0.03 | -0.14 |
Endurance | -0.04 | 1.00 | -0.16 | -0.16 | -0.13 | -0.06 | 0.09 | -0.05 |
Strength | 0.03 | -0.16 | 1.00 | -0.01 | -0.04 | -0.19 | -0.03 | 0.07 |
Flexibility | -0.12 | -0.16 | -0.01 | 1.00 | 0.20 | 0.06 | -0.05 | 0.13 |
Goals_Scored | -0.21 | -0.13 | -0.04 | 0.20 | 1.00 | -0.05 | 0.05 | -0.03 |
Assists | 0.02 | -0.06 | -0.19 | 0.06 | -0.05 | 1.00 | -0.13 | -0.13 |
Accuracy | -0.03 | 0.09 | -0.03 | -0.05 | 0.05 | -0.13 | 1.00 | -0.10 |
Stamina | -0.14 | -0.05 | 0.07 | 0.13 | -0.03 | -0.13 | -0.10 | 1.00 |
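For CCA, the most relevant part of this matrix is the block that pairs each physical attribute with each performance metric. A minimal sketch of extracting that block follows; it uses simulated stand-in data (the values, the seed, and the name cross_cor are assumptions, not the section's dataset).

```r
set.seed(42)  # assumption: arbitrary seed so the illustration is reproducible
n <- 50

# Simulated stand-in data with the same column names as A and B
A <- data.frame(Speed = rnorm(n, 7, 1.4), Endurance = rnorm(n, 76, 9),
                Strength = rnorm(n, 77, 12), Flexibility = rnorm(n, 60, 7))
B <- data.frame(Goals_Scored = rnorm(n, 10, 4), Assists = rnorm(n, 9, 3),
                Accuracy = rnorm(n, 70, 10), Stamina = rnorm(n, 77, 9))

# Full correlation matrix of all eight variables
cor_matrix <- cor(cbind(A, B))

# Between-set block: rows are physical attributes, columns are performance metrics
cross_cor <- cor_matrix[colnames(A), colnames(B)]
round(cross_cor, 2)

# Largest absolute between-set correlation
max(abs(cross_cor))
```

The same block can be obtained directly with cor(A, B).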
4.1 Visualisation of the correlation matrix between variables
library(ggcorrplot)
# Compute correlation matrix between all variables
cor_matrix <- round(cor(cbind(A, B)), 2)
# Create heatmap-style correlation plot
ggcorrplot(cor_matrix,
method = "circle",
type = "lower",
lab = TRUE,
colors = c("#D73027", "white", "#1A9850"), # Red - Green gradient
title = "Heatmap of Correlations Between All Variables") +
theme_minimal()
5 Centering the Data
- We subtract the mean from each variable.
- Centering gives every variable a mean of zero, so the later covariance calculations describe variation around a common origin.
A_centered <- scale(A, center = TRUE, scale = FALSE)
B_centered <- scale(B, center = TRUE, scale = FALSE)
kable(head(A_centered,6), caption = "Phys Attributes Centered") %>% kable_styling()
Speed | Endurance | Strength | Flexibility |
---|---|---|---|
-0.856 | 1.036 | -5.452 | 5.992 |
-0.356 | -1.764 | 6.148 | 5.892 |
2.244 | -1.864 | 0.048 | 2.392 |
0.044 | 12.236 | -1.152 | -8.408 |
0.144 | -3.764 | -8.352 | -1.308 |
2.544 | 13.736 | 2.548 | -2.508 |
kable(head(B_centered,6), caption = "Perf Metrics Centered") %>% kable_styling()
Goals_Scored | Assists | Accuracy | Stamina |
---|---|---|---|
8.842 | -1.846 | -7.226 | 9.768 |
5.242 | -2.446 | -7.526 | -17.232 |
-1.058 | -1.746 | -9.426 | -3.132 |
2.242 | -0.446 | -10.526 | 1.668 |
-1.658 | 4.054 | -4.426 | -7.332 |
-1.858 | -1.046 | 3.274 | 3.668 |
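The centering step can be checked directly. The sketch below uses a small simulated matrix (the data, seed, and object names are assumptions) to confirm that scale() with scale = FALSE matches subtracting the column means by hand, and that centering leaves the covariance matrix unchanged.

```r
set.seed(1)  # assumption: arbitrary seed for a reproducible illustration
X <- cbind(Speed = rnorm(30, 7, 1.4), Endurance = rnorm(30, 76, 9))

# Centre each column without rescaling its spread
X_centered <- scale(X, center = TRUE, scale = FALSE)

# Column means of the centred data are zero up to floating-point error
colMeans(X_centered)

# Equivalent manual calculation: subtract each column's mean
X_manual <- sweep(X, 2, colMeans(X))
all.equal(X_manual, X_centered, check.attributes = FALSE)

# Covariance is unaffected by centering, because cov() centres internally
all.equal(cov(X), cov(X_centered), check.attributes = FALSE)
```

Because cov() already subtracts the means, the explicit centering here is mainly useful for seeing the centred values themselves.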
6 Compute Covariance Matrices
S_AA <- cov(A_centered)
S_BB <- cov(B_centered)
S_AB <- cov(A_centered, B_centered)
S_BA <- t(S_AB)
kable(S_AA, caption = "Covariance Matrix of A") %>% kable_styling()
| | Speed | Endurance | Strength | Flexibility |
---|---|---|---|---|
Speed | 1.9123102 | -0.4634531 | 0.4674367 | -1.2537224 |
Endurance | -0.4634531 | 82.0758204 | -16.6872735 | -10.5227673 |
Strength | 0.4674367 | -16.6872735 | 140.7556082 | -0.6830776 |
Flexibility | -1.2537224 | -10.5227673 | -0.6830776 | 55.4868735 |
6.1 Interpreting the covariance matrices
The covariance matrix of A provides insights into how the physical attributes (Speed, Endurance, Strength, and Flexibility) vary together.
Each value in the matrix represents how two variables change in relation to each other.
6.1.1 Diagonal Values
The diagonal values represent the variance of each variable.
- A higher value indicates that this variable has more variation across athletes.
6.1.2 Off-Diagonal Elements
- These values show the covariance between pairs of variables.
- A positive covariance means that when one variable increases, the other tends to increase as well.
- A negative covariance suggests that as one variable increases, the other tends to decrease.
For example:
- If the covariance between Speed and Endurance is high and positive, faster athletes also tend to have higher endurance.
- If the covariance between Strength and Flexibility is negative, it may indicate that stronger athletes tend to be less flexible.
This matrix helps us understand which physical attributes are related and can guide training decisions. If certain attributes are highly correlated, it may suggest that improving one could also enhance the other.
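Covariances depend on the units of each variable, so their raw magnitudes are hard to compare. As a hedged illustration, the covariance matrix of A shown above can be rescaled to correlations with base R's cov2cor(). The matrix is typed in from the table, and the object names vars and R_AA are chosen here for illustration.

```r
vars <- c("Speed", "Endurance", "Strength", "Flexibility")

# Covariance matrix of A, copied from the table above
S_AA <- matrix(c(
   1.9123102,  -0.4634531,   0.4674367,  -1.2537224,
  -0.4634531,  82.0758204, -16.6872735, -10.5227673,
   0.4674367, -16.6872735, 140.7556082,  -0.6830776,
  -1.2537224, -10.5227673,  -0.6830776,  55.4868735
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Divide each covariance by the product of the two standard deviations
R_AA <- cov2cor(S_AA)
round(R_AA, 2)
```

After rounding, the result agrees with the physical-attributes block of the correlation matrix in the exploratory analysis. For example, the Strength and Endurance covariance of about -16.69 corresponds to a correlation of only about -0.16.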
kable(S_BB, caption = "Covariance Matrix of B") %>% kable_styling()
| | Goals_Scored | Assists | Accuracy | Stamina |
---|---|---|---|---|
Goals_Scored | 14.3457510 | -0.5608857 | 1.857237 | -1.192914 |
Assists | -0.5608857 | 8.0233510 | -3.653465 | -3.415176 |
Accuracy | 1.8572367 | -3.6534653 | 106.331351 | -9.767788 |
Stamina | -1.1929143 | -3.4151755 | -9.767788 | 90.113241 |
kable(S_AB, caption = "Cross-Covariance Matrix") %>% kable_styling()
| | Goals_Scored | Assists | Accuracy | Stamina |
---|---|---|---|---|
Speed | -1.123314 | 0.0843102 | -0.4653633 | -1.872237 |
Endurance | -4.316645 | -1.5548408 | 8.4095265 | -4.541886 |
Strength | -1.829812 | -6.3312163 | -3.3772980 | 7.941771 |
Flexibility | 5.573812 | 1.3190122 | -4.0555184 | 9.279535 |
6.2 Interpreting the Cross-Covariance Matrix (S_AB)
The cross-covariance matrix S_AB measures how the variables in dataset A (physical attributes) relate to the variables in dataset B (performance metrics).
6.2.1 What It Tells Us
Strength of Relationships Between A and B
- Each value in the matrix is the covariance between one variable in A and one variable in B, showing how the two vary together.
- A high positive value means that as a physical attribute increases, the performance metric also tends to increase.
- A high negative value means that as a physical attribute increases, the performance metric tends to decrease.
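To make the definition concrete, the cross-covariance matrix can be computed by hand from the centred data and compared with cov(). This is a sketch on simulated data; the seed, the values, and the names A_c, B_c and S_AB_manual are assumptions for illustration.

```r
set.seed(7)  # assumption: arbitrary seed for a reproducible illustration
n <- 40
A <- cbind(Speed = rnorm(n, 7, 1.4), Strength = rnorm(n, 77, 12))
B <- cbind(Goals_Scored = rnorm(n, 10, 4), Stamina = rnorm(n, 77, 9))

# Centre both sets
A_c <- scale(A, center = TRUE, scale = FALSE)
B_c <- scale(B, center = TRUE, scale = FALSE)

# Sample cross-covariance: t(A_c) %*% B_c / (n - 1)
S_AB_manual <- crossprod(A_c, B_c) / (n - 1)

# Built-in equivalent: rows index A's variables, columns index B's
S_AB <- cov(A, B)
all.equal(S_AB_manual, S_AB, check.attributes = FALSE)

# S_BA is simply the transpose of S_AB
all.equal(t(S_AB), cov(B, A))
```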
6.2.2 Example Interpretations
- If Speed and Goals Scored have a high positive covariance, faster athletes tend to score more goals.
- If Strength and Accuracy have a low or negative covariance, being stronger does not necessarily improve shooting accuracy.
6.2.3 Why It Matters for CCA
The goal of Canonical Correlation Analysis (CCA) is to find the linear combinations of the variables in A and in B that are maximally correlated with each other. The cross-covariance matrix shows which pairs of variables are most strongly linked, and it is one of the inputs used to form these combinations. Computing S_AB is therefore a key step in understanding how physical ability translates into performance outcomes. It lays the foundation for finding the strongest overall relationships using CCA.
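One standard way to see how the four covariance matrices feed into CCA is that the squared canonical correlations are the eigenvalues of S_AA^-1 S_AB S_BB^-1 S_BA. The sketch below checks this against cancor() on simulated data; the data, the seed, and the names M and rho_manual are assumptions, and this is an illustration of the textbook formula, not necessarily how cancor() computes its result internally.

```r
set.seed(123)  # assumption: arbitrary seed for a reproducible illustration
n <- 100

# Simulated sets: B is built partly from A so that some correlation exists
A <- matrix(rnorm(n * 4), n, 4)
B <- 0.5 * A[, c(2, 1, 4, 3)] + matrix(rnorm(n * 4), n, 4)

# Covariance blocks, named as in the section
S_AA <- cov(A)
S_BB <- cov(B)
S_AB <- cov(A, B)
S_BA <- t(S_AB)

# Eigenvalues of this product are the squared canonical correlations
M <- solve(S_AA) %*% S_AB %*% solve(S_BB) %*% S_BA
rho_manual <- sqrt(Re(eigen(M)$values))

# Compare with the built-in routine
rho_cancor <- cancor(A, B)$cor
all.equal(rho_manual, rho_cancor, tolerance = 1e-6)
```

Re() is used only as a safeguard, since M is not symmetric and eigen() is a general-purpose solver; the eigenvalues are real in theory.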
7 Visualising the Covariance Matrices
7.1 Using heatmaps
library(ggcorrplot)
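# Plot covariance matrix for Physical Attributes (A)
# Note: ggcorrplot is designed for correlation matrices; covariances outside [-1, 1]
# may not map onto its default colour scale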
ggcorrplot(S_AA,
method = "circle",
type = "lower",
lab = TRUE,
colors = c("#D73027", "white", "#1A9850"),
title = "Covariance Matrix: Physical Attributes")
# Plot covariance matrix for Performance Metrics (B)
ggcorrplot(S_BB,
method = "circle",
type = "lower",
lab = TRUE,
colors = c("#D73027", "white", "#1A9850"),
title = "Covariance Matrix: Performance Metrics")
# Plot covariance between A and B (Cross-Covariance Matrix)
ggcorrplot(S_AB,
method = "circle",
lab = TRUE,
colors = c("#D73027", "white", "#1A9850"),
title = "Cross-Covariance Matrix: A vs. B")
7.2 Using scatter plots
library(GGally)
# Scatterplot for physical attributes
GGally::ggpairs(A, title = "Relationships Between Physical Attributes")
# Scatterplot for performance metrics
GGally::ggpairs(B, title = "Relationships Between Performance Metrics")
# Scatterplot between A and B
GGally::ggpairs(cbind(A, B), title = "Relationships Between Physical & Performance Metrics")
8 Compute Canonical Correlation
8.1 Computing canonical correlation in R
Now, we can calculate the canonical correlations between our two datasets.
cca_result <- cancor(A, B)
canonical_correlations <- cca_result$cor
kable(data.frame(Canonical_Correlations = canonical_correlations),
caption = "Canonical Correlations") %>% kable_styling()
Canonical_Correlations |
---|
0.3695260 |
0.2164503 |
0.0975520 |
0.0241750 |
8.2 Interpreting the output
Each value represents the strength of the relationship between a pair of linear combinations of A (physical attributes) and B (performance metrics).
First Canonical Correlation (0.37)
- This is the strongest relationship found between the two datasets.
- At 0.37 it is modest, suggesting that some combination of physical attributes is associated with a combination of performance metrics.
Second Canonical Correlation (0.22)
- This is weaker than the first and indicates only a weak relationship.
Third Canonical Correlation (0.10)
- This is quite weak, suggesting minimal correlation between this pair of canonical variables.
Fourth Canonical Correlation (0.02)
- This is very close to zero, indicating almost no meaningful relationship.
8.3 Number of canonical correlations produced
- Note that the maximum number of canonical correlations equals the smaller of the number of variables in A and the number of variables in B.
- In this case, A has 4 variables, and B has 4 variables.
- This means we can have at most 4 canonical correlations.
- However, only the first one or two tend to be meaningful, as the later ones often capture weak or random relationships.
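The rule about the number of canonical correlations can be checked with a short sketch. It uses simulated data; the seed and the names B2 and B4 are assumptions for illustration.

```r
set.seed(2024)  # assumption: arbitrary seed for a reproducible illustration
n <- 60
A  <- matrix(rnorm(n * 4), n, 4)  # four variables
B2 <- matrix(rnorm(n * 2), n, 2)  # two variables
B4 <- matrix(rnorm(n * 4), n, 4)  # four variables

# Number of canonical correlations for each pairing
n_cor_4_vs_2 <- length(cancor(A, B2)$cor)
n_cor_4_vs_4 <- length(cancor(A, B4)$cor)

# Each count equals the smaller of the two variable counts
n_cor_4_vs_2 == min(ncol(A), ncol(B2))
n_cor_4_vs_4 == min(ncol(A), ncol(B4))
```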
9 Compute Canonical Coefficients
canonical_weights_A <- cca_result$xcoef
canonical_weights_B <- cca_result$ycoef
kable(canonical_weights_A, caption = "Canonical Weights for A") %>% kable_styling()
Speed | -0.0627604 | 0.0388275 | -0.0589409 | 0.0441739 |
Endurance | -0.0062280 | -0.0040666 | 0.0100981 | 0.0102400 |
Strength | -0.0023771 | -0.0112968 | -0.0038385 | 0.0008839 |
Flexibility | 0.0107944 | -0.0002215 | -0.0042680 | 0.0157773 |
kable(canonical_weights_B, caption = "Canonical Weights for B") %>% kable_styling()
Goals_Scored | 0.0319353 | -0.0009409 | 0.0026402 | -0.0200758 |
Assists | 0.0165542 | 0.0442138 | 0.0044469 | 0.0198735 |
Accuracy | -0.0007733 | 0.0004309 | 0.0139553 | 0.0015648 |
Stamina | 0.0077856 | -0.0057752 | 0.0010788 | 0.0117718 |
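The canonical weights turn each athlete's centred measurements into scores on the canonical variates, and the correlation between matching pairs of variates is the canonical correlation. A minimal sketch on simulated data follows; the data, the seed, and the names U, V and pair_cor are assumptions for illustration.

```r
set.seed(99)  # assumption: arbitrary seed for a reproducible illustration
n <- 80
A <- matrix(rnorm(n * 4), n, 4)
B <- 0.6 * A + matrix(rnorm(n * 4), n, 4)

cca <- cancor(A, B)

# Centre each set with the means that cancor() reports
A_c <- sweep(A, 2, cca$xcenter)
B_c <- sweep(B, 2, cca$ycenter)

# Canonical variates: one column of scores per canonical dimension
U <- A_c %*% cca$xcoef
V <- B_c %*% cca$ycoef

# Correlation between the i-th variate of A and the i-th variate of B
pair_cor <- diag(cor(U, V))
all.equal(pair_cor, cca$cor, tolerance = 1e-6)

# Variates within one set are uncorrelated with each other
round(cor(U), 3)
```

Each column of xcoef and ycoef holds the weights for one canonical dimension, so the first columns of U and V correspond to the first canonical correlation.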
10 Interpretation
- The first canonical correlation shows the strongest relationship between physical attributes and performance.
- The canonical weights help determine which variables contribute the most, although raw weights depend on each variable's units and should be compared with care.
- If correlations are strong, training based on these insights could improve performance.
11 Conclusion
In this section we:
- demonstrated Canonical Correlation Analysis (CCA) step-by-step.
- explored the data to ensure CCA was appropriate.
- calculated canonical correlations and weights to understand relationships between physical attributes and performance.