Unsupervised Learning Techniques

B1705, Week Ten

1 Introduction

1.1 What is unsupervised learning?

  • Unsupervised learning techniques analyse data without predefined labels or outcomes.

  • They reveal hidden patterns and structures.

1.2 Uses of unsupervised learning

2 Clustering Techniques

2.1 What is clustering?

  • Clustering groups similar data points based on inherent characteristics.

  • Traditional clustering methods rely on predefined assumptions about the data, such as the shape of the clusters.

  • Unsupervised learning offers more advanced techniques that adapt to complex, high-dimensional data, improving cluster accuracy and interpretability.

2.2 Three methods

2.3 Gaussian Mixture Models (GMM)

  • Gaussian Mixture Models assume that data originates from multiple Gaussian (normal) distributions, each with distinct characteristics.

  • Unlike K-means, which assigns each point to a single cluster, GMM provides a probabilistic classification, allowing data points to belong to multiple clusters with varying probabilities.

Example


Libraries required

rm(list=ls())

library(mclust)
library(ggplot2)
library(dplyr)

1 Dataset creation


Code
set.seed(42)

# Define n players
num_players <- 300  

# Generate synthetic data with more overlap between groups
# Attackers: High shots, high goals, moderate assists (increased SD)
attackers <- data.frame(
  Shots = round(rnorm(num_players / 3, mean = 8, sd = 2.5),0),
  Goals = round(rnorm(num_players / 3, mean = 5, sd = 2.5),0),
  Assists = round(rnorm(num_players / 3, mean = 3, sd = 2.5),0),
  Position = "Attacker"
)

# Defenders: Low shots, low goals, high assists (increased SD)
defenders <- data.frame(
  Shots = rnorm(num_players / 3, mean = 3, sd = 2.5),
  Goals = rnorm(num_players / 3, mean = 1, sd = 2.5),
  Assists = rnorm(num_players / 3, mean = 5, sd = 2.5),
  Position = "Defender"
)

# Goalkeepers: Very low shots, very low goals, low assists (increased SD)
goalkeepers <- data.frame(
  Shots = round(rnorm(num_players / 3, mean = 1, sd = 2),0),
  Goals = round(rnorm(num_players / 3, mean = 0, sd = 1.5),0),
  Assists = round(rnorm(num_players / 3, mean = 1, sd = 2),0),
  Position = "Goalkeeper"
)

# Combine into one dataset
water_polo_data <- bind_rows(attackers, defenders, goalkeepers)

rm(attackers)
rm(defenders)
rm(goalkeepers)

# View first few rows
head(water_polo_data)
  Shots Goals Assists Position
1    11     8      -2 Attacker
2     7     8       4 Attacker
3     9     2       6 Attacker
4    10    10       8 Attacker
5     9     3       0 Attacker
6     8     5       0 Attacker

2 Fit GMM


Code
# Fit Gaussian Mixture Model
gmm_model <- Mclust(water_polo_data[, c("Shots", "Goals", "Assists")])

# Print model summary
summary(gmm_model)
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VII (spherical, varying volume) model with 3 components: 

 log-likelihood   n df       BIC       ICL
       -2206.83 300 14 -4493.513 -4576.572

Clustering table:
  1   2   3 
106 112  82 

3 Add cluster labels to the dataset


Code
water_polo_data$Cluster <- as.factor(gmm_model$classification)

# View assigned clusters
table(water_polo_data$Cluster)

  1   2   3 
106 112  82 

4 Visualise clustering


Code
ggplot(water_polo_data, aes(x = Shots, y = Goals, color = Cluster)) +
  geom_point(size = 3, alpha = 0.7) +  # Make overlapping points semi-transparent
  labs(title = "Gaussian Mixture Model Clustering in Water Polo",
       x = "Shots Attempted",
       y = "Goals Scored") +
  theme_minimal()

5 Compare clusters to positions


Code
# Compare model clusters to actual positions
table(water_polo_data$Cluster, water_polo_data$Position)
   
    Attacker Defender Goalkeeper
  1       94       12          0
  2        4       19         89
  3        2       69         11

6 Analysing probabilities of cluster assignments


Unlike K-means, GMM gives soft assignments: each player belongs to clusters with certain probabilities.

Code
# Get probability matrix of cluster membership
head(gmm_model$z)
          [,1]         [,2]         [,3]
[1,] 0.9999996 1.163294e-10 3.790314e-07
[2,] 0.9985202 1.269046e-07 1.479720e-03
[3,] 0.9471837 3.412050e-06 5.281287e-02
[4,] 0.9999848 2.084930e-14 1.520399e-05
[5,] 0.9982464 6.070962e-05 1.692916e-03
[6,] 0.9988918 3.110449e-05 1.077056e-03
  • Each row shows the probability of belonging to each cluster.
  • Players with mixed roles (e.g., some defenders take many shots) will have non-binary probabilities.
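
To see which players sit between clusters, we can use the model's uncertainty values (in mclust, gmm_model$uncertainty is 1 minus each observation's largest membership probability) - a short, illustrative check:

Code
# Attach per-player uncertainty and list the most ambiguous profiles
water_polo_data$Uncertainty <- gmm_model$uncertainty

head(water_polo_data[order(-water_polo_data$Uncertainty),
                     c("Shots", "Goals", "Assists", "Position", "Cluster", "Uncertainty")])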

7 Advantages of GMM


  • GMM handles overlapping clusters well, as it assigns probabilities to each data point’s membership of the different clusters.

  • It’s more flexible than K-means, as it does not assume spherical clusters.

  • It provides probabilistic insights useful for decision-making - this is more subtle than simply allocating cluster membership on a yes/no basis.

2.4 Spectral Clustering

  • Spectral clustering uses similarity matrices and eigenvectors to partition data.

  • Unlike traditional methods, it captures complex relationships, making it ideal for networked data (like football passing patterns).

  • For example, it could help identify strategic subgroups within teams that operate as cohesive units.

Example

Libraries required


Code
rm(list=ls())

library(tidyverse)
library(igraph)
library(kernlab)
library(cluster)
library(factoextra)
library(ggplot2)
library(RColorBrewer)
library(FNN)
library(mlbench)
set.seed(42)

1 Introduction


  • Spectral clustering is useful for non-convex clusters.
  • It uses the eigenvalues of a similarity matrix.
  • We’ll cluster basketball players based on performance.

2 Generating dataset


Code
n_players <- 25

cluster1 <- tibble(
  Points_per_Game = rnorm(n_players/3, mean = 12, sd = 3.5),
  Assists_per_Game = rnorm(n_players/3, mean = 4.5, sd = 1.8),
  Rebounds_per_Game = rnorm(n_players/3, mean = 7, sd = 2.5),
  Steals_per_Game = rnorm(n_players/3, mean = 1.4, sd = 0.4),
  Blocks_per_Game = rnorm(n_players/3, mean = 1, sd = 0.3)
)

cluster2 <- tibble(
  Points_per_Game = rnorm(n_players/3, mean = 15.5, sd = 3.5),
  Assists_per_Game = rnorm(n_players/3, mean = 6.5, sd = 1.8),
  Rebounds_per_Game = rnorm(n_players/3, mean = 9, sd = 2.5),
  Steals_per_Game = rnorm(n_players/3, mean = 1.7, sd = 0.4),
  Blocks_per_Game = rnorm(n_players/3, mean = 1.2, sd = 0.3)
)

cluster3 <- tibble(
  Points_per_Game = rnorm(n_players/3, mean = 19, sd = 3.5),
  Assists_per_Game = rnorm(n_players/3, mean = 8, sd = 1.8),
  Rebounds_per_Game = rnorm(n_players/3, mean = 10, sd = 2.5),
  Steals_per_Game = rnorm(n_players/3, mean = 2.0, sd = 0.4),
  Blocks_per_Game = rnorm(n_players/3, mean = 1.4, sd = 0.3)
)

df <- bind_rows(cluster1, cluster2, cluster3) %>% mutate(Player_ID = row_number())
head(df)
# A tibble: 6 × 6
  Points_per_Game Assists_per_Game Rebounds_per_Game Steals_per_Game
            <dbl>            <dbl>             <dbl>           <dbl>
1            16.8             8.13             6.29            2.16 
2            10.0             4.39             0.359           1.23 
3            13.3             6.85             0.899           1.30 
4            14.2             8.62            10.3             0.695
5            13.4             2.00             6.23            1.58 
6            11.6             4.00             2.55            1.14 
# ℹ 2 more variables: Blocks_per_Game <dbl>, Player_ID <int>

3 Data visualisation - raw data


Code
ggplot(df, aes(x = Points_per_Game, y = Assists_per_Game)) +
  geom_point(alpha = 0.7) +
  labs(title = "Basketball Player Stats", x = "Points per Game", y = "Assists per Game")

4 Constructing the similarity graph


Code
X <- df %>% select(-Player_ID) %>% mutate(across(everything(), as.numeric)) %>% scale()


# Compute k-nearest neighbors adjacency matrix
k <- min(10, nrow(X) - 1)  # Ensure k is valid
knn <- get.knn(X, k=k)

adjacency_matrix <- matrix(0, nrow=nrow(X), ncol=nrow(X))
for (i in 1:nrow(X)) {
  adjacency_matrix[i, knn$nn.index[i, ]] <- 1
  adjacency_matrix[knn$nn.index[i, ], i] <- 1  # Ensure symmetry
}

graph <- graph_from_adjacency_matrix(adjacency_matrix, mode = "undirected", diag = FALSE)

# Improved visualisation
plot(
  graph, 
  vertex.size=10,  # Adjust node size
  vertex.label=NA, 
  edge.width=1.5,  # Moderate edge width
  main="Enhanced KNN Graph of Players", 
  layout=layout_with_kk(graph),  # Kamada-Kawai layout - better spacing
  edge.color="darkgrey",  # Darken edges for clarity
  vertex.color="steelblue"  # Improve node visibility
)

5 Explanation


  • Each node represents a player, and edges indicate similarity based on performance metrics. Players connected by edges have similar stats.

  • The graph captures local relationships between players using a k-nearest neighbors (KNN) approach, showing natural groupings based on shared characteristics.

  • The structure of the graph helps define clusters in a way that traditional distance-based methods (like k-means) might miss, making it useful for identifying non-convex or irregular clusters.

6 Spectral Clustering - implementation


Code
# Compute similarity matrix
similarity_matrix <- exp(-as.matrix(dist(X))^2 / (2 * sd(as.matrix(dist(X)))^2))

# Compute normalised Laplacian matrix
d <- rowSums(similarity_matrix)
D <- diag(d)
L <- D - similarity_matrix
D_inv_sqrt <- diag(1 / sqrt(d))
L_norm <- D_inv_sqrt %*% L %*% D_inv_sqrt

eigen_decomp <- eigen(L_norm, symmetric = TRUE)

# Use the eigenvectors associated with the smallest eigenvalues of the
# normalised Laplacian (eigen() sorts eigenvalues in decreasing order,
# so these are the last columns)
n_obs <- nrow(L_norm)
X_transformed <- as.data.frame(eigen_decomp$vectors[, (n_obs - 2):n_obs])

# K-Means clustering on eigenvectors
kmeans_res <- kmeans(X_transformed, centers = 3, nstart = 10)
df$Cluster <- as.factor(kmeans_res$cluster)

7 Clustering results


Code
ggplot(df, aes(x = Points_per_Game, y = Assists_per_Game, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "Spectral Clustering of Players", x = "Points per Game", y = "Assists per Game") +
  scale_color_brewer(palette = "Set1")

8 Comparison with k-means


  • K-Means assumes clusters are spherical and fails on complex shapes.
  • Spectral Clustering finds structure using a similarity graph.
  • The graph-based approach allows for non-convex clusters.
Code
# Generate a synthetic two-class dataset (mlbench.twonorm produces two
# overlapping Gaussian clouds; mlbench.spirals would give a clearly non-convex shape)
df_moons <- as.data.frame(mlbench.twonorm(100, d = 2)$x)
names(df_moons) <- c("x", "y")

# Apply K-Means
kmeans_res <- kmeans(df_moons, centers = 2, nstart = 10)
df_moons$kmeans_cluster <- as.factor(kmeans_res$cluster)

# Spectral Clustering
dist_mat <- as.matrix(dist(df_moons))
similarity_matrix <- exp(-dist_mat^2 / (2 * sd(dist_mat)^2))
d <- rowSums(similarity_matrix)
D <- diag(d)
L <- D - similarity_matrix
D_inv_sqrt <- diag(1 / sqrt(d))
L_norm <- D_inv_sqrt %*% L %*% D_inv_sqrt
eigen_decomp <- eigen(L_norm, symmetric = TRUE)
# Eigenvectors for the two smallest eigenvalues (last columns from eigen())
n_obs <- nrow(L_norm)
spectral_features <- as.data.frame(eigen_decomp$vectors[, (n_obs - 1):n_obs])
spectral_kmeans <- kmeans(spectral_features, centers = 2, nstart = 10)
df_moons$spectral_cluster <- as.factor(spectral_kmeans$cluster)

# Vis
p1 <- ggplot(df_moons, aes(x = x, y = y, color = kmeans_cluster)) +
  geom_point(size = 3) +
  ggtitle("K-Means Clustering") +
  theme_minimal()

p2 <- ggplot(df_moons, aes(x = x, y = y, color = spectral_cluster)) +
  geom_point(size = 3) +
  ggtitle("Spectral Clustering") +
  theme_minimal()

library(patchwork)
p1 + p2

9 How Spectral Clustering works


  • Data points are treated as nodes, and edges represent similarities.

  • A (normalised) Laplacian matrix is then constructed from the similarities between nodes.

  • By analysing the eigenvectors of this matrix, the data is projected into a lower-dimensional space where clustering becomes easier.

  • Finally, standard clustering methods (like K-means) are applied in this transformed space.

2.5 DBSCAN and OPTICS

Density-Based Spatial Clustering of Applications with Noise identifies clusters based on data density rather than relying purely on distances between data points.

These approaches are effective in:

  • Identifying arbitrarily shaped clusters, including irregular or elongated patterns that traditional clustering algorithms, like k-means, often miss.

  • Handling noise and outliers effectively by classifying points in low-density regions as noise, ensuring robustness even when data includes irrelevant or anomalous points.

  • Detecting clusters without needing prior knowledge of the number of clusters, making them highly flexible and suitable for exploratory data analysis.

1 How DBSCAN Works


  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) finds clusters based on density rather than distances.
  • It identifies core points, border points, and outliers:
    • Core Points: Points with at least minPts neighbors within eps distance.
    • Border Points: Close to a core point but have fewer than minPts neighbors.
    • Noise: Points that don’t belong to any cluster.

2 DBSCAN in Action


Code
rm(list=ls())

library(dbscan)
library(ggplot2)   # for plotting
library(tibble)    # for tibble()

# Create synthetic dataset with two distinct clusters and noise
df_dbscan_demo <- tibble(
  x = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 6, sd = 0.5), runif(20, min = 0, max = 8)),
  y = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 6, sd = 0.5), runif(20, min = 0, max = 8))
)

# Apply DBSCAN
dbscan_res_demo <- dbscan(df_dbscan_demo, eps = 0.8, minPts = 5)
df_dbscan_demo$cluster <- as.factor(dbscan_res_demo$cluster)

# Visualising DBSCAN clustering

ggplot(df_dbscan_demo, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 3) +
  ggtitle("DBSCAN Clustering") +
  theme_minimal()
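
In the dbscan package, noise points are labelled as cluster 0, so we can count them directly (a quick check on the demo data above):

Code
# Cluster sizes: cluster 0 holds the points classified as noise
table(dbscan_res_demo$cluster)
sum(dbscan_res_demo$cluster == 0)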

3 Comparing Clustering Methods


  • K-Means: Assumes clusters are spherical and evenly sized.
  • Spectral Clustering: Uses graph-based methods to find non-convex clusters.
  • DBSCAN: Density-based clustering that can find arbitrarily shaped clusters.
  • OPTICS: An improved version of DBSCAN that handles varying densities.

4 Libraries required


Code
library(tidyverse)
library(igraph)
library(kernlab)
library(cluster)
library(factoextra)
library(ggplot2)
library(RColorBrewer)
library(FNN) # For k-nearest neighbors
library(mlbench) # For synthetic datasets
library(dbscan) # For DBSCAN and OPTICS
set.seed(42)

5 Generating dataset


Code
df_clusters <- as.data.frame(mlbench.spirals(200, cycles = 1, sd = 0.1)$x)
names(df_clusters) <- c("x", "y")

6 K-Means Clustering


Code
kmeans_res <- kmeans(df_clusters, centers = 3, nstart = 10)
df_clusters$kmeans_cluster <- as.factor(kmeans_res$cluster)

ggplot(df_clusters, aes(x = x, y = y, color = kmeans_cluster)) +
  geom_point(size = 3) +
  ggtitle("K-Means Clustering") +
  theme_minimal()

7 Spectral Clustering


Code
dist_mat <- as.matrix(dist(df_clusters))
similarity_matrix <- exp(-dist_mat^2 / (2 * sd(dist_mat)^2))
d <- rowSums(similarity_matrix)
D <- diag(d)
L <- D - similarity_matrix
D_inv_sqrt <- diag(1 / sqrt(d))
L_norm <- D_inv_sqrt %*% L %*% D_inv_sqrt
eigen_decomp <- eigen(L_norm, symmetric = TRUE)
# Eigenvectors for the three smallest eigenvalues (last columns from eigen())
n_obs <- nrow(L_norm)
spectral_features <- as.data.frame(eigen_decomp$vectors[, (n_obs - 2):n_obs])
spectral_kmeans <- kmeans(spectral_features, centers = 3, nstart = 10)
df_clusters$spectral_cluster <- as.factor(spectral_kmeans$cluster)

ggplot(df_clusters, aes(x = x, y = y, color = spectral_cluster)) +
  geom_point(size = 3) +
  ggtitle("Spectral Clustering") +
  theme_minimal()

8 DBSCAN Clustering


Code
dbscan_res <- dbscan(df_clusters[, c("x", "y")], eps = 0.15, minPts = 5)
df_clusters$dbscan_cluster <- as.factor(dbscan_res$cluster)

ggplot(df_clusters, aes(x = x, y = y, color = dbscan_cluster)) +
  geom_point(size = 3) +
  ggtitle("DBSCAN Clustering") +
  theme_minimal()

9 Introducing OPTICS


  • OPTICS (Ordering Points To Identify the Clustering Structure) extends DBSCAN by ordering points according to their density reachability rather than assigning them to clusters directly; this ordering captures clustering structure at multiple density levels.

  • OPTICS therefore offers more flexibility than DBSCAN, particularly when we are dealing with datasets containing clusters of varying densities or complex, nested groupings.

10 Example


Code
optics_res <- optics(df_clusters[, c("x", "y")], eps = 0.2, minPts = 5)
df_clusters$optics_cluster <- as.factor(extractDBSCAN(optics_res, eps_cl = 0.15)$cluster)

ggplot(df_clusters, aes(x = x, y = y, color = optics_cluster)) +
  geom_point(size = 3) +
  ggtitle("OPTICS Clustering") +
  theme_minimal()

11 OPTICS Reachability Plot


  • A key feature of OPTICS is the reachability plot (a visual representation of the clustering structure) that helps us intuitively identify the optimal cluster configuration.
  • The reachability plot shows the structure of clusters.
  • Valleys represent dense regions (clusters), while peaks indicate sparser areas.
Code
plot(optics_res, main="OPTICS Reachability Plot")

12 Summary


Code
library(patchwork)

p1 <- ggplot(df_clusters, aes(x = x, y = y, color = kmeans_cluster)) +
  geom_point(size = 3) +
  ggtitle("K-Means") +
  theme_minimal()

p2 <- ggplot(df_clusters, aes(x = x, y = y, color = spectral_cluster)) +
  geom_point(size = 3) +
  ggtitle("Spectral Clustering") +
  theme_minimal()

p3 <- ggplot(df_clusters, aes(x = x, y = y, color = dbscan_cluster)) +
  geom_point(size = 3) +
  ggtitle("DBSCAN") +
  theme_minimal()

p4 <- ggplot(df_clusters, aes(x = x, y = y, color = optics_cluster)) +
  geom_point(size = 3) +
  ggtitle("OPTICS") +
  theme_minimal()

p1 + p2 + p3 + p4

2.6 Conclusion

  • K-Means is fast but struggles with non-spherical clusters.
  • Spectral Clustering handles complex shapes but requires choosing parameters carefully.
  • DBSCAN is great for arbitrary shapes but depends on eps tuning.
  • OPTICS improves on DBSCAN by handling varying density clusters.

2.7 Validating clusters

  • These models are examples of unsupervised machine learning; there are no predefined labels against which to test them.

  • Therefore, model validation is essential to help us assess the quality and performance of our model.

Metrics used to do this include:

  • Silhouette Score: measures how well each point fits its own cluster relative to neighbouring clusters - the higher the better.

  • Davies-Bouldin Index: compares within-cluster scatter to between-cluster separation - the lower the better.

  • Domain-Specific Comparisons: cross-checks results with subject knowledge to enhance reliability.
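
A minimal sketch of how the first two metrics could be computed in R, using the k-means result on the spiral data from the previous sections (cluster::silhouette was loaded earlier; the Davies-Bouldin line assumes the clusterSim package, which is not loaded above, so it is left commented out):

Code
# Average silhouette width for the k-means solution (higher is better)
sil <- silhouette(as.integer(df_clusters$kmeans_cluster),
                  dist(df_clusters[, c("x", "y")]))
mean(sil[, "sil_width"])

# Davies-Bouldin index (lower is better) - assumes clusterSim is installed
# library(clusterSim)
# index.DB(df_clusters[, c("x", "y")], as.integer(df_clusters$kmeans_cluster))$DB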

3 Dimensionality Reduction

3.1 Introduction

  • “Dimensionality reduction” means transforming high-dimensional data into a lower-dimensional space while preserving important patterns and structures.

  • Helps improve computational efficiency, reduce noise, and enhance visualisation.

3.2 Techniques

Commonly used techniques include PCA (Principal Component Analysis), t-SNE (t-Distributed Stochastic Neighbor Embedding), manifold learning, and neural-network-based approaches such as autoencoders.

3.3 Principal Component Analysis (PCA)

  • Principal Component Analysis (PCA) is a technique for reducing dimensionality.
  • PCA transforms correlated variables into uncorrelated principal components.
  • Useful for data visualisation, noise reduction, and feature selection.

1 Libraries

rm(list=ls())

library(tidyverse)
library(factoextra)
library(ggplot2)
set.seed(123)

2 Generating dataset

Code
# Simulated dataset with correlated variables
df_pca <- tibble(
  Feature1 = rnorm(100, mean = 5, sd = 2),
  Feature2 = Feature1 * 0.8 + rnorm(100, mean = 0, sd = 0.5),
  Feature3 = rnorm(100, mean = 3, sd = 1)
)

3 Performing PCA

Code
# Perform PCA with scaling
pca_res <- prcomp(df_pca, scale. = TRUE)
  • PCA transforms the original variables into new principal components (PCs).
  • The first few PCs capture the most variation in the data.

4 Scree Plot: variance explained

Code
# Scree plot to visualise variance explained by each PC
fviz_eig(pca_res, addlabels = TRUE, ggtheme = theme_minimal())
  • The scree plot shows the proportion of variance explained by each PC.
  • The elbow method helps decide how many components to retain.

5 PCA Biplot

Code
# PCA biplot to visualise principal components
fviz_pca_biplot(pca_res, repel = TRUE, ggtheme = theme_minimal())
  • Points represent observations, and arrows represent features.
  • The direction of an arrow indicates the contribution of a feature to the PCs.
  • Closer points indicate greater similarity.

6 PCA Component Loadings

Code
# Loadings (contribution of original variables to each PC)
print(pca_res$rotation)
                PC1        PC2         PC3
Feature1  0.6970177 -0.1150203  0.70776878
Feature2  0.6957155 -0.1305247 -0.70635919
Feature3 -0.1736269 -0.9847505  0.01095666
  • Loadings show how much each original feature contributes to each principal component.
  • Large values indicate a strong influence of the feature on the component.

7 Interpreting PCA Results

  • PCA simplifies high-dimensional data while preserving variance.
  • The first few PCs capture most of the data’s structure.
  • Useful for:
    • Data visualisation (reducing dimensions to 2D or 3D).
    • Feature selection (removing redundant variables).
    • Noise reduction (focusing on key patterns).
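
To back up the point about the first few PCs, the proportion of variance explained can be read directly from the fitted prcomp object:

Code
# Variance explained by each principal component
summary(pca_res)

# Equivalent manual calculation from the component standard deviations
pca_res$sdev^2 / sum(pca_res$sdev^2)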

8 Conclusion

  • PCA is a powerful tool for understanding complex datasets.
  • Helps in dimensionality reduction while retaining important information.
  • Consider PCA before applying machine learning models to improve efficiency.

3.4 t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • t-SNE (t-Distributed Stochastic Neighbor Embedding) is a technique for dimensionality reduction.
  • Unlike PCA, t-SNE is non-linear and excels at visualising high-dimensional data.
  • t-SNE helps preserve local structure in complex datasets.

1 Example

rm(list=ls())

library(tidyverse)
library(Rtsne)
library(ggplot2)
library(factoextra)
set.seed(123)

2 Generating basketball dataset

Code
# Simulated basketball player stats dataset
df_sports <- tibble(
  Points_per_Game = rnorm(100, mean = 15, sd = 5),
  Assists_per_Game = rnorm(100, mean = 5, sd = 2),
  Rebounds_per_Game = rnorm(100, mean = 7, sd = 3),
  Steals_per_Game = rnorm(100, mean = 1.5, sd = 0.5),
  Blocks_per_Game = rnorm(100, mean = 1, sd = 0.5)
)

3 Performing t-SNE

Code
# Run t-SNE on the dataset
tsne_res <- Rtsne(df_sports, perplexity = 30, theta = 0.5, check_duplicates = FALSE)
tsne_df <- as_tibble(tsne_res$Y) %>% rename(Dim1 = V1, Dim2 = V2)
  • t-SNE reduces the original dataset into two dimensions for visualisation.
  • Perplexity is the key parameter controlling how much the embedding emphasises local versus global structure.

4 t-SNE Visualisation

Code
# Scatter plot of t-SNE result
ggplot(tsne_df, aes(x = Dim1, y = Dim2)) +
  geom_point(color = "blue", size = 3, alpha = 0.7) +
  ggtitle("t-SNE Visualisation of Sports Data") +
  theme_minimal()
  • Each point represents a player, and their positioning reflects similarity.
  • t-SNE tends to cluster similar players together.

5 Comparing PCA vs. t-SNE

Code
# Perform PCA
pca_res <- prcomp(df_sports, scale. = TRUE)
pca_df <- as_tibble(pca_res$x[, 1:2]) %>% rename(Dim1 = PC1, Dim2 = PC2)

# PCA vs. t-SNE plot
p1 <- ggplot(pca_df, aes(x = Dim1, y = Dim2)) +
  geom_point(color = "red", size = 3, alpha = 0.7) +
  ggtitle("PCA Result") +
  theme_minimal()

p2 <- ggplot(tsne_df, aes(x = Dim1, y = Dim2)) +
  geom_point(color = "blue", size = 3, alpha = 0.7) +
  ggtitle("t-SNE Result") +
  theme_minimal()

library(patchwork)
p1 + p2
  • PCA maintains global structure but assumes linear relationships.
  • t-SNE captures local clusters better but does not preserve distances well.

6 When to Use t-SNE

  • Use PCA when you want a quick linear dimensionality reduction.
  • Use t-SNE when your dataset has complex, non-linear relationships.
  • t-SNE is useful for visualising clusters, but doesn’t work well for feature selection.

7 Conclusion

  • t-SNE is a powerful tool for visualising high-dimensional sports data.
  • Unlike PCA, it excels at identifying hidden patterns and clusters.
  • Use both PCA and t-SNE together to get the best insights from your data!

3.5 Manifold Learning

  • Manifold Learning is a set of techniques for non-linear dimensionality reduction.
  • t-SNE is considered a type of manifold learning because it preserves non-linear, local relationships in high-dimensional data (unlike PCA, which is linear).
  • Other common methods include UMAP and Isomap.

1 Libraries required

## Note: dimRed not currently available via CRAN, so...
## Install devtools if you haven't already
# install.packages("devtools")
## Load devtools
# library(devtools)
## Install the archived version of dimRed
# install_url("https://cran.r-project.org/src/contrib/Archive/dimRed/dimRed_0.2.6.tar.gz")


library(tidyverse)
library(Rtsne)
library(ggplot2)
library(factoextra)
library(umap)
library(igraph)
library(dimRed)
set.seed(42)

2 Generating Dataset

Code
# Simulated basketball player stats dataset
df_sports <- tibble(
  Points_per_Game = rnorm(100, mean = 15, sd = 5),
  Assists_per_Game = rnorm(100, mean = 5, sd = 2),
  Rebounds_per_Game = rnorm(100, mean = 7, sd = 3),
  Steals_per_Game = rnorm(100, mean = 1.5, sd = 0.5),
  Blocks_per_Game = rnorm(100, mean = 1, sd = 0.5)
)

3 Performing t-SNE

First, we’ll repeat the application of a t-SNE model to this new data:

Code
tsne_res <- Rtsne(df_sports, perplexity = 30, theta = 0.5, check_duplicates = FALSE)

tsne_df <- as_tibble(tsne_res$Y) %>% rename(Dim1 = V1, Dim2 = V2)
Code
ggplot(tsne_df, aes(x = Dim1, y = Dim2)) +
  geom_point(color = "blue", size = 3, alpha = 0.7) +
  ggtitle("t-SNE Visualisation of Sports Data") +
  theme_minimal()
  • Note that t-SNE excels at local structure preservation.

4 Performing UMAP

Code
umap_res <- umap(df_sports)
umap_df <- as_tibble(umap_res$layout) %>% rename(Dim1 = V1, Dim2 = V2)
Code
ggplot(umap_df, aes(x = Dim1, y = Dim2)) +
  geom_point(color = "green", size = 3, alpha = 0.7) +
  ggtitle("UMAP Visualisation of Sports Data") +
  theme_minimal()
  • UMAP is faster than t-SNE and retains more global structure.

5 Performing Isomap

Code
library(vegan)
library(ggplot2)

# Compute distance matrix
dist_matrix <- dist(df_sports, method = "euclidean")

# Apply Isomap
isomap_res <- isomap(dist_matrix, ndim = 2, k = 5)

# Extract embedded coordinates
isomap_df <- as_tibble(isomap_res$points)

# Rename dimensions
colnames(isomap_df) <- c("Dim1", "Dim2")

# Plot Isomap-transformed data
ggplot(isomap_df, aes(x = Dim1, y = Dim2)) +
  geom_point(alpha = 0.7, color = "blue") +
  labs(title = "Isomap Dimensionality Reduction",
       x = "Dim1",
       y = "Dim2") +
  theme_minimal()
  • Isomap preserves geodesic distances and is useful for manifold structures.

6 Comparing PCA, t-SNE, UMAP, and Isomap

Code
# Perform PCA
pca_res <- prcomp(df_sports, scale. = TRUE)
pca_df <- as_tibble(pca_res$x[, 1:2]) %>% rename(Dim1 = PC1, Dim2 = PC2)

p1 <- ggplot(pca_df, aes(x = Dim1, y = Dim2)) +
  geom_point(color = "purple", size = 3, alpha = 0.7) +
  ggtitle("PCA") +
  theme_minimal()

p2 <- ggplot(tsne_df, aes(x = Dim1, y = Dim2)) +
  geom_point(color = "blue", size = 3, alpha = 0.7) +
  ggtitle("t-SNE") +
  theme_minimal()

p3 <- ggplot(umap_df, aes(x = Dim1, y = Dim2)) +
  geom_point(color = "green", size = 3, alpha = 0.7) +
  ggtitle("UMAP") +
  theme_minimal()

p4 <- ggplot(isomap_df, aes(x = Dim1, y = Dim2)) +
  geom_point(color = "red", size = 3, alpha = 0.7) +
  ggtitle("Isomap") +
  theme_minimal()

library(patchwork)
p1 + p2 + p3 + p4

7 When to use each method

  • PCA: Fast, preserves global structure, good for linear data.
  • t-SNE: Best for small datasets, captures local structure well.
  • UMAP: Fast and effective for large datasets, balances local/global structure.
  • Isomap: Good for manifold-like data, preserves distances.

8 Conclusion

  • Manifold learning methods help uncover hidden structures in data.
  • Use the method that best fits your dataset size and structure.
  • Experiment with different techniques for better insights…

3.6 Autoencoders (Neural Network-Based Reduction)

  • An autoencoder is a type of neural network that learns to compress input data into a lower-dimensional “latent space” (encoding) and then reconstruct the original data (decoding).

  • It’s trained to minimise the difference between the input and its reconstruction, known as the reconstruction error.

Example

  • In this example, an autoencoder is trained on a synthetic dataset.

  • The network learns the typical patterns in the data (such as typical game performance in terms of points, assists, rebounds, etc.) by encoding the input into 3 latent features and then reconstructing the original 5 features.

Example
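
The original code is not reproduced here, so the following is a minimal sketch of such an autoencoder in R using the keras package (an assumption - any neural-network framework would do), mirroring the 5-feature basketball data simulated earlier:

Code
library(keras)
library(tibble)
set.seed(123)

# Simulated player stats: 5 input features, as in the earlier examples
df_sports <- tibble(
  Points_per_Game   = rnorm(100, mean = 15, sd = 5),
  Assists_per_Game  = rnorm(100, mean = 5, sd = 2),
  Rebounds_per_Game = rnorm(100, mean = 7, sd = 3),
  Steals_per_Game   = rnorm(100, mean = 1.5, sd = 0.5),
  Blocks_per_Game   = rnorm(100, mean = 1, sd = 0.5)
)
x <- scale(as.matrix(df_sports))

# Encoder: 5 inputs -> 3 latent features; decoder: 3 latent features -> 5 outputs
input   <- layer_input(shape = 5)
encoded <- input %>% layer_dense(units = 3, activation = "relu")
decoded <- encoded %>% layer_dense(units = 5, activation = "linear")
autoencoder <- keras_model(input, decoded)

# Train to minimise reconstruction (mean squared) error
autoencoder %>% compile(optimizer = "adam", loss = "mse")
autoencoder %>% fit(x, x, epochs = 50, batch_size = 16, verbose = 0)

# Per-player reconstruction error
recon <- predict(autoencoder, x)
recon_error <- rowMeans((x - recon)^2)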


Reconstruction error as a signal

  • The reconstruction error measures how well the autoencoder can recreate the input data.

  • Low Reconstruction Error indicates that the input follows the learned normal patterns.

  • High Reconstruction Error suggests that the input is unusual or anomalous because it deviates from the patterns the autoencoder learned.

Usefulness?

  • Anomaly Detection: Analysts can use high reconstruction errors as flags for unusual events or changes in the data.

  • Data Compression: The latent features provide a compact representation of the data, which can be used for further analysis, visualisation, or as input to other models.

  • Insight Generation: By understanding which patterns the autoencoder fails to reconstruct, analysts can gain insights into rare or extreme events that warrant closer investigation.
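
Continuing the sketch above, high-error players can be flagged as potential anomalies using an arbitrary cut-off (here, the top 5% of reconstruction errors):

Code
# Flag the players the autoencoder reconstructs worst
threshold <- quantile(recon_error, 0.95)
anomalies <- which(recon_error > threshold)
df_sports[anomalies, ]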

4 Association Rule Learning

4.1 Overview

4.2 What is Association Rule Learning?

  • Used to identify patterns and relationships between variables in large datasets.

  • Applied in areas where understanding the likelihood of items or events appearing together is valuable.

4.3 Concept

  • The core idea is to generate rules that describe how certain variables are related.

  • A rule is typically represented as A → B, meaning that if A occurs, then B is likely to occur as well.

  • For example, in retail analysis, a common rule might be “If a customer buys bread, they are likely to also buy butter.”

4.4 Metrics

Three key metrics are used:

  • Support refers to how frequently items appear together in the dataset, calculated as the proportion of transactions containing both A and B.

  • Confidence measures the likelihood that B occurs given A, indicating how often the rule holds true.

  • Lift evaluates how much more likely B is to occur when A is present, compared to its independent occurrence. A lift value greater than 1 suggests a positive correlation, while a value close to 1 indicates little to no association.
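
A small worked example with made-up counts (illustrative numbers only):

Code
n_total <- 100   # total transactions
n_A     <- 40    # transactions containing A (e.g. bread)
n_B     <- 30    # transactions containing B (e.g. butter)
n_AB    <- 20    # transactions containing both A and B

support    <- n_AB / n_total                # 0.20
confidence <- n_AB / n_A                    # 0.50
lift       <- confidence / (n_B / n_total)  # 0.50 / 0.30 ~ 1.67, a positive association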

Example


Libraries required:

rm(list=ls())

library(arules)
library(arulesViz)

1 Create simple dataset


# Define dataset
hockey_transactions <- list(
  c("Player_A", "Pass", "Assist", "Goal"),
  c("Player_B", "Pass", "Goal"),
  c("Player_A", "Player_B", "Shot", "Save"),
  c("Player_C", "Turnover", "Goal"),
  c("Player_A", "Pass", "Assist", "Goal"),
  c("Player_A", "Player_D", "Shot", "Save"),
  c("Player_B", "Pass", "Assist", "Goal"),
  c("Player_B", "Pass", "Goal"),
  c("Player_C", "Turnover", "Goal"),
  c("Player_A", "Pass", "Assist", "Goal")
)

2 Convert to transactions format


Code
# Convert to transactions format
transactions <- as(hockey_transactions, "transactions")

3 Explore dataset


Code
# Inspect dataset
summary(transactions)
transactions as itemMatrix in sparse format with
 10 rows (elements/itemsets/transactions) and
 10 columns (items) and a density of 0.36 

most frequent items:
    Goal     Pass Player_A   Assist Player_B  (Other) 
       8        6        5        4        4        9 

element (itemset/transaction) length distribution:
sizes
3 4 
4 6 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.0     3.0     4.0     3.6     4.0     4.0 

includes extended item information - examples:
  labels
1 Assist
2   Goal
3   Pass
Code
inspect(transactions[1:5])
    items                           
[1] {Assist, Goal, Pass, Player_A}  
[2] {Goal, Pass, Player_B}          
[3] {Player_A, Player_B, Save, Shot}
[4] {Goal, Player_C, Turnover}      
[5] {Assist, Goal, Pass, Player_A}  

4 Run apriori algo


Code
rules <- apriori(transactions, parameter = list(supp = 0.2, conf = 0.6, minlen = 2))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.6    0.1    1 none FALSE            TRUE       5     0.2      2
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 2 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[10 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [42 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
Code
# Check if any rules were found
length(rules)
[1] 42
Code
inspect(sort(rules, by = "lift")[1:10])
     lhs                 rhs        support confidence coverage lift count
[1]  {Player_C}       => {Turnover} 0.2     1          0.2      5.0  2    
[2]  {Turnover}       => {Player_C} 0.2     1          0.2      5.0  2    
[3]  {Save}           => {Shot}     0.2     1          0.2      5.0  2    
[4]  {Shot}           => {Save}     0.2     1          0.2      5.0  2    
[5]  {Goal, Player_C} => {Turnover} 0.2     1          0.2      5.0  2    
[6]  {Goal, Turnover} => {Player_C} 0.2     1          0.2      5.0  2    
[7]  {Player_A, Save} => {Shot}     0.2     1          0.2      5.0  2    
[8]  {Player_A, Shot} => {Save}     0.2     1          0.2      5.0  2    
[9]  {Pass, Player_A} => {Assist}   0.3     1          0.3      2.5  3    
[10] {Goal, Player_A} => {Assist}   0.3     1          0.3      2.5  3    

5 Explanation


Once the Apriori algorithm has been applied to our hockey dataset, we can inspect the resulting rules to identify meaningful patterns between players and actions. The key metrics used to evaluate these rules are:

  • Support: The proportion of transactions where a particular itemset appears. This helps determine how frequently a pattern occurs within the dataset.
  • Confidence: The likelihood that the outcome in the rule occurs when the antecedent (left-hand side) conditions are met. Higher confidence means a stronger predictive relationship.
  • Lift: A measure of how much more likely the outcome is, compared to random chance. A lift value greater than 1 indicates a strong positive association, while a value close to 1 suggests little to no relationship.

6 Visualise the rules


Code
library(arulesViz)

# Plot rules using a graph
plot(rules, method = "graph", control = list(type = "items"))
Available control parameters (with default values):
layout   =  stress
circular     =  FALSE
ggraphdots   =  NULL
edges    =  <environment>
nodes    =  <environment>
nodetext     =  <environment>
colors   =  c("#EE0000FF", "#EEEEEEFF")
engine   =  ggplot2
max  =  100
verbose  =  FALSE

4.5 Algorithms

Association Rule Learning uses a number of different algorithms to detect the relationships (associations) between items in the dataset:

  • Apriori Algorithm - a breadth-first approach that iteratively extends frequent itemsets. It prunes unpromising candidates but requires multiple database scans, making it less efficient for large datasets.

  • FP-Growth Algorithm - builds a Frequent Pattern tree (FP-tree) to compress data, reducing redundant storage. This improves efficiency over Apriori but can struggle with large FP-trees.

  • Eclat Algorithm - uses a depth-first search and stores itemsets as transaction IDs (tidsets), making it highly efficient for dense datasets, but memory-intensive.
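
For comparison with the Apriori run above, the arules package also implements Eclat. A brief sketch on the same hockey transactions object created earlier (rules are induced from the frequent itemsets afterwards):

Code
# Mine frequent itemsets with Eclat
itemsets <- eclat(transactions, parameter = list(supp = 0.2, minlen = 2))

# Induce rules from the itemsets and inspect the strongest associations
eclat_rules <- ruleInduction(itemsets, transactions, confidence = 0.6)
inspect(head(sort(eclat_rules, by = "lift"), 5))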