Cluster Analysis

1 Libraries used in this demonstration

Code
library(ggplot2)
library(cluster)
library(factoextra)
library(dplyr)
# library(dbscan)
library(clusterSim)
library(fpc)
library(pheatmap)
library(GGally)
library(clusterCrit)

2 Partitioning and Hierarchical Models

flowchart TD
 A[Types] --> B[Partitioning]
 A --> C[Hierarchical]

  classDef level1 fill:#add8e6,stroke:#333,stroke-width:1px,font-weight:normal;
  classDef level2 fill:#fffacd,stroke:#333,stroke-width:1px,font-weight:normal;

  class A level1;
  class B,C level2;
  

Partitioning methods, such as k-means, require the number of clusters to be specified upfront.

Hierarchical clustering, in contrast, builds a tree-like structure (dendrogram) that shows how clusters are formed and allows flexibility in choosing the number of clusters.

In R, the commands for creating both partitioning and hierarchical cluster models are fairly simple:
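The snippets below assume a data frame called data with two numeric columns, x and y. The original data frame isn't created in this section; a minimal, hypothetical way to simulate something similar is:

Code
# Hypothetical example data (the original data frame is not shown in this section)
set.seed(123)
data <- data.frame(x = rnorm(100), y = rnorm(100))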

#-------------------------
# K-means clustering (Partitioning)
#-------------------------

# Create model using [x] and [y] features

kmeans_result <- kmeans(data[, c("x", "y")], centers = 3)  # Ensure correct columns for clustering are selected


# Create cluster membership for each observation

data$kmeans_cluster <- as.factor(kmeans_result$cluster)
head(data)
            x           y kmeans_cluster
1 -0.56047565  0.25331851              3
2 -0.23017749 -0.02854676              2
3  1.55870831 -0.04287046              1
4  0.07050839  1.36860228              3
5  0.12928774 -0.22577099              2
6  1.71506499  1.51647060              1
#-------------------------
# Hierarchical clustering
#-------------------------

# Create hierarchical model using [x] and [y] features

dist_matrix <- dist(data[, c("x", "y")])  # Compute distance matrix
hclust_result <- hclust(dist_matrix) # Create hierarchical cluster
data$hclust_cluster <- cutree(hclust_result, k = 3) # Allocate cluster membership

head(data)
            x           y kmeans_cluster hclust_cluster
1 -0.56047565  0.25331851              3              1
2 -0.23017749 -0.02854676              2              1
3  1.55870831 -0.04287046              1              2
4  0.07050839  1.36860228              3              1
5  0.12928774 -0.22577099              2              2
6  1.71506499  1.51647060              1              2

We now have two models - one a partitioning model and one a hierarchical model.

We can visualise these using a scatter plot (partitioning) or a dendrogram (hierarchical):

Code
#----------------------
# Plot cluster models
#----------------------

# Create scatterplot
p1 <- ggplot(data, aes(x = x, y = y, color = kmeans_cluster)) +
  geom_point(size = 3) +
  labs(title = "Partitioning Model") +
  theme_minimal()

# Visualise dendrogram using factoextra
p2 <- fviz_dend(hclust_result, k = 3, show_labels = FALSE, rect = TRUE)

# Print the plots
print(p1)  # Scatter plot for k-means clustering

Code
print(p2)  # Dendrogram for hierarchical clustering

3 Measuring Similarity

Clustering is based on the similarity of data points - the aim is to maximise similarity within clusters and minimise similarity between them.

There are several different approaches we can use.

3.1 Euclidean and Manhattan Distance

Euclidean distance is the straight-line distance between two points.

Manhattan distance calculates the sum of the absolute differences along each dimension.

Euclidean Distance: 5 
Manhattan Distance: 7 
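In R, both distances can be computed with dist(). A short sketch, assuming the illustrative points (0, 0) and (3, 4), which give the values shown above:

Code
# Two illustrative points (assumed here; the original points are not shown)
p1 <- c(0, 0)
p2 <- c(3, 4)

# dist() expects observations as rows, so bind the points into a matrix
euclidean <- as.numeric(dist(rbind(p1, p2), method = "euclidean"))
manhattan <- as.numeric(dist(rbind(p1, p2), method = "manhattan"))

cat("Euclidean Distance:", euclidean, "\n")
cat("Manhattan Distance:", manhattan, "\n")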

3.2 Cosine Similarity

Cosine similarity measures the angle between two vectors, making it useful for high-dimensional data where magnitude is less important than direction.

Cosine Similarity: 0.9746318 
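Base R has no dedicated cosine similarity function, so it is usually computed directly from the dot product. A short sketch, assuming the illustrative vectors (1, 2, 3) and (4, 5, 6), which reproduce the value shown above:

Code
# Illustrative vectors (assumed here; the original vectors are not shown)
a <- c(1, 2, 3)
b <- c(4, 5, 6)

# Cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cat("Cosine Similarity:", cosine_sim, "\n")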

3.3 Pearson Correlation

The Pearson correlation is also a measure of similarity, as it measures the linear relationship between two variables ranging from -1 (perfect negative) to +1 (perfect positive).

Pearson Correlation: 0.98 
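In R this is simply cor(). A sketch with hypothetical vectors (the value printed above comes from the document's own example, not from these):

Code
# Hypothetical vectors for illustration
a <- c(2, 4, 6, 8, 10)
b <- c(1, 3, 5, 9, 11)

cat("Pearson Correlation:", round(cor(a, b, method = "pearson"), 2), "\n")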

4 Clustering Algorithms

You’ll encounter a number of different clustering algorithms. In this section we’ll cover how to run the most common ones in R.

4.1 k-means Clustering

K-means partitions the data into clusters, minimising the within-cluster variance. The algorithm iterates to find optimal cluster centroids.

Code
# Step 1: Create dataset
set.seed(123)  # For reproducibility
football_data <- data.frame(
  Speed = c(rnorm(50, mean = 8, sd = 1.5), rnorm(50, mean = 5, sd = 1)),
  Stamina = c(rnorm(50, mean = 7, sd = 1), rnorm(50, mean = 6, sd = 1)),
  ShootingAccuracy = c(rnorm(50, mean = 8, sd = 1), rnorm(50, mean = 4, sd = 1.5))
)

# Step 2: Run k-means clustering based on first two features (variables)
kmeans_2d <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
football_data$Cluster2D <- as.factor(kmeans_2d$cluster)

# Step 3: Visualise cluster membership for first two features
plot_2d <- ggplot(football_data, aes(x = Speed, y = Stamina, color = Cluster2D)) +
  geom_point(size = 3) +
  labs(title = "K-Means Clustering based on 2 features: Speed & Stamina)",
       x = "Speed",
       y = "Stamina",
       color = "Cluster") +
  theme_minimal()

# Step 4: Run k-means clustering based on all three features (numeric columns only,
# since football_data now also contains the Cluster2D factor)
kmeans_3d <- kmeans(football_data[, c("Speed", "Stamina", "ShootingAccuracy")], centers = 2)
football_data$Cluster3D <- as.factor(kmeans_3d$cluster)



# Display plots
print(plot_2d)

4.2 Hierarchical Clustering

Hierarchical clustering builds a dendrogram to represent the nested grouping of data points.

Code
fviz_dend(hclust_result, k = 3, show_labels = FALSE, rect = TRUE)

4.3 DBSCAN

DBSCAN groups data based on density, identifying clusters of arbitrary shapes and separating noise points.

For me, the main advantage is that you don’t have to specify the number of clusters in the model.

Applied to the previous dataset:

Code
# Apply DBSCAN to dataset

# Normalise data (optional but recommended for DBSCAN)
scaled_data <- scale(football_data[, c("Speed", "Stamina", "ShootingAccuracy")])
scaled_data <- data.frame(scaled_data)

# Run DBSCAN with eps and minPts values

dbscan_result <- dbscan(scaled_data, eps = 0.5, MinPts = 5) # note: this is fpc::dbscan (hence MinPts), not the dbscan package

football_data$DBSCAN_Cluster <- as.factor(dbscan_result$cluster)

# Visualise DBSCAN results using first two features
plot_dbscan_2d <- ggplot(football_data, aes(x = Speed, y = Stamina, color = DBSCAN_Cluster)) +
  geom_point(size = 3) +
  labs(title = "DBSCAN Clustering (2 Features: Speed & Stamina)",
       x = "Speed",
       y = "Stamina",
       color = "Cluster") +
  theme_minimal()

# Display plot
print(plot_dbscan_2d)

We can adjust epsilon (eps - the maximum distance between two points for them to be considered part of the same neighbourhood) and minimum points (minPts - the minimum number of points required to form a dense region):

Code
# Play about with eps and minPts

dbscan_result2 <- dbscan(scaled_data, eps = 0.6, MinPts = 7)
football_data$DBSCAN_Cluster2 <- as.factor(dbscan_result2$cluster)

# Visualise new DBSCAN results - REMEMBER: THESE PLOTS ONLY VISUALISE 2 DIMENSIONS!
plot_dbscan_3 <- ggplot(football_data, aes(x = Speed, y = Stamina, color = DBSCAN_Cluster2)) +
  geom_point(size = 3) +
  labs(title = "DBSCAN Clustering (2 Features: Speed & Stamina), eps = 0.6, minPts = 7",
       x = "Speed",
       y = "Stamina",
       color = "Cluster") +
  theme_minimal()

print(plot_dbscan_3)

Shiny app - dbscan_01

4.4 Gaussian Mixture Models

Gaussian Mixture Models (GMMs) use probabilistic models to identify clusters, assuming data is generated from multiple Gaussian distributions.

They are an example of ‘soft’ clustering - a data point can belong to more than one cluster.
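The mclust package is the usual way to fit a GMM in R (it is not in the library list at the top of this page, so treat this as an assumption). A minimal sketch on the football data, showing both the hard assignments and the soft posterior probabilities:

Code
# Sketch: fit a two-component GMM with mclust (package assumed to be installed)
library(mclust)

gmm_result <- Mclust(football_data[, c("Speed", "Stamina")], G = 2)

# 'Hard' assignment: the most likely component for each player
head(gmm_result$classification)

# 'Soft' assignment: posterior probability of belonging to each component
head(round(gmm_result$z, 3))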

Shiny app - gmm_01

5 Validating Clusters

Cluster analysis requires judgement in terms of the number of clusters we might identify. Some helpful tools have been developed to assist in evaluating the ‘best’ number of clusters to divide our observations into.

5.1 Silhouette Score

Note

The silhouette score measures how similar a data point is to its own cluster compared to other clusters. It’s a good measure of how effective the clustering process has been. The higher the average silhouette score, the better.

Code
# Load required libraries
library(ggplot2)
library(cluster)  # For silhouette score
library(dplyr)


# Run k-means clustering
set.seed(123)  # For reproducibility
kmeans_model <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
football_data$Cluster <- as.factor(kmeans_model$cluster)

# Compute silhouette score
silhouette_scores <- silhouette(kmeans_model$cluster, dist(football_data[, c("Speed", "Stamina")]))
silhouette_summary <- summary(silhouette_scores)

# Visualise silhouette scores
silhouette_plot <- function(silhouette_obj) {
  plot(silhouette_obj, border = NA, col = c("red", "blue"), main = "Silhouette Plot for K-Means Clustering",
       xlab = "Silhouette Width", ylab = "Cluster")
}

# Visualise clustering results
clustering_plot <- ggplot(football_data, aes(x = Speed, y = Stamina, color = Cluster)) +
  geom_point(size = 3) +
  labs(
    title = "K-Means Clustering Results",
    x = "Speed",
    y = "Stamina",
    color = "Cluster"
  ) +
  theme_minimal()

# summary and visualisations
cat("Average Silhouette Width:", silhouette_summary$avg.width, "\n")
Average Silhouette Width: 0.4985825 
Code
cat("Cluster Sizes:", silhouette_summary$clus.sizes, "\n")
Cluster Sizes: 44 56 
Code
# Display clustering plot
print(clustering_plot)

Code
# Display silhouette plot
silhouette_plot(silhouette_scores)

5.2 Davies-Bouldin Index

The Davies-Bouldin Index evaluates the ratio of intra-cluster distances to inter-cluster distances, with lower values indicating better clustering.

Code
# Load required libraries
library(clusterSim)
library(dplyr)
library(ggplot2)

# Run k-means clustering
set.seed(123)
kmeans_result <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
clusters <- kmeans_result$cluster

# Calculate Davies-Bouldin Index using clusterSim
dbi_value <- index.DB(football_data[, c("Speed", "Stamina")], clusters)$DB

# Visualise clustering
clustering_plot <- ggplot(football_data, aes(x = Speed, y = Stamina, color = as.factor(clusters))) +
  geom_point(size = 3) +
  labs(
    title = "K-Means Clustering Results",
    x = "Speed",
    y = "Stamina",
    color = "Cluster"
  ) +
  theme_minimal()

# Display results
print(clustering_plot)

Code
cat("Davis-Bouldin Index (DBI):", dbi_value, "\n")
Davis-Bouldin Index (DBI): 0.8364559 
Code
cat("Text Comments:\n")
Text Comments:
Code
cat("1. The Davis-Bouldin Index measures clustering quality based on compactness and separation.\n")
1. The Davis-Bouldin Index measures clustering quality based on compactness and separation.
Code
cat("2. Lower DBI values indicate better-defined clusters.\n")
2. Lower DBI values indicate better-defined clusters.
Code
cat("3. Use this metric to evaluate different numbers of clusters (k) to find the optimal clustering.\n")
3. Use this metric to evaluate different numbers of clusters (k) to find the optimal clustering.

5.3 Calinski-Harabasz Index

The Calinski-Harabasz Index evaluates clusters based on the variance ratio, with higher values indicating better-defined clusters.

Code
# Load libraries
library(cluster)
library(clusterSim)  # For Calinski-Harabasz Index
library(ggplot2)


# Run k-means clustering
set.seed(123)
kmeans_result <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
clusters <- kmeans_result$cluster

# Calculate the Calinski-Harabasz Index using clusterSim
ch_index <- index.G1(football_data[, c("Speed", "Stamina")], clusters)

# Visualise
clustering_plot <- ggplot(football_data, aes(x = Speed, y = Stamina, color = as.factor(clusters))) +
  geom_point(size = 3) +
  labs(
    title = "K-Means Clustering Results",
    x = "Speed",
    y = "Stamina",
    color = "Cluster"
  ) +
  theme_minimal()

# results
print(clustering_plot)

Code
cat("Calinski-Harabasz Index (CH Index):", ch_index, "\n")
Calinski-Harabasz Index (CH Index): 140.0164 
Code
cat("Text Comments:\n")
Text Comments:
Code
cat("1. The Calinski-Harabasz Index evaluates clustering quality based on compactness and separation.\n")
1. The Calinski-Harabasz Index evaluates clustering quality based on compactness and separation.
Code
cat("2. Higher values of CH Index indicate better-defined clusters.\n")
2. Higher values of CH Index indicate better-defined clusters.
Code
cat("3. Compare CH Index values across different numbers of clusters (k) to find the optimal clustering.\n")
3. Compare CH Index values across different numbers of clusters (k) to find the optimal clustering.

6 Example

We have a dataset for a rugby team. It includes player height, weight, speed and strength.
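The rugby_data object itself isn't created in this section. A minimal, hypothetical simulation of a comparable squad (values chosen only so the code below runs; they are not the original data):

Code
# Hypothetical simulation of a rugby squad (assumed; the original rugby_data is not shown)
set.seed(123)
n <- c(25, 50, 25)  # three rough positional groups
rugby_data <- data.frame(
  PlayerID = paste("Player", 1:sum(n)),
  Height   = c(rnorm(n[1], 170, 4),   rnorm(n[2], 188, 4),   rnorm(n[3], 200, 4)),
  Weight   = c(rnorm(n[1], 69, 3),    rnorm(n[2], 84, 4),    rnorm(n[3], 101, 5)),
  Speed    = c(rnorm(n[1], 4.5, 0.2), rnorm(n[2], 5.4, 0.3), rnorm(n[3], 6.0, 0.3)),
  Strength = c(rnorm(n[1], 95, 4),    rnorm(n[2], 109, 5),   rnorm(n[3], 130, 6))
)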

6.1 Data Scaling

First, we need to scale the dataset to ensure all features are on a comparable scale (very important for clustering).

Code
# Remove PlayerID for clustering and scale data
rugby_scaled <- scale(rugby_data[, -1])

# Check scaled data
summary(rugby_scaled)
     Height            Weight            Speed            Strength       
 Min.   :-2.1070   Min.   :-1.6257   Min.   :-1.6819   Min.   :-1.85475  
 1st Qu.:-0.7433   1st Qu.:-0.8112   1st Qu.:-0.5974   1st Qu.:-0.72505  
 Median : 0.1638   Median : 0.1222   Median : 0.1172   Median :-0.09287  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
 3rd Qu.: 0.7660   3rd Qu.: 0.9393   3rd Qu.: 0.7497   3rd Qu.: 0.66658  
 Max.   : 1.8583   Max.   : 2.0688   Max.   : 1.6404   Max.   : 2.25554  

6.2 Approach One: Hierarchical Clustering

Compute Distance Matrix

Hierarchical clustering requires a distance matrix. We’ll use Euclidean distance.

Code
# Compute the distance matrix
dist_matrix <- dist(rugby_scaled, method = "euclidean")

Perform Hierarchical Clustering

We use Ward’s method for hierarchical clustering. Ward’s method aims to minimise the total within-cluster variance at each step. It is an agglomerative clustering technique, meaning it starts with each data point as its own cluster and then iteratively merges clusters based on a criterion until all points belong to a single cluster.

Ward’s method is a good starting point for hierarchical cluster analysis as it tends to produce compact and well-separated clusters.

Code
# Perform hierarchical clustering
hclust_result <- hclust(dist_matrix, method = "ward.D2")

# Plot the dendrogram
plot(hclust_result, labels = FALSE, main = "Hierarchical Clustering Dendrogram")
abline(h = 7, col = "red", lty = 2)  # Add a threshold for cutting clusters

Assign Clusters

Ward’s method, like other hierarchical clustering methods, does not inherently decide the number of clusters to retain. The decision on the number of clusters is left to the user, who must interpret the dendrogram or use statistical metrics to determine the optimal cut point.

Based on the dendrogram above, we’ll cut the dendrogram to form 3 clusters.

Code
# Cut dendrogram into 3 clusters
rugby_data$HierarchicalCluster <- cutree(hclust_result, k = 3)

# Display cluster assignments
table(rugby_data$HierarchicalCluster)

 1  2  3 
25 50 25 

Since the dataset has multiple features (4 in this case), we can use Principal Component Analysis (PCA) to reduce the dimensionality and plot the clusters in 2D.

Code
# Perform PCA for dimensionality reduction
pca_result <- prcomp(rugby_scaled, scale. = TRUE)

# Create a data frame for visualization
pca_data <- as.data.frame(pca_result$x)
pca_data$Cluster <- as.factor(rugby_data$HierarchicalCluster)

# Plot the PCA results with cluster colors
library(ggplot2)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "Hierarchical Clustering Visualization (PCA Projection)",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal()

To understand why observations are assigned to specific clusters, we can compare the centroids of each cluster, and evaluate how each feature (height, weight, speed, strength) differs between clusters.

Code
library(tidyr)

# Compute mean values for each cluster

cluster_summary <- rugby_data %>%
  group_by(HierarchicalCluster) %>%
  summarise(
    Avg_Height = mean(Height),
    Avg_Weight = mean(Weight),
    Avg_Speed = mean(Speed),
    Avg_Strength = mean(Strength),
    .groups = "drop"
  )

# Display the summary table
print(cluster_summary)
# A tibble: 3 × 5
  HierarchicalCluster Avg_Height Avg_Weight Avg_Speed Avg_Strength
                <int>      <dbl>      <dbl>     <dbl>        <dbl>
1                   1       170.       69.2      4.51         94.9
2                   2       188.       84.4      5.36        109. 
3                   3       200.      101.       6.02        130. 
Code
# Create a long-format data frame for visualization
cluster_summary_long <- cluster_summary %>%
  pivot_longer(cols = starts_with("Avg_"), names_to = "Feature", values_to = "Value")

# Plot the cluster-wise feature averages
ggplot(cluster_summary_long, aes(x = Feature, y = Value, fill = as.factor(HierarchicalCluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Feature Averages by Cluster",
       x = "Features",
       y = "Average Value",
       fill = "Cluster") +
  theme_minimal()

We could also use:

Code
# PCA cluster plot with feature loadings

fviz_pca_biplot(pca_result, label = "var", habillage = rugby_data$HierarchicalCluster,
                addEllipses = TRUE, ellipse.level = 0.95, repel = TRUE) +
  labs(title = "PCA Biplot with Clusters and Feature Loadings",
       x = "Principal Component 1",
       y = "Principal Component 2")

# Prepare data for heatmap

library(pheatmap)

# Example data
set.seed(123)
heatmap_data <- as.data.frame(matrix(rnorm(50), nrow = 10))
rownames(heatmap_data) <- paste("Player", 1:10)
colnames(heatmap_data) <- c("Height", "Weight", "Speed", "Strength", "Extra")

# Create row annotation
row_annotation <- data.frame(Cluster = sample(1:3, 10, replace = TRUE))
rownames(row_annotation) <- rownames(heatmap_data)

# Define annotation colors
annotation_colors <- list(
  Cluster = c("1" = "red", "2" = "blue", "3" = "green")
)

# Plot heatmap
pheatmap(heatmap_data,
         annotation_row = row_annotation,
         annotation_colors = annotation_colors,
         cluster_rows = FALSE,
         cluster_cols = TRUE,
         main = "Heatmap with Annotations")

6.3 Approach Two: Partitioning Clustering (K-means)

Determine Optimal Number of Clusters

We use the Elbow Method to determine the optimal number of clusters.

Code
# Elbow method to determine the optimal number of clusters
fviz_nbclust(rugby_scaled, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal K")

Perform K-means Clustering

We assume 4 clusters based on the elbow method.

Code
# Perform k-means clustering
set.seed(123)
kmeans_result <- kmeans(rugby_scaled, centers = 4, nstart = 25)

# Add K-means clusters to the dataset
rugby_data$KMeansCluster <- kmeans_result$cluster

# Display cluster assignments
table(rugby_data$KMeansCluster)

 1  2  3  4 
24 26 25 25 

We could visualise this in a few different ways:

Code
# Perform PCA
library(ggplot2)
pca_result <- prcomp(rugby_scaled, scale. = TRUE)
pca_data <- as.data.frame(pca_result$x)
pca_data$Cluster <- as.factor(rugby_data$KMeansCluster)

# Plot PCA with clusters
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "PCA Plot of K-Means Clusters",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal()

Code
library(GGally)

# Add cluster assignments to scaled data
rugby_scaled_df <- as.data.frame(rugby_scaled)
rugby_scaled_df$Cluster <- as.factor(rugby_data$KMeansCluster)

# Parallel coordinate plot
ggparcoord(data = rugby_scaled_df, columns = 1:4, groupColumn = "Cluster",
           scale = "globalminmax", alphaLines = 0.5) +
  labs(title = "Parallel Coordinate Plot of Features by Cluster",
       x = "Features",
       y = "Scaled Values") +
  theme_minimal()

Code
# Count observations in each cluster
cluster_sizes <- as.data.frame(table(rugby_data$KMeansCluster))
colnames(cluster_sizes) <- c("Cluster", "Count")

# Plot cluster sizes
ggplot(cluster_sizes, aes(x = Cluster, y = Count, fill = Cluster)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Observations in Each Cluster",
       x = "Cluster",
       y = "Count") +
  theme_minimal()

Code
# Create example heatmap data
set.seed(123)
heatmap_data <- as.data.frame(matrix(rnorm(40), nrow = 10))
rownames(heatmap_data) <- paste("Player", 1:10, sep = "_")
colnames(heatmap_data) <- c("Height", "Weight", "Speed", "Strength")

# Create annotation for clusters
row_annotation <- data.frame(Cluster = sample(1:4, 10, replace = TRUE))
rownames(row_annotation) <- rownames(heatmap_data)

# Define annotation colors
annotation_colors <- list(
  Cluster = c("1" = "red", "2" = "blue", "3" = "green", "4" = "purple")
)

# Generate heatmap
pheatmap(heatmap_data,
         annotation_row = row_annotation,
         annotation_colors = annotation_colors,
         cluster_rows = FALSE,
         cluster_cols = TRUE,
         main = "Heatmap with Clusters")

6.4 Evaluate Clusters

Silhouette Score

We evaluate both hierarchical and k-means clusters using the silhouette score.

Code
# Silhouette score for hierarchical clustering
silhouette_hc <- silhouette(rugby_data$HierarchicalCluster, dist_matrix)
fviz_silhouette(silhouette_hc) + labs(title = "Silhouette Plot (Hierarchical)")
  cluster size ave.sil.width
1       1   25          0.73
2       2   50          0.47
3       3   25          0.69

Code
# Silhouette score for k-means clustering
silhouette_km <- silhouette(kmeans_result$cluster, dist_matrix)
fviz_silhouette(silhouette_km) + labs(title = "Silhouette Plot (K-means)")
  cluster size ave.sil.width
1       1   24          0.48
2       2   26          0.51
3       3   25          0.60
4       4   25          0.65

Davies-Bouldin Index

Code
# Davies-Bouldin Index for hierarchical clustering

library(clusterCrit)

# Prepare data for clusterCrit
rugby_scaled_df <- as.data.frame(rugby_scaled)
cluster_labels <- rugby_data$HierarchicalCluster

# Calculate internal cluster validation indices
db_index <- intCriteria(as.matrix(rugby_scaled_df), cluster_labels, "Davies_Bouldin")
cat("Davies-Bouldin Index (Hierarchical):", round(db_index[[1]], 2), "\n")
Davies-Bouldin Index (Hierarchical): 0.52 
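
The same call works for the k-means solution, which lets us compare the two approaches directly. A sketch (the resulting value depends on the k-means fit above):

Code
# Davies-Bouldin Index for k-means clustering (same metric, different cluster labels)
db_index_km <- intCriteria(as.matrix(rugby_scaled_df),
                           as.integer(rugby_data$KMeansCluster),
                           "Davies_Bouldin")
cat("Davies-Bouldin Index (K-means):", round(db_index_km[[1]], 2), "\n")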

6.5 Visualise Clusters

Cluster Plot for Hierarchical Clustering

Code
fviz_cluster(list(data = rugby_scaled, cluster = rugby_data$HierarchicalCluster)) +
  labs(title = "Cluster Plot (Hierarchical Clustering)")

Cluster Plot for K-means

Code
fviz_cluster(kmeans_result, data = rugby_scaled) +
  labs(title = "Cluster Plot (K-means Clustering)")

6.6 Assign Clusters to Observations

Each observation (player) is now assigned to a cluster. We can inspect this in the dataset.

Code
# Display the first few rows with cluster assignments
library(dplyr)

rugby_data %>%
  dplyr::select(PlayerID, Height, Weight, Speed, Strength, HierarchicalCluster, KMeansCluster) %>%
  head()
  PlayerID   Height   Weight    Speed Strength HierarchicalCluster
1 Player 1 167.7581 67.86878 4.719881 91.42379                   1
2 Player 2 169.0793 70.77065 4.631241 91.23656                   1
3 Player 3 176.2348 69.25992 4.473485 90.30731                   1
4 Player 4 170.2820 68.95737 4.554319 89.73743                   1
5 Player 5 170.5172 67.14514 4.458566 92.81420                   1
6 Player 6 176.8603 69.86492 4.452375 96.65590                   1
  KMeansCluster
1             4
2             4
3             4
4             4
5             4
6             4

shiny apps - rugby_example_01 and 02