library(ggplot2)
library(cluster)
library(factoextra)
library(dplyr)
# library(dbscan)
library(clusterSim)
library(fpc)
library(pheatmap)
library(GGally)
library(clusterCrit)

[Diagram: types of clustering - Partitioning and Hierarchical]
Partitioning methods, such as k-means, require the number of clusters to be specified upfront.
Hierarchical clustering, in contrast, builds a tree-like structure (dendrogram) that shows how clusters are formed and allows flexibility in choosing the number of clusters.
In R, the commands for creating both partitioning and hierarchical cluster models are fairly simple:
#-------------------------
# K-means clustering (Partitioning)
#-------------------------
# Create model using [x] and [y] features
kmeans_result <- kmeans(data[, c("x", "y")], centers = 3) # Ensure correct columns for clustering are selected
# Create cluster membership for each observation
data$kmeans_cluster <- as.factor(kmeans_result$cluster)
head(data)
            x           y kmeans_cluster
1 -0.56047565  0.25331851              3
2 -0.23017749 -0.02854676              2
3  1.55870831 -0.04287046              1
4  0.07050839  1.36860228              3
5  0.12928774 -0.22577099              2
6  1.71506499  1.51647060              1
#-------------------------
# Hierarchical clustering
#-------------------------
# Create hierarchical model using [x] and [y] features
dist_matrix <- dist(data[, c("x", "y")]) # Compute distance matrix
hclust_result <- hclust(dist_matrix) # Create hierarchical cluster
data$hclust_cluster <- cutree(hclust_result, k = 3) # Allocate cluster membership
head(data)
            x           y kmeans_cluster hclust_cluster
1 -0.56047565  0.25331851              3              1
2 -0.23017749 -0.02854676              2              1
3  1.55870831 -0.04287046              1              2
4  0.07050839  1.36860228              3              1
5  0.12928774 -0.22577099              2              2
6  1.71506499  1.51647060              1              2
We now have two models - one a partitioning model and one a hierarchical model.
We can visualise these using a scatter plot (partitioning) or a dendrogram (hierarchical):
#----------------------
# Plot cluster models
#----------------------
# Create scatterplot
p1 <- ggplot(data, aes(x = x, y = y, color = kmeans_cluster)) +
geom_point(size = 3) +
labs(title = "Partitioning Model") +
theme_minimal()
# Visualise dendrogram using factoextra
p2 <- fviz_dend(hclust_result, k = 3, show_labels = FALSE, rect = TRUE)
# Print the plots
print(p1) # Scatter plot for k-means clustering
print(p2) # Dendrogram for hierarchical clustering
Clustering is based on the similarity of data points: the aim is to group observations so that points within a cluster are as similar as possible.
There are several different ways to measure similarity (or distance) between points.
Euclidean distance is the straight-line distance between two points.
Manhattan distance calculates the sum of the absolute differences along each dimension.
Euclidean Distance: 5
Manhattan Distance: 7
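The points behind these values aren’t shown above, but (0, 0) and (3, 4) happen to reproduce them; a quick sketch in base R (the points are assumed for illustration):
# Two example points - assumed for illustration (they reproduce the distances above)
p1 <- c(0, 0)
p2 <- c(3, 4)
# Euclidean distance: straight-line distance between the points
euclidean_dist <- sqrt(sum((p1 - p2)^2))
# Manhattan distance: sum of absolute differences along each dimension
manhattan_dist <- sum(abs(p1 - p2))
cat("Euclidean Distance:", euclidean_dist, "\n")
cat("Manhattan Distance:", manhattan_dist, "\n")
The same values can also be obtained with dist(rbind(p1, p2)) and dist(rbind(p1, p2), method = "manhattan").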

Cosine similarity measures the angle between two vectors, making it useful for high-dimensional data where magnitude is less important than direction.
Cosine Similarity: 0.9746318
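Base R has no built-in cosine similarity function, so we can compute it directly from the dot product. The vectors (1, 2, 3) and (4, 5, 6) are assumed for illustration (they reproduce the value above):
# Example vectors - assumed for illustration (they reproduce the similarity above)
a <- c(1, 2, 3)
b <- c(4, 5, 6)
# Cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cat("Cosine Similarity:", cosine_sim, "\n")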

The Pearson correlation can also be used as a similarity measure: it captures the linear relationship between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
Pearson Correlation: 0.98
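In R this is simply cor(). The vectors below are invented for illustration and are not the data behind the 0.98 shown above:
# Illustrative vectors - not the data used to produce the 0.98 above
x_vals <- c(2, 4, 5, 7, 9)
y_vals <- c(1, 3, 4, 8, 9)
# Pearson correlation: strength and direction of the linear relationship
cat("Pearson Correlation:", round(cor(x_vals, y_vals, method = "pearson"), 2), "\n")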

You’ll encounter a number of different clustering algorithms. In this section we’ll cover how to run the most common ones in R.
K-means partitions the data into k clusters by minimising the within-cluster variance. The algorithm iteratively updates the cluster centroids and reassigns points until the assignments stabilise.
# Step 1: Create dataset
set.seed(123) # For reproducibility
football_data <- data.frame(
Speed = c(rnorm(50, mean = 8, sd = 1.5), rnorm(50, mean = 5, sd = 1)),
Stamina = c(rnorm(50, mean = 7, sd = 1), rnorm(50, mean = 6, sd = 1)),
ShootingAccuracy = c(rnorm(50, mean = 8, sd = 1), rnorm(50, mean = 4, sd = 1.5))
)
# Step 2: Run k-means clustering based on first two features (variables)
kmeans_2d <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
football_data$Cluster2D <- as.factor(kmeans_2d$cluster)
# Step 3: Visualise cluster membership for first two features
plot_2d <- ggplot(football_data, aes(x = Speed, y = Stamina, color = Cluster2D)) +
geom_point(size = 3) +
labs(title = "K-Means Clustering based on 2 features: Speed & Stamina)",
x = "Speed",
y = "Stamina",
color = "Cluster") +
theme_minimal()
# Step 4: Run k-means clustering based on all three features
kmeans_3d <- kmeans(football_data[, c("Speed", "Stamina", "ShootingAccuracy")], centers = 2) # Select only the numeric features; the Cluster2D factor column would break kmeans()
football_data$Cluster3D <- as.factor(kmeans_3d$cluster)
# Display plots
print(plot_2d)
Hierarchical clustering builds a dendrogram to represent the nested grouping of data points.
fviz_dend(hclust_result, k = 3, show_labels = FALSE, rect = TRUE)
DBSCAN groups data based on density, identifying clusters of arbitrary shapes and separating noise points.
For me, the main advantage is that you don’t have to specify the number of clusters in the model.
Applied to the previous dataset:
# Apply DBSCAN to dataset
# Normalise data (optional but recommended for DBSCAN)
scaled_data <- scale(football_data[, c("Speed", "Stamina", "ShootingAccuracy")])
scaled_data <- data.frame(scaled_data)
# Run DBSCAN with eps and minPts values
dbscan_result <- dbscan(scaled_data, eps = 0.5, MinPts = 5) # note I'm using FPC, not DBSCAN library
football_data$DBSCAN_Cluster <- as.factor(dbscan_result$cluster)
# Visualise DBSCAN results using first two features
plot_dbscan_2d <- ggplot(football_data, aes(x = Speed, y = Stamina, color = DBSCAN_Cluster)) +
geom_point(size = 3) +
labs(title = "DBSCAN Clustering (2 Features: Speed & Stamina)",
x = "Speed",
y = "Stamina",
color = "Cluster") +
theme_minimal()
# Display plot
print(plot_dbscan_2d)
We can adjust Epsilon (the maximum distance between two points for them to be considered part of the same neighbourhood) and Minimum Points (the minimum number of points required to form a dense region):
# Play about with eps and minPts
dbscan_result2 <- dbscan(scaled_data, eps = 0.6, MinPts = 7)
football_data$DBSCAN_Cluster2 <- as.factor(dbscan_result2$cluster)
# Visualise new DBSCAN results - REMEMBER THESE ONLY VISUALISE 2 DIMENSIONS!
plot_dbscan_3 <- ggplot(football_data, aes(x = Speed, y = Stamina, color = DBSCAN_Cluster2)) +
geom_point(size = 3) +
labs(title = "DBSCAN Clustering (2 Features: Speed & Stamina) eps=0.6, minPts =7",
x = "Speed",
y = "Stamina",
color = "Cluster") +
theme_minimal()
print(plot_dbscan_3)
Shiny app - dbscan_01
Gaussian Mixture Models (GMMs) use probabilistic models to identify clusters, assuming data is generated from multiple Gaussian distributions.
They are an example of ‘soft’ clustering: rather than a single hard assignment, each data point receives a probability of belonging to each cluster.
Shiny app - gmm_01
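The Shiny app lets you explore this interactively. If you want to fit a GMM directly in R, the mclust package is one common option; it isn’t used elsewhere in this chapter, so treat this as an optional sketch rather than part of the main workflow:
# Optional sketch: fit a two-component Gaussian mixture to Speed and Stamina (requires the mclust package)
library(mclust)
gmm_fit <- Mclust(football_data[, c("Speed", "Stamina")], G = 2)
# Hard assignments (most likely component for each player)
football_data$GMM_Cluster <- as.factor(gmm_fit$classification)
# Soft memberships: each row gives the probability of belonging to each component
head(round(gmm_fit$z, 3))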
Cluster analysis requires judgement about how many clusters to identify. Some helpful tools have been developed to assist in evaluating the ‘best’ number of clusters to divide our observations into.
The silhouette score measures how similar a data point is to its own cluster compared to other clusters. It’s a good measure of how effective the clustering process has been. The higher the average silhouette score, the better.
# Load required libraries
library(ggplot2)
library(cluster) # For silhouette score
library(dplyr)
# Run k-means clustering
set.seed(123) # For reproducibility
kmeans_model <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
football_data$Cluster <- as.factor(kmeans_model$cluster)
# Compute silhouette score
silhouette_scores <- silhouette(kmeans_model$cluster, dist(football_data[, c("Speed", "Stamina")]))
silhouette_summary <- summary(silhouette_scores)
# Visualise silhouette scores
silhouette_plot <- function(silhouette_obj) {
plot(silhouette_obj, border = NA, col = c("red", "blue"), main = "Silhouette Plot for K-Means Clustering",
xlab = "Silhouette Width", ylab = "Cluster")
}
# Visualise clustering results
clustering_plot <- ggplot(football_data, aes(x = Speed, y = Stamina, color = Cluster)) +
geom_point(size = 3) +
labs(
title = "K-Means Clustering Results",
x = "Speed",
y = "Stamina",
color = "Cluster"
) +
theme_minimal()
# summary and visualisations
cat("Average Silhouette Width:", silhouette_summary$avg.width, "\n")Average Silhouette Width: 0.4985825
cat("Cluster Sizes:", silhouette_summary$clus.sizes, "\n")Cluster Sizes: 44 56
# Display clustering plot
print(clustering_plot)
# Display silhouette plot
silhouette_plot(silhouette_scores)
The Davies-Bouldin Index evaluates the ratio of intra-cluster distances to inter-cluster distances, with lower values indicating better clustering.
# Load required libraries
library(clusterSim)
library(dplyr)
library(ggplot2)
# Run k-means clustering
set.seed(123)
kmeans_result <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
clusters <- kmeans_result$cluster
# Calculate the Davies-Bouldin Index using clusterSim
dbi_value <- index.DB(football_data[, c("Speed", "Stamina")], clusters)$DB
# Visualise clustering
clustering_plot <- ggplot(football_data, aes(x = Speed, y = Stamina, color = as.factor(clusters))) +
geom_point(size = 3) +
labs(
title = "K-Means Clustering Results",
x = "Speed",
y = "Stamina",
color = "Cluster"
) +
theme_minimal()
# Display results
print(clustering_plot)
cat("Davis-Bouldin Index (DBI):", dbi_value, "\n")Davis-Bouldin Index (DBI): 0.8364559
cat("Text Comments:\n")Text Comments:
cat("1. The Davis-Bouldin Index measures clustering quality based on compactness and separation.\n")1. The Davis-Bouldin Index measures clustering quality based on compactness and separation.
cat("2. Lower DBI values indicate better-defined clusters.\n")2. Lower DBI values indicate better-defined clusters.
cat("3. Use this metric to evaluate different numbers of clusters (k) to find the optimal clustering.\n")3. Use this metric to evaluate different numbers of clusters (k) to find the optimal clustering.
The Calinski-Harabasz Index evaluates clusters based on the variance ratio, with higher values indicating better-defined clusters.
# Load libraries
library(cluster)
library(clusterSim) # For Calinski-Harabasz Index
library(ggplot2)
# Run k-means clustering
set.seed(123)
kmeans_result <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
clusters <- kmeans_result$cluster
# Calculate the Calinski-Harabasz Index using clusterSim
ch_index <- index.G1(football_data[, c("Speed", "Stamina")], clusters)
# Visualise
clustering_plot <- ggplot(football_data, aes(x = Speed, y = Stamina, color = as.factor(clusters))) +
geom_point(size = 3) +
labs(
title = "K-Means Clustering Results",
x = "Speed",
y = "Stamina",
color = "Cluster"
) +
theme_minimal()
# results
print(clustering_plot)
cat("Calinski-Harabasz Index (CH Index):", ch_index, "\n")Calinski-Harabasz Index (CH Index): 140.0164
cat("Text Comments:\n")Text Comments:
cat("1. The Calinski-Harabasz Index evaluates clustering quality based on compactness and separation.\n")1. The Calinski-Harabasz Index evaluates clustering quality based on compactness and separation.
cat("2. Higher values of CH Index indicate better-defined clusters.\n")2. Higher values of CH Index indicate better-defined clusters.
cat("3. Compare CH Index values across different numbers of clusters (k) to find the optimal clustering.\n")3. Compare CH Index values across different numbers of clusters (k) to find the optimal clustering.
We have a dataset for a rugby team. It includes player height, weight, speed and strength.
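The code that builds rugby_data isn’t shown in this section. A minimal sketch of a dataset with the assumed structure (a PlayerID column plus four numeric features for 100 players) is given below; the group means are invented for illustration, so it won’t reproduce the outputs that follow exactly:
# Sketch of a comparable dataset - assumed structure and values, not the original data
set.seed(123)
rugby_data <- data.frame(
  PlayerID = paste("Player", 1:100),
  Height   = c(rnorm(25, mean = 170, sd = 4), rnorm(50, mean = 188, sd = 4), rnorm(25, mean = 200, sd = 4)),
  Weight   = c(rnorm(25, mean = 69, sd = 3), rnorm(50, mean = 84, sd = 4), rnorm(25, mean = 101, sd = 4)),
  Speed    = c(rnorm(25, mean = 4.5, sd = 0.2), rnorm(50, mean = 5.4, sd = 0.3), rnorm(25, mean = 6.0, sd = 0.3)),
  Strength = c(rnorm(25, mean = 95, sd = 3), rnorm(50, mean = 109, sd = 5), rnorm(25, mean = 130, sd = 5))
)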
First, we need to scale the dataset so that all features are on a comparable scale (important for distance-based clustering).
# Remove PlayerID for clustering and scale data
rugby_scaled <- scale(rugby_data[, -1])
# Check scaled data
summary(rugby_scaled)
     Height            Weight            Speed            Strength
 Min.   :-2.1070   Min.   :-1.6257   Min.   :-1.6819   Min.   :-1.85475
 1st Qu.:-0.7433   1st Qu.:-0.8112   1st Qu.:-0.5974   1st Qu.:-0.72505
 Median : 0.1638   Median : 0.1222   Median : 0.1172   Median :-0.09287
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000
 3rd Qu.: 0.7660   3rd Qu.: 0.9393   3rd Qu.: 0.7497   3rd Qu.: 0.66658
 Max.   : 1.8583   Max.   : 2.0688   Max.   : 1.6404   Max.   : 2.25554
Hierarchical clustering requires a distance matrix. We’ll use Euclidean distance.
# Compute the distance matrix
dist_matrix <- dist(rugby_scaled, method = "euclidean")
We use Ward’s method for hierarchical clustering. Ward’s method aims to minimise the total within-cluster variance at each step. It is an agglomerative technique, meaning it starts with each data point as its own cluster and then iteratively merges clusters based on a linkage criterion until all points belong to a single cluster.
Ward’s method is a good starting point for hierarchical cluster analysis as it tends to produce compact, well-separated clusters.
# Perform hierarchical clustering
hclust_result <- hclust(dist_matrix, method = "ward.D2")
# Plot the dendrogram
plot(hclust_result, labels = FALSE, main = "Hierarchical Clustering Dendrogram")
abline(h = 7, col = "red", lty = 2) # Add a threshold for cutting clusters
Ward’s method, like other hierarchical clustering methods, does not inherently decide the number of clusters to retain. The decision on the number of clusters is left to the user, who must interpret the dendrogram or use statistical metrics to determine the optimal cut point.
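One metric-based check is the average silhouette width over a range of candidate k values. Here is a sketch using factoextra’s fviz_nbclust() with hierarchical cutting (hcut); this is an optional aid, not necessarily how the cut below was chosen:
# Optional check: average silhouette width for k = 2 to 10 using hierarchical clustering (hcut)
fviz_nbclust(rugby_scaled, hcut, method = "silhouette") +
  labs(title = "Optimal Number of Clusters (Silhouette, Hierarchical)")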
Based on the dendrogram above, we’ll cut the tree to form 3 clusters.
# Cut dendrogram into 3 clusters
rugby_data$HierarchicalCluster <- cutree(hclust_result, k = 3)
# Display cluster assignments
table(rugby_data$HierarchicalCluster)
1 2 3
25 50 25
Since the dataset has multiple features (4 in this case), we can use Principal Component Analysis (PCA) to reduce the dimensionality and plot the clusters in 2D.
# Perform PCA for dimensionality reduction
pca_result <- prcomp(rugby_scaled, scale. = TRUE)
# Create a data frame for visualization
pca_data <- as.data.frame(pca_result$x)
pca_data$Cluster <- as.factor(rugby_data$HierarchicalCluster)
# Plot the PCA results with cluster colors
library(ggplot2)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 3) +
labs(title = "Hierarchical Clustering Visualization (PCA Projection)",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
To understand why observations are assigned to specific clusters, we can compare the centroids of each cluster, and evaluate how each feature (height, weight, speed, strength) differs between clusters.
library(tidyr)
# Compute mean values for each cluster
cluster_summary <- rugby_data %>%
group_by(HierarchicalCluster) %>%
summarise(
Avg_Height = mean(Height),
Avg_Weight = mean(Weight),
Avg_Speed = mean(Speed),
Avg_Strength = mean(Strength),
.groups = "drop"
)
# Display the summary table
print(cluster_summary)
# A tibble: 3 × 5
  HierarchicalCluster Avg_Height Avg_Weight Avg_Speed Avg_Strength
                <int>      <dbl>      <dbl>     <dbl>        <dbl>
1                   1       170.       69.2      4.51         94.9
2                   2       188.       84.4      5.36        109.
3                   3       200.      101.       6.02        130.
# Create a long-format data frame for visualization
cluster_summary_long <- cluster_summary %>%
pivot_longer(cols = starts_with("Avg_"), names_to = "Feature", values_to = "Value")
# Plot the cluster-wise feature averages
ggplot(cluster_summary_long, aes(x = Feature, y = Value, fill = as.factor(HierarchicalCluster))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Feature Averages by Cluster",
x = "Features",
y = "Average Value",
fill = "Cluster") +
theme_minimal()
We could also use:
# PCA cluster plot with feature loadings
fviz_pca_biplot(pca_result, label = "var", habillage = rugby_data$HierarchicalCluster,
addEllipses = TRUE, ellipse.level = 0.95, repel = TRUE) +
labs(title = "PCA Biplot with Clusters and Feature Loadings",
x = "Principal Component 1",
y = "Principal Component 2")
# Prepare data for heatmap
library(pheatmap)
# Example data
set.seed(123)
heatmap_data <- as.data.frame(matrix(rnorm(50), nrow = 10))
rownames(heatmap_data) <- paste("Player", 1:10)
colnames(heatmap_data) <- c("Height", "Weight", "Speed", "Strength", "Extra")
# Create row annotation
row_annotation <- data.frame(Cluster = sample(1:3, 10, replace = TRUE))
rownames(row_annotation) <- rownames(heatmap_data)
# Define annotation colors
annotation_colors <- list(
Cluster = c("1" = "red", "2" = "blue", "3" = "green")
)
# Plot heatmap
pheatmap(heatmap_data,
annotation_row = row_annotation,
annotation_colors = annotation_colors,
cluster_rows = FALSE,
cluster_cols = TRUE,
main = "Heatmap with Annotations")
We use the Elbow Method to determine the optimal number of clusters.
# Elbow method to determine the optimal number of clusters
fviz_nbclust(rugby_scaled, kmeans, method = "wss") +
labs(title = "Elbow Method for Optimal K")
We assume 4 clusters based on the elbow method.
# Perform k-means clustering
set.seed(123)
kmeans_result <- kmeans(rugby_scaled, centers = 4, nstart = 25)
# Add K-means clusters to the dataset
rugby_data$KMeansCluster <- kmeans_result$cluster
# Display cluster assignments
table(rugby_data$KMeansCluster)
1 2 3 4
24 26 25 25
We could visualise this in a few different ways:
# Perform PCA
library(ggplot2)
pca_result <- prcomp(rugby_scaled, scale. = TRUE)
pca_data <- as.data.frame(pca_result$x)
pca_data$Cluster <- as.factor(rugby_data$KMeansCluster)
# Plot PCA with clusters
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 3) +
labs(title = "PCA Plot of K-Means Clusters",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
library(GGally)
# Add cluster assignments to scaled data
rugby_scaled_df <- as.data.frame(rugby_scaled)
rugby_scaled_df$Cluster <- as.factor(rugby_data$KMeansCluster)
# Parallel coordinate plot
ggparcoord(data = rugby_scaled_df, columns = 1:4, groupColumn = "Cluster",
scale = "globalminmax", alphaLines = 0.5) +
labs(title = "Parallel Coordinate Plot of Features by Cluster",
x = "Features",
y = "Scaled Values") +
theme_minimal()
# Count observations in each cluster
cluster_sizes <- as.data.frame(table(rugby_data$KMeansCluster))
colnames(cluster_sizes) <- c("Cluster", "Count")
# Plot cluster sizes
ggplot(cluster_sizes, aes(x = Cluster, y = Count, fill = Cluster)) +
geom_bar(stat = "identity") +
labs(title = "Number of Observations in Each Cluster",
x = "Cluster",
y = "Count") +
theme_minimal()
# Create example heatmap data
set.seed(123)
heatmap_data <- as.data.frame(matrix(rnorm(40), nrow = 10))
rownames(heatmap_data) <- paste("Player", 1:10, sep = "_")
colnames(heatmap_data) <- c("Height", "Weight", "Speed", "Strength")
# Create annotation for clusters
row_annotation <- data.frame(Cluster = sample(1:4, 10, replace = TRUE))
rownames(row_annotation) <- rownames(heatmap_data)
# Define annotation colors
annotation_colors <- list(
Cluster = c("1" = "red", "2" = "blue", "3" = "green", "4" = "purple")
)
# Generate heatmap
pheatmap(heatmap_data,
annotation_row = row_annotation,
annotation_colors = annotation_colors,
cluster_rows = FALSE,
cluster_cols = TRUE,
main = "Heatmap with Clusters")
We evaluate both hierarchical and k-means clusters using the silhouette score.
# Silhouette score for hierarchical clustering
silhouette_hc <- silhouette(rugby_data$HierarchicalCluster, dist_matrix)
fviz_silhouette(silhouette_hc) + labs(title = "Silhouette Plot (Hierarchical)")
  cluster size ave.sil.width
1       1   25          0.73
2       2   50          0.47
3       3   25          0.69

# Silhouette score for k-means clustering
silhouette_km <- silhouette(kmeans_result$cluster, dist_matrix)
fviz_silhouette(silhouette_km) + labs(title = "Silhouette Plot (K-means)")
  cluster size ave.sil.width
1       1   24          0.48
2       2   26          0.51
3       3   25          0.60
4       4   25          0.65

# Davies-Bouldin Index for hierarchical clustering
library(clusterCrit)
# Prepare data for clusterCrit
rugby_scaled_df <- as.data.frame(rugby_scaled)
cluster_labels <- rugby_data$HierarchicalCluster
# Calculate internal cluster validation indices
db_index <- intCriteria(as.matrix(rugby_scaled_df), cluster_labels, "Davies_Bouldin")
cat("Davies-Bouldin Index (Hierarchical):", round(db_index[[1]], 2), "\n")Davies-Bouldin Index (Hierarchical): 0.52
fviz_cluster(list(data = rugby_scaled, cluster = rugby_data$HierarchicalCluster)) +
labs(title = "Cluster Plot (Hierarchical Clustering)")
fviz_cluster(kmeans_result, data = rugby_scaled) +
labs(title = "Cluster Plot (K-means Clustering)")
Each observation (player) is now assigned to a cluster. We can inspect this in the dataset.
# Display the first few rows with cluster assignments
library(dplyr)
rugby_data %>%
dplyr::select(PlayerID, Height, Weight, Speed, Strength, HierarchicalCluster, KMeansCluster) %>%
head()
  PlayerID   Height   Weight    Speed Strength HierarchicalCluster KMeansCluster
1 Player 1 167.7581 67.86878 4.719881 91.42379                   1             4
2 Player 2 169.0793 70.77065 4.631241 91.23656                   1             4
3 Player 3 176.2348 69.25992 4.473485 90.30731                   1             4
4 Player 4 170.2820 68.95737 4.554319 89.73743                   1             4
5 Player 5 170.5172 67.14514 4.458566 92.81420                   1             4
6 Player 6 176.8603 69.86492 4.452375 96.65590                   1             4
Shiny apps - rugby_example_01 and 02