library(ggplot2)
library(cluster)
library(factoextra)
library(dplyr)
# library(dbscan)
library(clusterSim)
library(fpc)
library(pheatmap)
library(GGally)
library(clusterCrit)
flowchart TD
    A[Types] --> B[Partitioning]
    A --> C[Hierarchical]
    classDef level1 fill:#add8e6,stroke:#333,stroke-width:1px,font-weight:normal;
    classDef level2 fill:#fffacd,stroke:#333,stroke-width:1px,font-weight:normal;
    class A level1;
    class B,C level2;
Partitioning methods, such as k-means, require the number of clusters to be specified upfront.
Hierarchical clustering, in contrast, builds a tree-like structure (dendrogram) that shows how clusters are formed and allows flexibility in choosing the number of clusters.
In R, the commands for creating both partitioning and hierarchical cluster models are fairly simple:
#-------------------------
# K-means clustering (Partitioning)
#-------------------------
# Create model using [x] and [y] features
kmeans_result <- kmeans(data[, c("x", "y")], centers = 3) # Ensure correct columns for clustering are selected

# Create cluster membership for each observation
data$kmeans_cluster <- as.factor(kmeans_result$cluster)
head(data)
x y kmeans_cluster
1 -0.56047565 0.25331851 3
2 -0.23017749 -0.02854676 2
3 1.55870831 -0.04287046 1
4 0.07050839 1.36860228 3
5 0.12928774 -0.22577099 2
6 1.71506499 1.51647060 1
#-------------------------
# Hierarchical clustering
#-------------------------
# Create hierarchical model using [x] and [y] features
dist_matrix <- dist(data[, c("x", "y")]) # Compute distance matrix
hclust_result <- hclust(dist_matrix) # Create hierarchical cluster
data$hclust_cluster <- cutree(hclust_result, k = 3) # Allocate cluster membership
head(data)
x y kmeans_cluster hclust_cluster
1 -0.56047565 0.25331851 3 1
2 -0.23017749 -0.02854676 2 1
3 1.55870831 -0.04287046 1 2
4 0.07050839 1.36860228 3 1
5 0.12928774 -0.22577099 2 2
6 1.71506499 1.51647060 1 2
We now have two models - one a partitioning model and one a hierarchical model.
We can visualise these using a scatter plot (partitioning) or a dendrogram (hierarchical):
#----------------------
# Plot cluster models
#----------------------
# Create scatterplot
p1 <- ggplot(data, aes(x = x, y = y, color = kmeans_cluster)) +
  geom_point(size = 3) +
  labs(title = "Partitioning Model") +
  theme_minimal()

# Visualise dendrogram using factoextra
p2 <- fviz_dend(hclust_result, k = 3, show_labels = FALSE, rect = TRUE)
# Print the plots
print(p1) # Scatter plot for k-means clustering
print(p2) # Dendrogram for hierarchical clustering
Clustering is based on the similarity of data points - the aim is to maximise the similarity of points within a cluster while keeping different clusters as distinct as possible.
There are several different approaches we can use.
Euclidean distance is the straight-line distance between two points.
Manhattan distance calculates the sum of the absolute differences along each dimension.
Euclidean Distance: 5
Manhattan Distance: 7
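The printed values above come from a single pair of points. As a minimal sketch, assuming the points (0, 0) and (3, 4) (which reproduce the values shown), the two distances can be computed directly:
# Sketch: Euclidean and Manhattan distance for two assumed points
a <- c(0, 0)
b <- c(3, 4)

# Euclidean distance: straight-line distance
sqrt(sum((a - b)^2))                     # 5

# Manhattan distance: sum of absolute differences along each dimension
sum(abs(a - b))                          # 7

# The same results via dist()
dist(rbind(a, b), method = "euclidean")
dist(rbind(a, b), method = "manhattan")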
Cosine similarity measures the angle between two vectors, making it useful for high-dimensional data where magnitude is less important than direction.
Cosine Similarity: 0.9746318
The Pearson correlation can also serve as a similarity measure: it captures the linear relationship between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
Pearson Correlation: 0.98
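Cosine similarity and the Pearson correlation are just as easy to compute by hand. The vectors behind the printed values aren't shown, so the ones below are illustrative; c(1, 2, 3) and c(4, 5, 6) happen to give a cosine similarity of about 0.9746:
# Sketch: cosine similarity and Pearson correlation for two illustrative vectors
a <- c(1, 2, 3)
b <- c(4, 5, 6)

# Cosine similarity: dot product divided by the product of the vector magnitudes
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))   # ~0.9746

# Pearson correlation: linear association between the two vectors (-1 to +1)
cor(a, b, method = "pearson")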
You’ll encounter a number of different clustering algorithms. In this section we’ll cover how to run the most common ones in R.
K-means partitions the data into clusters, minimising the within-cluster variance. The algorithm iterates to find optimal cluster centroids.
# Step 1: Create dataset
set.seed(123) # For reproducibility
football_data <- data.frame(
  Speed = c(rnorm(50, mean = 8, sd = 1.5), rnorm(50, mean = 5, sd = 1)),
  Stamina = c(rnorm(50, mean = 7, sd = 1), rnorm(50, mean = 6, sd = 1)),
  ShootingAccuracy = c(rnorm(50, mean = 8, sd = 1), rnorm(50, mean = 4, sd = 1.5))
)

# Step 2: Run k-means clustering based on first two features (variables)
kmeans_2d <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
football_data$Cluster2D <- as.factor(kmeans_2d$cluster)
# Step 3: Visualise cluster membership for first two features
plot_2d <- ggplot(football_data, aes(x = Speed, y = Stamina, color = Cluster2D)) +
  geom_point(size = 3) +
  labs(title = "K-Means Clustering based on 2 features: Speed & Stamina",
       x = "Speed",
       y = "Stamina",
       color = "Cluster") +
  theme_minimal()

# Step 4: Run k-means clustering based on all three features
# (select the numeric feature columns only, not the cluster column added above)
kmeans_3d <- kmeans(football_data[, c("Speed", "Stamina", "ShootingAccuracy")], centers = 2)
football_data$Cluster3D <- as.factor(kmeans_3d$cluster)
# Display plots
print(plot_2d)
Hierarchical clustering builds a dendrogram to represent the nested grouping of data points.
fviz_dend(hclust_result, k = 3, show_labels = FALSE, rect = TRUE)
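The dendrogram above re-uses the hclust_result model built earlier. As a sketch of applying the same workflow to the football data created above (the choice of k = 2 here is illustrative, mirroring the k-means example):
# Sketch: hierarchical clustering on the football data (illustrative k = 2)
football_dist <- dist(football_data[, c("Speed", "Stamina", "ShootingAccuracy")])
football_hclust <- hclust(football_dist, method = "ward.D2")

# Cut the tree into two clusters and inspect the membership counts
table(cutree(football_hclust, k = 2))

# Dendrogram with the two clusters highlighted
fviz_dend(football_hclust, k = 2, show_labels = FALSE, rect = TRUE)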
DBSCAN groups data based on density, identifying clusters of arbitrary shapes and separating noise points.
For me, the main advantage is that you don’t have to specify the number of clusters up front.
Applied to the previous dataset:
# Apply DBSCAN to dataset
# Normalise data (optional but recommended for DBSCAN)
scaled_data <- scale(football_data[, c("Speed", "Stamina", "ShootingAccuracy")])
scaled_data <- data.frame(scaled_data)

# Run DBSCAN with eps and MinPts values
dbscan_result <- dbscan(scaled_data, eps = 0.5, MinPts = 5) # note I'm using the fpc package, not the dbscan package
football_data$DBSCAN_Cluster <- as.factor(dbscan_result$cluster)
# Visualise DBSCAN results using first two features
plot_dbscan_2d <- ggplot(football_data, aes(x = Speed, y = Stamina, color = DBSCAN_Cluster)) +
  geom_point(size = 3) +
  labs(title = "DBSCAN Clustering (2 Features: Speed & Stamina)",
       x = "Speed",
       y = "Stamina",
       color = "Cluster") +
  theme_minimal()
# Display plot
print(plot_dbscan_2d)
We can manipulate Epsilon (the maximum distance between two points for them to be considered part of the same neighbourhood) and Minimum Points (the minimum number of points required to form a dense region):
# Play about with eps and minPts
dbscan_result2 <- dbscan(scaled_data, eps = 0.6, MinPts = 7)
football_data$DBSCAN_Cluster2 <- as.factor(dbscan_result2$cluster)

# Visualise new DBSCAN results - REMEMBER THESE ONLY VISUALISE 2 DIMENSIONS!
plot_dbscan_3 <- ggplot(football_data, aes(x = Speed, y = Stamina, color = DBSCAN_Cluster2)) +
  geom_point(size = 3) +
  labs(title = "DBSCAN Clustering (2 Features: Speed & Stamina), eps = 0.6, minPts = 7",
       x = "Speed",
       y = "Stamina",
       color = "Cluster") +
  theme_minimal()
print(plot_dbscan_3)
Shiny app - dbscan_01
Gaussian Mixture Models (GMMs) use probabilistic models to identify clusters, assuming data is generated from multiple Gaussian distributions.
They are an example of ‘soft’ clustering - a data point can belong to more than one cluster.
Shiny app - gmm_01
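There is no worked GMM example in this section, so here is a minimal sketch. It assumes the mclust package, which is not loaded above; Mclust() fits the mixture and returns both soft (probabilistic) and hard cluster assignments:
# Sketch: Gaussian Mixture Model on the football data (assumes the mclust package)
library(mclust)

gmm_fit <- Mclust(football_data[, c("Speed", "Stamina", "ShootingAccuracy")], G = 2)

summary(gmm_fit)              # chosen covariance model and mixing proportions
head(gmm_fit$z)               # 'soft' memberships: probability of belonging to each cluster
head(gmm_fit$classification)  # hard assignment: the most probable cluster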
Cluster analysis requires judgement about how many clusters to identify. Some helpful tools have been developed to assist in evaluating the ‘best’ number of clusters to divide our observations into.
The silhouette score measures how similar a data point is to its own cluster compared to other clusters. It’s a good measure of how effective the clustering process has been. The higher the average silhouette score, the better.
# Load required libraries
library(ggplot2)
library(cluster) # For silhouette score
library(dplyr)
# Run k-means clustering
set.seed(123) # For reproducibility
kmeans_model <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
football_data$Cluster <- as.factor(kmeans_model$cluster)
# Compute silhouette score
silhouette_scores <- silhouette(kmeans_model$cluster, dist(football_data[, c("Speed", "Stamina")]))
silhouette_summary <- summary(silhouette_scores)
# Visualise silhouette scores
silhouette_plot <- function(silhouette_obj) {
  plot(silhouette_obj, border = NA, col = c("red", "blue"),
       main = "Silhouette Plot for K-Means Clustering",
       xlab = "Silhouette Width", ylab = "Cluster")
}
# Visualise clustering results
clustering_plot <- ggplot(football_data, aes(x = Speed, y = Stamina, color = Cluster)) +
  geom_point(size = 3) +
  labs(
    title = "K-Means Clustering Results",
    x = "Speed",
    y = "Stamina",
    color = "Cluster"
  ) +
  theme_minimal()
# summary and visualisations
cat("Average Silhouette Width:", silhouette_summary$avg.width, "\n")
Average Silhouette Width: 0.4985825
cat("Cluster Sizes:", silhouette_summary$clus.sizes, "\n")
Cluster Sizes: 44 56
# Display clustering plot
print(clustering_plot)
# Display silhouette plot
silhouette_plot(silhouette_scores)
The Davies-Bouldin Index evaluates the ratio of intra-cluster distances to inter-cluster distances, with lower values indicating better clustering.
# Load required libraries
library(clusterSim)
library(dplyr)
library(ggplot2)
# Run k-means clustering
set.seed(123)
kmeans_result <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
clusters <- kmeans_result$cluster
# Calculate Davies-Bouldin Index using clusterSim
dbi_value <- index.DB(football_data[, c("Speed", "Stamina")], clusters)$DB
# Visualise clustering
clustering_plot <- ggplot(football_data, aes(x = Speed, y = Stamina, color = as.factor(clusters))) +
  geom_point(size = 3) +
  labs(
    title = "K-Means Clustering Results",
    x = "Speed",
    y = "Stamina",
    color = "Cluster"
  ) +
  theme_minimal()
# Display results
print(clustering_plot)
cat("Davis-Bouldin Index (DBI):", dbi_value, "\n")
Davis-Bouldin Index (DBI): 0.8364559
cat("Text Comments:\n")
Text Comments:
cat("1. The Davis-Bouldin Index measures clustering quality based on compactness and separation.\n")
1. The Davis-Bouldin Index measures clustering quality based on compactness and separation.
cat("2. Lower DBI values indicate better-defined clusters.\n")
2. Lower DBI values indicate better-defined clusters.
cat("3. Use this metric to evaluate different numbers of clusters (k) to find the optimal clustering.\n")
3. Use this metric to evaluate different numbers of clusters (k) to find the optimal clustering.
The Calinski-Harabasz Index evaluates clusters based on the variance ratio, with higher values indicating better-defined clusters.
# Load libraries
library(cluster)
library(clusterSim) # For Calinski-Harabasz Index
library(ggplot2)
# Run k-means clustering
set.seed(123)
kmeans_result <- kmeans(football_data[, c("Speed", "Stamina")], centers = 2)
clusters <- kmeans_result$cluster
# Calculate the Calinski-Harabasz Index using clusterSim
ch_index <- index.G1(football_data[, c("Speed", "Stamina")], clusters)
# Visualise
clustering_plot <- ggplot(football_data, aes(x = Speed, y = Stamina, color = as.factor(clusters))) +
  geom_point(size = 3) +
  labs(
    title = "K-Means Clustering Results",
    x = "Speed",
    y = "Stamina",
    color = "Cluster"
  ) +
  theme_minimal()
# results
print(clustering_plot)
cat("Calinski-Harabasz Index (CH Index):", ch_index, "\n")
Calinski-Harabasz Index (CH Index): 140.0164
cat("Text Comments:\n")
Text Comments:
cat("1. The Calinski-Harabasz Index evaluates clustering quality based on compactness and separation.\n")
1. The Calinski-Harabasz Index evaluates clustering quality based on compactness and separation.
cat("2. Higher values of CH Index indicate better-defined clusters.\n")
2. Higher values of CH Index indicate better-defined clusters.
cat("3. Compare CH Index values across different numbers of clusters (k) to find the optimal clustering.\n")
3. Compare CH Index values across different numbers of clusters (k) to find the optimal clustering.
We have a dataset for a rugby team. It includes player height, weight, speed and strength.
First, we need to scale the dataset to ensure all features are on a comparable scale (very important for clustering).
# Remove PlayerID for clustering and scale data
rugby_scaled <- scale(rugby_data[, -1])
# Check scaled data
summary(rugby_scaled)
Height Weight Speed Strength
Min. :-2.1070 Min. :-1.6257 Min. :-1.6819 Min. :-1.85475
1st Qu.:-0.7433 1st Qu.:-0.8112 1st Qu.:-0.5974 1st Qu.:-0.72505
Median : 0.1638 Median : 0.1222 Median : 0.1172 Median :-0.09287
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.7660 3rd Qu.: 0.9393 3rd Qu.: 0.7497 3rd Qu.: 0.66658
Max. : 1.8583 Max. : 2.0688 Max. : 1.6404 Max. : 2.25554
Hierarchical clustering requires a distance matrix. We’ll use Euclidean distance.
# Compute the distance matrix
dist_matrix <- dist(rugby_scaled, method = "euclidean")
We use Ward’s method for hierarchical clustering. Ward’s method aims to minimise the total within-cluster variance at each step. It is an agglomerative clustering technique, meaning it starts with each data point as its own cluster and then iteratively merges clusters based on a criterion until all points belong to a single cluster.
Ward’s method is a good starting point for hierarchical cluster analysis as it produces compact and well-separated clusters.
# Perform hierarchical clustering
hclust_result <- hclust(dist_matrix, method = "ward.D2")
# Plot the dendrogram
plot(hclust_result, labels = FALSE, main = "Hierarchical Clustering Dendrogram")
abline(h = 7, col = "red", lty = 2) # Add a threshold for cutting clusters
Ward’s method, like other hierarchical clustering methods, does not inherently decide the number of clusters to retain. The decision on the number of clusters is left to the user, who must interpret the dendrogram or use statistical metrics to determine the optimal cut point.
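As a sketch of one metric-based check, factoextra's fviz_nbclust() can be pointed at its hcut() wrapper to plot the average silhouette width across a range of candidate cuts (the k.max value here is an arbitrary choice):
# Sketch: average silhouette width for hierarchical cuts of k = 2..8
fviz_nbclust(rugby_scaled, FUNcluster = hcut, method = "silhouette", k.max = 8) +
  labs(title = "Average Silhouette Width by Number of Clusters (Hierarchical)")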
Based on the dendrogram above, we’ll cut the tree to form 3 clusters.
# Cut dendrogram into 3 clusters
rugby_data$HierarchicalCluster <- cutree(hclust_result, k = 3)
# Display cluster assignments
table(rugby_data$HierarchicalCluster)
1 2 3
25 50 25
Since the dataset has multiple features (4 in this case), we can use Principal Component Analysis (PCA) to reduce the dimensionality and plot the clusters in 2D.
# Perform PCA for dimensionality reduction
pca_result <- prcomp(rugby_scaled, scale. = TRUE)

# Create a data frame for visualization
pca_data <- as.data.frame(pca_result$x)
pca_data$Cluster <- as.factor(rugby_data$HierarchicalCluster)
# Plot the PCA results with cluster colors
library(ggplot2)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 3) +
labs(title = "Hierarchical Clustering Visualization (PCA Projection)",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
To understand why observations are assigned to specific clusters, we can compare the centroids of each cluster, and evaluate how each feature (height, weight, speed, strength) differs between clusters.
library(tidyr)
# Compute mean values for each cluster
cluster_summary <- rugby_data %>%
  group_by(HierarchicalCluster) %>%
  summarise(
    Avg_Height = mean(Height),
    Avg_Weight = mean(Weight),
    Avg_Speed = mean(Speed),
    Avg_Strength = mean(Strength),
    .groups = "drop"
  )
# Display the summary table
print(cluster_summary)
# A tibble: 3 × 5
HierarchicalCluster Avg_Height Avg_Weight Avg_Speed Avg_Strength
<int> <dbl> <dbl> <dbl> <dbl>
1 1 170. 69.2 4.51 94.9
2 2 188. 84.4 5.36 109.
3 3 200. 101. 6.02 130.
# Create a long-format data frame for visualization
cluster_summary_long <- cluster_summary %>%
  pivot_longer(cols = starts_with("Avg_"), names_to = "Feature", values_to = "Value")
# Plot the cluster-wise feature averages
ggplot(cluster_summary_long, aes(x = Feature, y = Value, fill = as.factor(HierarchicalCluster))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Feature Averages by Cluster",
x = "Features",
y = "Average Value",
fill = "Cluster") +
theme_minimal()
We could also use:
# PCA cluster plot with feature loadings
fviz_pca_biplot(pca_result, label = "var", habillage = rugby_data$HierarchicalCluster,
addEllipses = TRUE, ellipse.level = 0.95, repel = TRUE) +
labs(title = "PCA Biplot with Clusters and Feature Loadings",
x = "Principal Component 1",
y = "Principal Component 2")
# Prepare data for heatmap
library(pheatmap)
# Example data
set.seed(123)
heatmap_data <- as.data.frame(matrix(rnorm(50), nrow = 10))
rownames(heatmap_data) <- paste("Player", 1:10)
colnames(heatmap_data) <- c("Height", "Weight", "Speed", "Strength", "Extra")

# Create row annotation
row_annotation <- data.frame(Cluster = sample(1:3, 10, replace = TRUE))
rownames(row_annotation) <- rownames(heatmap_data)

# Define annotation colors
annotation_colors <- list(
  Cluster = c("1" = "red", "2" = "blue", "3" = "green")
)
# Plot heatmap
pheatmap(heatmap_data,
annotation_row = row_annotation,
annotation_colors = annotation_colors,
cluster_rows = FALSE,
cluster_cols = TRUE,
main = "Heatmap with Annotations")
We use the Elbow Method to determine the optimal number of clusters.
# Elbow method to determine the optimal number of clusters
fviz_nbclust(rugby_scaled, kmeans, method = "wss") +
labs(title = "Elbow Method for Optimal K")
We assume 4 clusters based on the elbow method.
# Perform k-means clustering
set.seed(123)
kmeans_result <- kmeans(rugby_scaled, centers = 4, nstart = 25)

# Add K-means clusters to the dataset
rugby_data$KMeansCluster <- kmeans_result$cluster
# Display cluster assignments
table(rugby_data$KMeansCluster)
1 2 3 4
24 26 25 25
We could visualise this in a few different ways:
# Perform PCA
library(ggplot2)
pca_result <- prcomp(rugby_scaled, scale. = TRUE)
pca_data <- as.data.frame(pca_result$x)
pca_data$Cluster <- as.factor(rugby_data$KMeansCluster)
# Plot PCA with clusters
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 3) +
labs(title = "PCA Plot of K-Means Clusters",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
library(GGally)
# Add cluster assignments to scaled data
rugby_scaled_df <- as.data.frame(rugby_scaled)
rugby_scaled_df$Cluster <- as.factor(rugby_data$KMeansCluster)
# Parallel coordinate plot
ggparcoord(data = rugby_scaled_df, columns = 1:4, groupColumn = "Cluster",
scale = "globalminmax", alphaLines = 0.5) +
labs(title = "Parallel Coordinate Plot of Features by Cluster",
x = "Features",
y = "Scaled Values") +
theme_minimal()
# Count observations in each cluster
cluster_sizes <- as.data.frame(table(rugby_data$KMeansCluster))
colnames(cluster_sizes) <- c("Cluster", "Count")
# Plot cluster sizes
ggplot(cluster_sizes, aes(x = Cluster, y = Count, fill = Cluster)) +
geom_bar(stat = "identity") +
labs(title = "Number of Observations in Each Cluster",
x = "Cluster",
y = "Count") +
theme_minimal()
# Create example heatmap data
set.seed(123)
heatmap_data <- as.data.frame(matrix(rnorm(40), nrow = 10))
rownames(heatmap_data) <- paste("Player", 1:10, sep = "_")
colnames(heatmap_data) <- c("Height", "Weight", "Speed", "Strength")

# Create annotation for clusters
row_annotation <- data.frame(Cluster = sample(1:4, 10, replace = TRUE))
rownames(row_annotation) <- rownames(heatmap_data)

# Define annotation colors
annotation_colors <- list(
  Cluster = c("1" = "red", "2" = "blue", "3" = "green", "4" = "purple")
)
# Generate heatmap
pheatmap(heatmap_data,
annotation_row = row_annotation,
annotation_colors = annotation_colors,
cluster_rows = FALSE,
cluster_cols = TRUE,
main = "Heatmap with Clusters")
We evaluate both hierarchical and k-means clusters using the silhouette score.
# Silhouette score for hierarchical clustering
silhouette_hc <- silhouette(rugby_data$HierarchicalCluster, dist_matrix)
fviz_silhouette(silhouette_hc) + labs(title = "Silhouette Plot (Hierarchical)")
cluster size ave.sil.width
1 1 25 0.73
2 2 50 0.47
3 3 25 0.69
# Silhouette score for k-means clustering
silhouette_km <- silhouette(kmeans_result$cluster, dist_matrix)
fviz_silhouette(silhouette_km) + labs(title = "Silhouette Plot (K-means)")
cluster size ave.sil.width
1 1 24 0.48
2 2 26 0.51
3 3 25 0.60
4 4 25 0.65
# Davies-Bouldin Index for hierarchical clustering
library(clusterCrit)
# Prepare data for clusterCrit
rugby_scaled_df <- as.data.frame(rugby_scaled)
cluster_labels <- rugby_data$HierarchicalCluster

# Calculate internal cluster validation indices
db_index <- intCriteria(as.matrix(rugby_scaled_df), cluster_labels, "Davies_Bouldin")
cat("Davies-Bouldin Index (Hierarchical):", round(db_index[[1]], 2), "\n")
Davies-Bouldin Index (Hierarchical): 0.52
fviz_cluster(list(data = rugby_scaled, cluster = rugby_data$HierarchicalCluster)) +
labs(title = "Cluster Plot (Hierarchical Clustering)")
fviz_cluster(kmeans_result, data = rugby_scaled) +
labs(title = "Cluster Plot (K-means Clustering)")
Each observation (player) is now assigned to a cluster. We can inspect this in the dataset.
# Display the first few rows with cluster assignments
library(dplyr)
rugby_data %>%
  dplyr::select(PlayerID, Height, Weight, Speed, Strength, HierarchicalCluster, KMeansCluster) %>%
  head()
PlayerID Height Weight Speed Strength HierarchicalCluster
1 Player 1 167.7581 67.86878 4.719881 91.42379 1
2 Player 2 169.0793 70.77065 4.631241 91.23656 1
3 Player 3 176.2348 69.25992 4.473485 90.30731 1
4 Player 4 170.2820 68.95737 4.554319 89.73743 1
5 Player 5 170.5172 67.14514 4.458566 92.81420 1
6 Player 6 176.8603 69.86492 4.452375 96.65590 1
KMeansCluster
1 4
2 4
3 4
4 4
5 4
6 4
Shiny apps - rugby_example_01 and 02