Ensemble Methods in Machine Learning

B1705, Week Nine

Introduction to Ensemble Methods

Summary

Definition

Ensemble methods combine multiple individual machine learning models to achieve better predictive performance than any single model alone.

Core Principle

Their effectiveness relies on reducing errors by aggregating diverse predictions from multiple models.

Bias-Variance Balance

Ensembles address the bias-variance trade-off, often reducing variance significantly without substantially increasing bias.

Model Diversity

Effective ensembles require individual models that make different types of errors. Diversity can be introduced through varied training data, different algorithms, or randomisation.

Bagging (Bootstrap Aggregating)

Models trained independently on bootstrap samples (random samples drawn with replacement from the training data) have their predictions averaged, which improves stability and reduces variance (e.g., random forests).

Boosting

Sequentially builds models, with each new model focusing on correcting errors made by previous models, effectively reducing bias (e.g., AdaBoost, Gradient Boosting, XGBoost).

Stacking (Meta-Ensembling)

Combines predictions from multiple diverse models using a meta-model that learns how best to aggregate these predictions for optimal accuracy.

Aggregation Strategies

Common methods of combining predictions include averaging (for regression), majority voting (for classification), weighted voting (assigning importance based on model performance), and meta-model predictions (stacking).
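
As a minimal sketch of the first two strategies (the predictions below are hypothetical, not values from the practical), averaging and majority voting take only a few lines of base R:

# Hypothetical predictions from three base models for five cases
reg_preds <- rbind(m1 = c(2.1, 3.4, 1.9, 4.0, 2.8),
                   m2 = c(2.4, 3.1, 2.2, 3.7, 3.0),
                   m3 = c(1.9, 3.6, 2.0, 4.2, 2.6))
ensemble_reg <- colMeans(reg_preds)   # averaging, for regression

# Majority voting over predicted class labels, for classification
cls_preds <- rbind(m1 = c("win", "loss", "win", "win",  "loss"),
                   m2 = c("win", "win",  "win", "loss", "loss"),
                   m3 = c("loss", "loss", "win", "win", "loss"))
ensemble_cls <- apply(cls_preds, 2, function(v) names(which.max(table(v))))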

Hyperparameter Tuning

The performance of ensemble models often relies on tuning parameters, such as the number of base learners, depth of trees, learning rates, and feature selection strategies.

Bias-variance trade-off

  • Bias -> systematic error from oversimplification; leads to underfitting.
  • Variance -> sensitivity to fluctuations in the training data; causes overfitting.

Ensembles reduce variance without increasing bias significantly.
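
A small simulation makes this concrete (a sketch on simulated data using rpart regression trees, which are not part of the practical): the prediction of a single tree varies considerably from one training sample to the next, while the average over bootstrap-fitted trees is expected to be noticeably more stable.

library(rpart)
set.seed(1)
sim_pred <- function(bagged = FALSE, B = 25) {
  x <- runif(200, 0, 2); y <- sin(2 * x) + rnorm(200, sd = 0.3)
  d <- data.frame(x, y); new_x <- data.frame(x = 0.5)
  if (!bagged) return(predict(rpart(y ~ x, data = d), newdata = new_x))
  mean(replicate(B, {
    b <- d[sample(nrow(d), replace = TRUE), ]        # bootstrap resample
    predict(rpart(y ~ x, data = b), newdata = new_x)
  }))
}
var(replicate(200, sim_pred(bagged = FALSE)))   # single tree: larger spread
var(replicate(200, sim_pred(bagged = TRUE)))    # bagged average: typically smaller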

Model diversity

  • Achieved through bagging, varying model architectures, and introducing randomness.

Diversity ensures errors in one model are countered by strengths in others.

Model diversity - ‘Bagging’

  • Train multiple models independently on random data subsets.

  • Each model has unique decision boundaries, reducing variance.

  • Example -> random forests.

Model diversity - model architecture

  • Combining different model types leverages diverse strengths.
  • Example -> logistic regression, neural networks, gradient boosting each analyse different aspects of an outcome.

Model diversity - randomness

  • Injecting randomness in model training enhances diversity.

  • Random forests select random subsets of features per tree (see the mtry sketch below).

  • Improves robustness against data variability.
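
For instance, in the randomForest package the mtry argument sets how many randomly chosen features each tree may consider at a split. A minimal sketch on the built-in iris data (not the sports dataset used later):

library(randomForest)
set.seed(42)
# Each split sees only 2 of the 4 predictors: trees become less correlated
rf_sub  <- randomForest(Species ~ ., data = iris, ntree = 200, mtry = 2)
# Each split sees all 4 predictors: trees become more alike
rf_full <- randomForest(Species ~ ., data = iris, ntree = 200, mtry = 4)
rf_sub$err.rate[200, "OOB"]    # out-of-bag error with feature sub-sampling
rf_full$err.rate[200, "OOB"]   # out-of-bag error without it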

Aggregation strategies

  • Combines predictions from diverse models.

  • Key methods: Averaging, weighted voting, boosting, stacking.

Aggregation - averaging

  • Simplest method, common in bagging (e.g., random forests).
  • Reduces variance for stable, reliable predictions.
  • Effective in regression tasks.

Aggregation - weighted voting

  • Assigns model weights based on historical accuracy (see the sketch below).
  • Useful in classification tasks (e.g., match outcome prediction).
  • Optimises model contributions for better accuracy.
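
A minimal sketch with hypothetical numbers (the models, probabilities and accuracies are invented for illustration): each model's predicted probability is weighted by its historical validation accuracy before the final class is chosen.

# Predicted probability that Team A wins, from three hypothetical models
p <- c(model1 = 0.70, model2 = 0.55, model3 = 0.40)
# Weights proportional to each model's historical validation accuracy
w <- c(0.80, 0.75, 0.60)
w <- w / sum(w)                          # normalise the weights
p_weighted <- sum(w * p)                 # weighted ensemble probability
ifelse(p_weighted > 0.5, "Team A wins", "Team A loses")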

Aggregation - boosting

  • Sequential training to correct previous model errors.
  • Focuses on difficult-to-predict cases, improving accuracy.
  • Example -> Gradient Boosting (XGBoost).

Aggregation - stacking

  • Higher-level meta-model combines predictions of diverse base models.
  • Meta-model determines optimal weighting of base predictions.
  • Effective in complex tasks like player valuation.

Advanced Bagging and Random Forests

Introduction

  • Bagging generates multiple models from re-sampled (bootstrap) datasets.
  • Aggregates by averaging (regression) or majority voting (classification).
  • Particularly suitable for high-variance models (decision trees).

Inner workings of bagging algorithms

  • Models trained on bootstrap samples to enhance diversity (see the sketch below).
  • Aggregation through averaging (regression) or majority voting (classification).
  • Mitigates overfitting of sensitive models like decision trees.
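
The mechanics can be written out by hand (a sketch on the built-in iris data with rpart trees, rather than the ipred implementation used in the demonstration): draw bootstrap samples, fit one tree per sample, then take a majority vote.

library(rpart)
set.seed(7)
B <- 25
n <- nrow(iris)
# Fit one classification tree per bootstrap sample (rows drawn with replacement)
trees <- lapply(1:B, function(b) {
  idx <- sample(n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})
# Collect each tree's predicted class and take a majority vote per observation
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == iris$Species)        # training accuracy of the bagged ensemble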

Random forests: enhancing bagging

  • Introduces additional randomness with feature subsets.
  • De-correlates individual trees, improving generalisation.
  • Ideal for tasks with correlated predictors.

Hyperparameter tuning for random forests

  • Crucial hyperparameters: number of trees, max features, tree depth.
  • Balances performance and complexity.
  • Tuning methods: grid search, Bayesian optimisation (a grid-search sketch follows).
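
A simple grid search can use the out-of-bag (OOB) error that randomForest already reports, avoiding a separate validation set (a sketch on the built-in iris data; the grid values are arbitrary):

library(randomForest)
set.seed(99)
grid <- expand.grid(mtry = c(1, 2, 3, 4), ntree = c(100, 300))
grid$oob_error <- apply(grid, 1, function(g) {
  rf <- randomForest(Species ~ ., data = iris,
                     mtry = g[["mtry"]], ntree = g[["ntree"]])
  rf$err.rate[g[["ntree"]], "OOB"]       # OOB error after the final tree
})
grid[which.min(grid$oob_error), ]        # best combination found on this grid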

Evaluating feature importance

  • Methods: mean decrease in impurity, permutation importance (see the sketch below).
  • Identifies influential variables in predictive tasks.
  • Provides actionable insights in sports analytics.
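
Both measures come directly from the randomForest package; permutation importance requires importance = TRUE when fitting (a sketch continuing the iris example; in a sports setting the same calls would be applied to the match-outcome forest):

library(randomForest)
set.seed(11)
rf <- randomForest(Species ~ ., data = iris, ntree = 300, importance = TRUE)
importance(rf, type = 1)   # permutation importance (mean decrease in accuracy)
importance(rf, type = 2)   # mean decrease in impurity (Gini)
varImpPlot(rf)             # visual comparison of the two measures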

Advanced Boosting Methods

Introduction

  • Boosting combines weak learners to reduce errors sequentially.
  • Methods: AdaBoost, Gradient Boosting, XGBoost.
  • Improves accuracy on complex datasets.

AdaBoost

  • Sequentially trains weak learners emphasising misclassified instances.
  • Updates data weights to focus on difficult examples (see the sketch below).
  • Optimisation through iteration number and learning rate.
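
The weight-update logic can be sketched by hand with rpart decision stumps (a hand-rolled illustration of the discrete AdaBoost recipe on simulated data, not a production implementation):

library(rpart)
set.seed(3)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, 1, -1)     # labels coded -1 / +1
d  <- data.frame(x1, x2, cls = factor(y))
w  <- rep(1 / n, n)                                       # uniform case weights
M  <- 20; alpha <- numeric(M); stumps <- vector("list", M)
for (m in 1:M) {
  stumps[[m]] <- rpart(cls ~ x1 + x2, data = d, weights = w, method = "class",
                       control = rpart.control(maxdepth = 1))    # a weak stump
  pred <- as.numeric(as.character(predict(stumps[[m]], d, type = "class")))
  err  <- sum(w * (pred != y)) / sum(w)                   # weighted error rate
  alpha[m] <- 0.5 * log((1 - err) / err)                  # weight of this stump
  w <- w * exp(alpha[m] * (pred != y)); w <- w / sum(w)   # emphasise mistakes
}
# Final prediction: sign of the alpha-weighted sum of stump votes
scores <- Reduce(`+`, lapply(1:M, function(m)
  alpha[m] * as.numeric(as.character(predict(stumps[[m]], d, type = "class")))))
mean(sign(scores) == y)                                   # training accuracy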

Gradient boosting

  • Fits learners to residuals of previous models.
  • Optimises step size to minimise loss.
  • Key hyperparameters: learning rate, number of estimators (trees), tree depth (see the sketch below).
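
In the gbm package these map onto shrinkage, n.trees and interaction.depth, and cross-validation can choose the number of trees (a sketch on simulated data; the parameter values are illustrative, not recommendations):

library(gbm)
set.seed(21)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - d$x2))
gbm_fit <- gbm(y ~ x1 + x2, data = d, distribution = "bernoulli",
               n.trees = 1000,           # fit more trees than needed ...
               shrinkage = 0.01,         # ... with a small learning rate
               interaction.depth = 2,    # shallow trees as weak learners
               cv.folds = 5, verbose = FALSE)
best_iter <- gbm.perf(gbm_fit, method = "cv", plot.it = FALSE)   # CV-chosen n.trees
head(predict(gbm_fit, d, n.trees = best_iter, type = "response"))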

XGBoost

  • Enhances Gradient Boosting with regularisation and parallelism.
  • Second-order optimisation (gradient and Hessian).
  • Hyperparameters: max depth, subsample, learning rate (see the sketch below).
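
In the R interface these appear as max_depth, subsample and eta, alongside regularisation terms such as lambda (a sketch on simulated data; the parameter values are illustrative only):

library(xgboost)
set.seed(22)
n <- 400
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- rbinom(n, 1, plogis(1.5 * X[, "x1"] - X[, "x2"]))
xgb_fit <- xgboost(data = X, label = y, nrounds = 200, verbose = 0,
                   objective = "binary:logistic",
                   eta = 0.1,         # learning rate
                   max_depth = 3,     # tree depth
                   subsample = 0.8,   # row subsampling per tree
                   lambda = 1)        # L2 regularisation
head(predict(xgb_fit, X))   # predicted probabilities of class 1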

Stacking

Introduction

  • Combines diverse model predictions via a meta-model.
  • Cross-validation stacking prevents overfitting (see the sketch below).
  • Ideal for blending insights from various model types.
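
A minimal cross-validated stacking sketch (simulated data; a GLM and a random forest as base learners, with a logistic meta-model fitted to their out-of-fold predictions):

library(randomForest)
set.seed(31)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- factor(rbinom(n, 1, plogis(d$x1 - 0.5 * d$x2 + 0.25 * d$x3)))

K <- 5
fold <- sample(rep(1:K, length.out = n))
meta_x <- data.frame(glm_p = numeric(n), rf_p = numeric(n))   # out-of-fold predictions
for (k in 1:K) {
  tr <- d[fold != k, ]; te <- d[fold == k, ]
  g  <- glm(y ~ ., data = tr, family = binomial)
  rf <- randomForest(y ~ ., data = tr, ntree = 200)
  meta_x$glm_p[fold == k] <- predict(g, te, type = "response")
  meta_x$rf_p[fold == k]  <- predict(rf, te, type = "prob")[, "1"]
}
meta_x$y <- d$y
# The meta-model learns how to weight the two base learners' predictions
meta <- glm(y ~ glm_p + rf_p, data = meta_x, family = binomial)
summary(meta)$coefficients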

Deep Learning Approaches

Introduction

  • Deep learning automatically learns hierarchical patterns.
  • Models covered: CNNs, RNNs/LSTMs, Transformers.

CNNs for visual analysis

  • Specialised for spatial patterns in images.
  • Components: convolutional layers, pooling, fully connected layers.
  • Example use: event detection in sports footage (a minimal architecture sketch follows).
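
A minimal CNN definition in the keras R package shows how these components stack (a sketch only, assuming keras is installed and configured; the 64 x 64 RGB input size and layer sizes are arbitrary):

library(keras)
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 16, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(64, 64, 3)) %>%    # learn local spatial filters
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%    # downsample feature maps
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%                              # flatten to a feature vector
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")   # e.g. event / no event
model %>% compile(optimizer = "adam", loss = "binary_crossentropy",
                  metrics = "accuracy")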

RNNs and LSTMs for sequential data

  • Handle sequential data, capturing dependencies over time.
  • LSTMs solve vanishing gradient problem with gating mechanisms.
  • Example: forecasting player movements.

Attention mechanisms and transformers

  • Address limitations of sequential models.
  • Self-attention allows simultaneous processing.
  • Effective for complex interactions and long-range dependencies.

Demonstration

Preparation

Re-create the dataset from the previous practical and run those models (GLM, Lasso, Random Forest, SVM) again.

set.seed(123)  # For reproducibility
n <- 500  # Number of matches

# Simulate team skill levels (values between 50 and 100)
teamA_skill <- round(runif(n, 50, 100),1)
teamB_skill <- round(runif(n, 50, 100),1)

# Simulate home advantage (0 or 1) and match importance (categorical)
home_advantage <- rbinom(n, 1, 0.5)
match_importance <- factor(sample(c("low", "medium", "high"), n, replace = TRUE))

# Map match importance to a numeric effect
importance_effect <- ifelse(match_importance == "low", 0,
                       ifelse(match_importance == "medium", 0.3, 0.6))

# Define a linear predictor that favours Team A based on skills, home advantage, and match importance
lin_pred <- (teamA_skill - teamB_skill) / 10 + 0.5 * home_advantage + importance_effect
prob <- 1 / (1 + exp(-lin_pred))  # Logistic transformation

# Generate binary outcome (1 = Team A wins, 0 = loses)
result <- rbinom(n, 1, prob)

# Create the data frame
sports_data <- data.frame(
  teamA_skill,
  teamB_skill,
  home_advantage = factor(home_advantage),
  match_importance,
  result = factor(result)
)


set.seed(456)
train_indices <- sample(1:n, size = round(0.7 * n))
train_data <- sports_data[train_indices, ]
test_data <- sports_data[-train_indices, ]


## Generalised Linear Model (GLM)

glm_model <- glm(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
                 data = train_data, family = binomial)
summary(glm_model)

Call:
glm(formula = result ~ teamA_skill + teamB_skill + home_advantage + 
    match_importance, family = binomial, data = train_data)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             1.16348    0.97997   1.187  0.23512    
teamA_skill             0.10117    0.01297   7.802 6.07e-15 ***
teamB_skill            -0.10925    0.01293  -8.450  < 2e-16 ***
home_advantage1         0.77300    0.29104   2.656  0.00791 ** 
match_importancelow    -0.51460    0.35557  -1.447  0.14783    
match_importancemedium -0.82500    0.35909  -2.297  0.02159 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 479.66  on 349  degrees of freedom
Residual deviance: 301.69  on 344  degrees of freedom
AIC: 313.69

Number of Fisher Scoring iterations: 5
# Prediction on test data
glm_preds <- predict(glm_model, test_data, type = "response")
glm_class <- ifelse(glm_preds > 0.5, 1, 0)
glm_accuracy <- mean(glm_class == as.numeric(as.character(test_data$result)))
cat("GLM Accuracy:", glm_accuracy, "\n")
GLM Accuracy: 0.7866667 
## Regularised Linear Model (Lasso)


library(glmnet)
# Create model matrices with dummy variables for factors
x_train <- model.matrix(result ~ teamA_skill + teamB_skill + home_advantage + match_importance, train_data)[,-1]
y_train <- as.numeric(as.character(train_data$result))
x_test <- model.matrix(result ~ teamA_skill + teamB_skill + home_advantage + match_importance, test_data)[,-1]

# Cross-validation to select lambda (regularisation strength)
cv_glmnet <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
glmnet_preds <- predict(cv_glmnet, newx = x_test, type = "response", s = "lambda.min")
glmnet_class <- ifelse(glmnet_preds > 0.5, 1, 0)
glmnet_accuracy <- mean(glmnet_class == as.numeric(as.character(test_data$result)))
cat("Regularised GLM (Lasso) Accuracy:", glmnet_accuracy, "\n")
Regularised GLM (Lasso) Accuracy: 0.78 
## Random Forest

library(randomForest)
rf_model <- randomForest(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
                         data = train_data, ntree = 100)
rf_preds <- predict(rf_model, test_data)
rf_accuracy <- mean(rf_preds == test_data$result)
cat("Random Forest Accuracy:", rf_accuracy, "\n")
Random Forest Accuracy: 0.8 
## Support Vector Machines (SVM)

library(e1071)
svm_model <- svm(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
                 data = train_data, probability = TRUE)
svm_preds <- predict(svm_model, test_data, probability = TRUE)
svm_accuracy <- mean(svm_preds == test_data$result)
cat("SVM Accuracy:", svm_accuracy, "\n")
SVM Accuracy: 0.7533333 

Bagging (Bootstrap Aggregating)

Bagging trains multiple models on bootstrapped samples of the data and aggregates their predictions.

library(ipred)
bag_model <- bagging(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
                     data = train_data, nbagg = 50)
bag_preds <- predict(bag_model, test_data, type = "class")
bag_accuracy <- mean(bag_preds == test_data$result)
cat("Bagging Accuracy:", bag_accuracy, "\n")
Bagging Accuracy: 0.7866667 

By averaging predictions from many bootstrapped models, bagging reduces the variance and improves stability of the prediction compared to a single model.

Boosting (GBM)

Boosting builds models sequentially, with each model focusing on the errors of the previous ones.

library(gbm)
# Convert result to numeric for gbm
train_data$numeric_result <- as.numeric(as.character(train_data$result))
gbm_model <- gbm(numeric_result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
                 data = train_data, distribution = "bernoulli", n.trees = 100, interaction.depth = 3, verbose = FALSE)
gbm_preds <- predict(gbm_model, test_data, n.trees = 100, type = "response")
gbm_class <- ifelse(gbm_preds > 0.5, 1, 0)
gbm_accuracy <- mean(gbm_class == as.numeric(as.character(test_data$result)))
cat("GBM Boosting Accuracy:", gbm_accuracy, "\n")
GBM Boosting Accuracy: 0.7866667 

Boosting corrects the mistakes of earlier models by placing greater emphasis on misclassified cases. GBM (Gradient Boosting Machine) combines these weak learners into a strong model.

XGBoost

XGBoost is an advanced implementation of gradient boosting that provides regularisation and efficiency benefits.

library(xgboost)
train_matrix <- model.matrix(result ~ teamA_skill + teamB_skill + home_advantage + match_importance, train_data)[,-1]
train_label <- as.numeric(as.character(train_data$result))
test_matrix <- model.matrix(result ~ teamA_skill + teamB_skill + home_advantage + match_importance, test_data)[,-1]

xgb_model <- xgboost(data = train_matrix, label = train_label,
                     objective = "binary:logistic", nrounds = 50, verbose = 0)
xgb_preds <- predict(xgb_model, test_matrix)
xgb_class <- ifelse(xgb_preds > 0.5, 1, 0)
xgb_accuracy <- mean(xgb_class == as.numeric(as.character(test_data$result)))
cat("XGBoost Accuracy:", xgb_accuracy, "\n")
XGBoost Accuracy: 0.78 

Model Accuracy Comparison

Finally, we summarise the accuracy results of all models on the test set.

accuracy_results <- data.frame(
  Model = c("GLM", "Regularised GLM (Lasso)", "Random Forest", "Bagging", 
            "GBM Boosting", "SVM", "XGBoost"),
  Accuracy = c(glm_accuracy, glmnet_accuracy, rf_accuracy, bag_accuracy, 
               gbm_accuracy, svm_accuracy, xgb_accuracy)
)
print(accuracy_results)
                    Model  Accuracy
1                     GLM 0.7866667
2 Regularised GLM (Lasso) 0.7800000
3           Random Forest 0.8000000
4                 Bagging 0.7866667
5            GBM Boosting 0.7866667
6                     SVM 0.7533333
7                 XGBoost 0.7800000