Definition
Supervised learning involves training algorithms on labeled datasets; the model learns to predict outcomes from input-output examples.
Key Tasks
Primary tasks include classification (predicting discrete categories) and regression (predicting continuous values).
Training Data
Effective supervised learning depends on high-quality labeled data that accurately represent the relationships between inputs and outputs.
Algorithms
Common supervised learning algorithms include linear regression, logistic regression, support vector machines (SVMs), decision trees, and neural networks.
Model Evaluation
Models are evaluated using metrics such as accuracy, precision, recall, F1-score, Mean Squared Error (MSE), and ROC-AUC.
Bias-Variance Trade-off
Core challenge is balancing bias (errors from overly simplistic models) and variance (errors due to overly complex models).
Overfitting and Underfitting
Overfitting - a model learns the training data too closely and fails to generalise.
Underfitting - the model fails to capture key patterns in the data.
Model Selection and Hyperparameter Tuning
Cross-validation and hyperparameter optimisation (e.g., grid search, random search) are commonly used methods to enhance model performance and generalisation.
Ensemble Methods
Combining multiple models (e.g., bagging, boosting, stacking) can significantly improve predictive performance by reducing variance and improving stability.
Traditional statistics focuses on inference and significance testing.
Machine learning prioritises accurate prediction and data-driven validation.
ML techniques include regularisation and hyperparameter tuning.
Regularisation penalises model complexity.
This helps prevent overfitting in predictive models.
Adds penalties for overly complex solutions.
Maintains model interpretability and predictive reliability.
High dimensionality occurs when too many features lead to unstable models.
Can cause overfitting, reducing prediction accuracy.
Regularisation and feature selection help manage complexity.
Multicollinearity occurs when predictors are highly correlated.
Makes coefficient estimates unreliable.
Ridge and Lasso regression effectively manage multicollinearity.
Sports data often exhibits autocorrelation (sequential dependency).
Traditional methods struggle with time-dependent data.
ML techniques like recurrent neural networks explicitly model these dependencies.
A regularised linear model is a linear model fitted with an added penalty term.
The aim is to prevent overfitting by penalising complex models.
Ridge, Lasso, and Elastic Net are popular examples of these types of model.
Provide stable predictions for high-dimensional datasets.
‘Normal’ linear models minimise the sum of squared errors without constraints.
Regularised models (Ridge, Lasso, Elastic Net) add penalty terms to shrink coefficients.
As the number of features increases, regularisation helps prevent overfitting by controlling coefficient growth.
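In standard notation (a sketch, not drawn from the document's own code), ordinary least squares minimises only the residual sum of squares, whereas the regularised variants add a penalty controlled by \(\lambda\); the Elastic Net form below follows glmnet's parameterisation, with \(\alpha\) mixing the two penalties:
$$
\begin{aligned}
\text{OLS:} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 \\
\text{Ridge (L2):} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Lasso (L1):} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Elastic Net:} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right)
\end{aligned}
$$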
Ridge regression uses an L2 penalty to reduce coefficients without eliminating variables.
Retains all predictors, managing correlated variables effectively.
Enhances stability and generalisation of the model.
Shows how coefficients change as regularisation strength (lambda) increases.
Each line represents one feature’s coefficient path.
Demonstrates coefficient shrinkage, improving stability and reducing overfitting.
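A coefficient path plot of this kind can be produced with the glmnet package; the sketch below uses simulated data, and the object names (X, y, ridge_fit) are illustrative rather than taken from the original example.
library(glmnet)
set.seed(123)
# Simulated data: 100 observations, 10 features, outcome driven by the first two
X <- matrix(rnorm(100 * 10), ncol = 10)
y <- X[, 1] * 2 + X[, 2] * -1.5 + rnorm(100)
# Ridge regression: alpha = 0 gives the pure L2 penalty
ridge_fit <- glmnet(X, y, alpha = 0)
# Coefficient paths: each line is one feature's coefficient as lambda varies
plot(ridge_fit, xvar = "lambda", label = TRUE,
     main = "Ridge coefficient paths")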
Lasso regression applies an L1 penalty, shrinking some coefficients exactly to zero.
Performs automatic feature selection, simplifying the model.
Particularly useful with high-dimensional, redundant datasets.
Illustrates how coefficients shrink and some become exactly zero as regularisation (lambda) increases.
Clearly visualises automatic feature selection.
Helps identify important features by seeing when coefficients reach zero.
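The equivalent Lasso path (a sketch reusing the simulated X and y from the Ridge example; alpha = 1 selects the pure L1 penalty) makes the feature selection visible, with some coefficients reaching exactly zero:
# Lasso: alpha = 1 gives the pure L1 penalty (same X, y as the ridge sketch)
lasso_fit <- glmnet(X, y, alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE,
     main = "Lasso coefficient paths")
# Cross-validation to pick lambda; coefficients at lambda.min show which
# features survive (non-zero) and which are removed (exactly zero)
cv_lasso <- cv.glmnet(X, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")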
Elastic Net combines the Lasso (L1) and Ridge (L2) penalties.
Balances feature selection with coefficient stability.
Effective with highly correlated predictor variables.
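In glmnet, an alpha value strictly between 0 and 1 gives the Elastic Net; a minimal sketch, again reusing the simulated X and y from above:
# Elastic Net: alpha = 0.5 weights the L1 and L2 penalties equally
enet_fit <- cv.glmnet(X, y, alpha = 0.5)
coef(enet_fit, s = "lambda.min")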
GLMs extend linear regression to model diverse data types.
They introduce link functions to handle non-normal responses.
GLMs extend ordinary linear regression to allow for different types of response variables (e.g., binary, count data) by using a link function and a distribution from the exponential family (e.g., logistic regression for binary outcomes, Poisson regression for count data).
GLMs do not inherently penalise large numbers of predictors (unlike RLMs).
Model complexity in GLMs is usually managed via feature selection, hypothesis testing, or AIC/BIC model selection, rather than explicit penalties (as in RLMs).
Logistic regression is specifically designed for binary classification problems.
Predicts probability of events (e.g., injury likelihood).
ROC Curve displays model’s true positive rate vs false positive rate.
The Area Under Curve (AUC) measures overall predictive accuracy.
Higher AUC indicates better performance in distinguishing outcomes.
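As a small self-contained sketch (simulated data with an illustrative training-load predictor; the pROC package is assumed), a logistic regression and its ROC/AUC could be obtained like this:
library(pROC)
set.seed(123)
# Simulated binary outcome whose probability depends on one predictor
train_load <- rnorm(200, mean = 50, sd = 10)
injury <- rbinom(200, 1, plogis(-5 + 0.1 * train_load))
# Logistic regression models the log-odds of the event
logit_fit <- glm(injury ~ train_load, family = binomial)
# ROC curve and AUC computed from the fitted probabilities
roc_obj <- roc(injury, fitted(logit_fit))
plot(roc_obj, main = "Logistic Regression ROC")
auc(roc_obj)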

Poisson regression is ideal for count-based data (e.g., goals scored, tackles).
Assumes events occur independently at a constant average rate.
Useful for forecasting discrete outcomes (like match scores).

Plots actual versus predicted count values (e.g., goals scored).
Shows the quality of model predictions in real scenarios.
Ideally, points lie along the diagonal, indicating accurate predictions.
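A minimal Poisson GLM sketch with the actual-versus-predicted plot described above (the data are simulated and the goals framing is illustrative):
set.seed(123)
# Simulated count outcome (e.g., goals) depending on one covariate
df_pois <- data.frame(x = rnorm(100))
df_pois$goals <- rpois(100, lambda = exp(0.5 + 0.4 * df_pois$x))
# Poisson GLM with the canonical log link
pois_model <- glm(goals ~ x, family = poisson(link = "log"), data = df_pois)
df_pois$pred <- predict(pois_model, type = "response")
# Actual vs predicted counts; points near the diagonal indicate good predictions
plot(df_pois$pred, df_pois$goals, pch = 19,
     xlab = "Predicted count", ylab = "Actual count",
     main = "Poisson Regression: Actual vs Predicted")
abline(0, 1, col = "red")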
Gamma regression is suitable for continuous, positively skewed data.
Useful for modelling metrics like recovery times or energy use.
Handles increasing variance with increasing response values effectively.
set.seed(123)
df <- data.frame(x = rnorm(100))
# Generate gamma-distributed response, depending on x
df$y <- rgamma(100, shape = 2, rate = 1/(exp(0.5 + 0.2 * df$x)))
# Fit a Gamma GLM
gamma_model <- glm(y ~ x, family = Gamma(link = "log"), data = df)
df$pred <- predict(gamma_model, type = "response")
# Plot actual vs. predicted
plot(df$x, df$y, pch = 19, col = "blue",
     xlab = "x", ylab = "y", main = "Gamma Regression: Actual vs Predicted")
points(df$x, df$pred, col = "red", pch = 19)
legend("topleft", legend = c("Actual", "Predicted"),
       pch = 19, col = c("blue", "red"))
Hyperparameter tuning is crucial to optimise model performance.
Techniques include cross-validation, Bayesian optimisation, grid/random search.
Enhances generalisation and robustness of predictions.
Support Vector Machines (SVMs) are supervised learning models that find the optimal hyperplane to maximise the margin between classes in high-dimensional space, using kernel functions to handle non-linearly separable data.
Imagine you have apples and oranges scattered on a table, and you want to draw a straight line to separate them.
A Support Vector Machine finds the best possible line that keeps the apples on one side and the oranges on the other, making sure there’s as much space as possible between them.
If the fruits are all mixed up, SVM can bend the line to still separate them nicely.
Effective for capturing complex, non-linear relationships.
Uses decision boundaries to separate classes.
Ideal for complex prediction tasks beyond linear approaches.
‘Kernels’ map data into higher-dimensional spaces.
They enable linear separation of previously inseparable data.
Polynomial and RBF kernels are common and effective.
One type of kernel is the polynomial kernel.
This captures complex non-linear interactions between variables.
Adjusting polynomial degree (\(d\)) impacts flexibility and risk of overfitting.
Visualises non-linear decision boundary separating two classes.
Demonstrates effectiveness of polynomial kernels for complex data patterns.
Highlights how decision boundary captures intricate relationships between features.
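A sketch of a polynomial-kernel SVM and its decision boundary using the e1071 package (the data are simulated and the degree value is illustrative):
library(e1071)
set.seed(123)
# Two-class data whose true boundary is non-linear (a circle)
df_svm <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df_svm$class <- factor(ifelse(df_svm$x1^2 + df_svm$x2^2 > 1.5, "A", "B"))
# Polynomial kernel; 'degree' controls the flexibility of the boundary
svm_poly <- svm(class ~ x1 + x2, data = df_svm,
                kernel = "polynomial", degree = 3)
# Built-in plot of the fitted decision regions over the two features
plot(svm_poly, df_svm)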
The radial basis function (RBF) kernel is another form of kernel function.
Employs Gaussian transformations for flexible, dynamic modelling.
Suitable for intricate/detailed data.
Sensitive to parameter settings (gamma), requiring careful tuning.
Displays decision boundaries created using the Radial Basis Function kernel.
Illustrates flexibility in handling non-linear, complex data structures.
Helps visualise impact of kernel parameter tuning (gamma).
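The same simulated data with an RBF kernel (a sketch reusing df_svm from the polynomial example; the gamma value is illustrative):
# Radial basis function kernel; larger gamma gives a more flexible boundary
svm_rbf <- svm(class ~ x1 + x2, data = df_svm,
               kernel = "radial", gamma = 0.5)
plot(svm_rbf, df_svm)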
The kernel trick helps Support Vector Machines (SVMs) find decision boundaries for complex datasets without needing to directly transform the data into higher dimensions.
Imagine you’re trying to separate red and blue dots on a piece of paper, but they’re arranged in a way that no straight line can split them. For example, red dots might form a circle around blue dots.
One way to solve this would be to lift the paper into 3D space, like turning it into a dome, and then slice it with a flat plane.
If you drop the sliced paper back down, that straight cut looks like a curved boundary in 2D.
Decision trees are simple, interpretable models but can overfit.
Random Forests and Gradient Boosting enhance prediction robustness of decision trees.
Random forests use multiple decision trees.
Each tree is trained on a random subset of the data (bootstrapping).
Each split in a tree is based on a random subset of features.
Final predictions come from an ensemble vote.
The following code trains a random forest model:
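(The code block itself is not reproduced in this extract; the sketch below shows one way it could look, using simulated data with five features and two classes so that it matches the variable-importance plot and tree output that follow. The object names and exact values are illustrative.)
library(randomForest)
set.seed(123)
# Simulated classification data: five features (Feature.1 to Feature.5), two classes (A/B)
df_rf <- data.frame(matrix(rnorm(200 * 5), ncol = 5))
colnames(df_rf) <- paste0("Feature.", 1:5)
df_rf$Class <- factor(sample(c("A", "B"), 200, replace = TRUE))
# Train the forest and record variable importance
rf_model <- randomForest(Class ~ ., data = df_rf, ntree = 100, importance = TRUE)
# Ranked variable importance plot
varImpPlot(rf_model)
# Structure of the first tree in the forest
getTree(rf_model, k = 1, labelVar = TRUE)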
Displays ranked predictor variables based on their contribution to accuracy.
Helps identify key predictors influencing model outcomes.
Valuable for interpreting complex models and simplifying future analyses.
left daughter right daughter split var split point status prediction
1 2 3 Feature.4 -0.4625128 1 <NA>
2 4 5 Feature.5 0.1364994 1 <NA>
3 6 7 Feature.2 -1.5868402 1 <NA>
4 8 9 Feature.5 -0.4922513 1 <NA>
5 10 11 Feature.3 -1.5581840 1 <NA>
6 0 0 <NA> 0.0000000 -1 B
7 12 13 Feature.2 0.6691770 1 <NA>
8 0 0 <NA> 0.0000000 -1 B
9 14 15 Feature.4 -1.7206651 1 <NA>
10 0 0 <NA> 0.0000000 -1 A
11 0 0 <NA> 0.0000000 -1 B
12 16 17 Feature.3 -1.1501542 1 <NA>
13 18 19 Feature.1 -0.5229608 1 <NA>
14 20 21 Feature.5 -0.2887192 1 <NA>
15 0 0 <NA> 0.0000000 -1 A
16 22 23 Feature.1 0.8742761 1 <NA>
17 24 25 Feature.3 0.9827250 1 <NA>
18 26 27 Feature.5 -1.2995618 1 <NA>
19 0 0 <NA> 0.0000000 -1 A
20 0 0 <NA> 0.0000000 -1 A
21 0 0 <NA> 0.0000000 -1 B
22 0 0 <NA> 0.0000000 -1 B
23 0 0 <NA> 0.0000000 -1 A
24 28 29 Feature.2 0.1765166 1 <NA>
25 30 31 Feature.4 1.2443571 1 <NA>
26 0 0 <NA> 0.0000000 -1 B
27 0 0 <NA> 0.0000000 -1 A
28 32 33 Feature.4 1.8976717 1 <NA>
29 34 35 Feature.1 1.4439359 1 <NA>
30 0 0 <NA> 0.0000000 -1 B
31 0 0 <NA> 0.0000000 -1 A
32 36 37 Feature.1 -1.4124151 1 <NA>
33 0 0 <NA> 0.0000000 -1 B
34 38 39 Feature.3 0.3359271 1 <NA>
35 0 0 <NA> 0.0000000 -1 A
36 0 0 <NA> 0.0000000 -1 B
37 0 0 <NA> 0.0000000 -1 A
38 0 0 <NA> 0.0000000 -1 B
39 0 0 <NA> 0.0000000 -1 A
In a random forest, each tree is trained on a bootstrap sample (a random subset of the data, with replacement).
About one-third of the training data is left out for each tree.
These are called out-of-bag (OOB) samples.
These OOB samples act as a validation set, allowing us to estimate model performance without needing a separate test set.
The OOB error = average classification error on these OOB samples across all trees.
It provides an internal cross-validation, meaning you don’t need to set aside a test dataset.
It helps determine how many trees are needed before the model stabilises (i.e., when additional trees stop improving performance).
X-Axis: Number of Trees (ntree). Represents the number of trees in the random forest, starting at 1 and increasing to the total number specified (here ntree = 100).
Y-Axis: OOB Error Rate. The OOB error rate decreases as more trees are added and stabilises once additional trees no longer meaningfully improve the model. The black line is the overall OOB error, the red line is the error for class 1 (the first class in the dataset), and the green line is the error for class 2 (the second class).
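A plot of this kind comes directly from the fitted object; a minimal sketch, assuming the rf_model object from the earlier random forest sketch:
# OOB error (black) and per-class errors versus the number of trees
plot(rf_model, main = "Random Forest OOB Error")
legend("topright", legend = colnames(rf_model$err.rate), col = 1:3, lty = 1:3)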
library(ggplot2)
library(randomForest)
# Generate a 2-feature dataset
set.seed(123)
df <- data.frame(
Feature1 = rnorm(200),
Feature2 = rnorm(200),
Class = factor(sample(c("A", "B"), 200, replace=TRUE))
)
# Train a Random Forest with only 2 features
rf2 <- randomForest(Class ~ Feature1 + Feature2, data=df, ntree=100, mtry=1)
# Create a grid of points to predict across feature space
grid <- expand.grid(
Feature1 = seq(min(df$Feature1), max(df$Feature1), length=100),
Feature2 = seq(min(df$Feature2), max(df$Feature2), length=100)
)
# Predict class labels for the grid
grid$Prediction <- predict(rf2, newdata=grid)
# Plot decision boundary and original points
ggplot(df, aes(x=Feature1, y=Feature2)) +
  geom_tile(data=grid, aes(fill=Prediction), alpha=0.3) +
  geom_point(aes(color=Class), size=2) +
  labs(title="Random Forest Decision Boundary") +
  theme_minimal()
The figure is a 2D classification plot with three key visual elements:
Background Color (Decision Boundary)
The background is shaded according to the predicted class (A or B).
The random forest divides the feature space into regions where each class is most likely to be predicted.
This shows the decision boundary of the random forest model.
Points (Original Data)
The scatter points represent the actual training data.
Each point is coloured based on its true class label (Class A or Class B).
This allows us to see whether the predicted decision boundary aligns well with the actual data.
Shading (Prediction Confidence)
The background colour is determined by the predictions from the random forest model.
Areas where Feature1 and Feature2 lead to the same predicted class are filled with that class’s colour.
The sharper or more irregular the decision boundary, the more complex the model’s decision regions.
Gradient Boosting is an ensemble learning method that builds sequential decision trees, where each tree corrects the mistakes of the previous one.
Unlike Random Forests (which train trees in parallel), GBM trees are built iteratively, focusing on reducing the residual errors from previous trees.
library(gbm)
set.seed(123)
# Synthetic regression data
df <- data.frame(matrix(rnorm(500), ncol=5))
colnames(df) <- paste0("feature", 1:5)
df$y <- rnorm(100)
# Fit a Gradient Boosting Model
gbm_model <- gbm(
  formula = y ~ .,
  data = df,
  distribution = "gaussian",
  n.trees = 100,
  interaction.depth = 2,
  shrinkage = 0.1,
  verbose = FALSE
)
n.trees = 100 → Builds 100 boosting iterations (trees).
interaction.depth = 2 → Limits each tree to 2 levels deep (restricts complexity).
shrinkage = 0.1 → Learning rate controls how much each tree contributes (prevents overfitting).
The PDP shows how the predicted outcome (y) changes as we vary a single feature (feature1), averaging over all other features in the dataset.
X-axis (feature1): The range of values for feature1 in the dataset.
Y-axis (Partial Dependence Score): The average effect of feature1 on the predicted target y, while holding all other features constant.
Interpretation
If the plot shows an upward trend, feature1 has positive relationship with y (i.e., increasing feature1 increases predicted value).
If the plot shows a downward trend, feature1 has negative impact (i.e., increasing feature1 decreases the predicted value).
If the plot is flat, feature1 does not significantly influence predictions.
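The partial dependence plot for feature1 can be drawn straight from the fitted gbm object; a minimal sketch using the gbm_model defined above:
# Partial dependence of the prediction on feature1, averaging over the other features
plot(gbm_model, i.var = "feature1", n.trees = 100)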
XGBoost is a high-performance gradient boosting method.
Handles missing values and complex variable interactions.
Widely used for injury risk, performance forecasting in sport.
library(xgboost)
set.seed(123)
data(agaricus.train, package='xgboost')
bst <- xgboost(
  data = agaricus.train$data,
  label = agaricus.train$label,
  nrounds = 20,
  objective = "binary:logistic",
  verbose = 0
)
importance_matrix <- xgb.importance(model = bst)
xgb.plot.importance(importance_matrix, main = "XGBoost Feature Importance")
Shows variables ranked by importance to the model's predictions.
Essential for understanding what drives predictive decisions (e.g., injury risk).
Guides analysts in refining models and focusing data collection.
Hyperparameter tuning is essential to avoid overfitting and underperformance.
Parameters: learning rate, tree depth, number of trees.
Fine-tuning through cross-validation ensures optimal performance.
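One common approach is k-fold cross-validation with xgb.cv; the sketch below evaluates a single candidate configuration on the agaricus data used earlier (the parameter values are illustrative, not recommendations):
# 5-fold cross-validation for one candidate parameter setting
cv_results <- xgb.cv(
  data = agaricus.train$data,
  label = agaricus.train$label,
  nrounds = 50,
  nfold = 5,
  objective = "binary:logistic",
  max_depth = 3,   # tree depth
  eta = 0.1,       # learning rate
  verbose = 0
)
# Inspect the per-round evaluation log to choose the number of rounds
head(cv_results$evaluation_log)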
Offers insights into model decisions through feature importance.
SHAP values provide individual prediction explanations.
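As a sketch, xgboost can plot SHAP-style contributions for the most important features of the bst model fitted above:
# SHAP contribution plots for the three most important features
xgb.plot.shap(data = agaricus.train$data, model = bst, top_n = 3, n_col = 3)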
A partial dependence plot demonstrates how a trained XGBoost model's predictions change as one feature varies.
What it shows:
How the model's predicted probability varies as feature1 changes, with other features held at their average values.
Highlights which values of feature1 increase (or decrease) the chance of a positive class.
Robust model validation ensures reliability and real-world applicability.
Combines cross-validation, precision-recall, and ROC curves.
Balances accuracy, interpretability, and computational efficiency.
Provides unbiased estimates of model performance.
Separately optimises hyperparameters and validates predictions.
Reduces overfitting, essential for sport analytics decisions.
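One way to put this into practice is caret's train() with a cross-validation control object; a generic sketch (the iris data and the random forest method are placeholders, not part of the original example):
library(caret)
set.seed(123)
# 5-fold cross-validation as the resampling scheme
ctrl <- trainControl(method = "cv", number = 5)
# Tune a random forest over a small grid and report resampled accuracy
cv_fit <- train(Species ~ ., data = iris, method = "rf",
                trControl = ctrl, tuneLength = 3)
print(cv_fit)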
Bayesian optimisation offers an intelligent and efficient hyperparameter search.
Uses prior results to inform next parameter choices.
Optimal when computation is costly or dataset large and complex.
ROC curves and AUC evaluate classifier performance independently of any single threshold.
Useful in sport analytics for binary decisions (injury, win/loss).
Provides robust, objective performance comparisons.
Example output: Area under the curve: 0.498 (an AUC this close to 0.5 indicates performance no better than chance).
ROC curve shows trade-off between true positives and false positives.
AUC summarises classifier’s predictive capability across all thresholds.
Essential for evaluating model performance, particularly for binary outcomes like injury prediction.
Begin by creating a synthetic dataset simulating match outcomes between two teams. Features include team skills, home advantage, and match importance. A binary outcome (win/lose) is generated using a logistic function.
set.seed(123) # For reproducibility
n <- 500 # Number of matches
# Simulate team skill levels (values between 50 and 100)
teamA_skill <- round(runif(n, 50, 100),1)
teamB_skill <- round(runif(n, 50, 100),1)
# Simulate home advantage (0 or 1) and match importance (categorical)
home_advantage <- rbinom(n, 1, 0.5)
match_importance <- factor(sample(c("low", "medium", "high"), n, replace = TRUE))
# Map match importance to a numeric effect
importance_effect <- ifelse(match_importance == "low", 0,
ifelse(match_importance == "medium", 0.3, 0.6))
# Define a linear predictor that favors Team A based on skills, home advantage, and match importance
lin_pred <- (teamA_skill - teamB_skill) / 10 + 0.5 * home_advantage + importance_effect
prob <- 1 / (1 + exp(-lin_pred)) # Logistic transformation
# Generate binary outcome (1 = Team A wins, 0 = loses)
result <- rbinom(n, 1, prob)
# Create the data frame
sports_data <- data.frame(
  teamA_skill,
  teamB_skill,
  home_advantage = factor(home_advantage),
  match_importance,
  result = factor(result)
)
head(sports_data)
  teamA_skill teamB_skill home_advantage match_importance result
1 64.4 67.7 0 high 0
2 89.4 68.3 1 medium 1
3 70.4 64.4 0 medium 1
4 94.2 54.0 1 medium 1
5 97.0 68.3 1 high 1
6 52.3 58.9 0 medium 0
For model evaluation, we split the dataset into training (70%) and testing (30%) sets.
Splitting the data ensures that we train our models on one subset and test their performance on unseen data, helping us evaluate their generalisation.
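The split itself is not shown in this extract; a minimal sketch that produces the train_data and test_data objects used below (a simple random 70/30 split; the original may have used a different splitting function):
set.seed(123)
# Randomly assign 70% of matches to training, the rest to testing
train_idx <- sample(seq_len(nrow(sports_data)), size = 0.7 * nrow(sports_data))
train_data <- sports_data[train_idx, ]
test_data <- sports_data[-train_idx, ]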
Generalised Linear Models (GLMs) extend linear regression to accommodate response variables that have non-normal distributions, such as binary or count data.
GLMs achieve flexibility by using a link function to relate the mean of the response variable to a linear combination of predictor variables.
The GLM (logistic regression in this case) models the log-odds of winning based on the predictors.
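The fitting code is not reproduced here; a sketch consistent with the model call in the summary output below (object names such as glm_model are illustrative, and the accuracy calculation uses the 0.5 threshold described afterwards):
# Fit the logistic regression on the training data
glm_model <- glm(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
                 family = binomial, data = train_data)
summary(glm_model)
# Predict win probabilities on the test set and classify at a 0.5 threshold
glm_probs <- predict(glm_model, newdata = test_data, type = "response")
glm_class <- ifelse(glm_probs > 0.5, 1, 0)
glm_accuracy <- mean(glm_class == as.numeric(as.character(test_data$result)))
cat("GLM Accuracy:", glm_accuracy, "\n")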
Call:
glm(formula = result ~ teamA_skill + teamB_skill + home_advantage +
match_importance, family = binomial, data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.16348 0.97997 1.187 0.23512
teamA_skill 0.10117 0.01297 7.802 6.07e-15 ***
teamB_skill -0.10925 0.01293 -8.450 < 2e-16 ***
home_advantage1 0.77300 0.29104 2.656 0.00791 **
match_importancelow -0.51460 0.35557 -1.447 0.14783
match_importancemedium -0.82500 0.35909 -2.297 0.02159 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 479.66 on 349 degrees of freedom
Residual deviance: 301.69 on 344 degrees of freedom
AIC: 313.69
Number of Fisher Scoring iterations: 5
GLM Accuracy: 0.7866667
The GLM estimates coefficients for each predictor. A probability threshold (0.5) is used to classify outcomes. This provides a baseline model for binary classification.
Lasso applies L1 regularisation, shrinking less important coefficients to zero for better feature selection.
library(glmnet)
# Create model matrices with dummy variables for factors
x_train <- model.matrix(result ~ teamA_skill + teamB_skill + home_advantage + match_importance, train_data)[,-1]
y_train <- as.numeric(as.character(train_data$result))
x_test <- model.matrix(result ~ teamA_skill + teamB_skill + home_advantage + match_importance, test_data)[,-1]
# Cross-validation to select lambda (regularisation strength)
cv_glmnet <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
glmnet_preds <- predict(cv_glmnet, newx = x_test, type = "response", s = "lambda.min")
glmnet_class <- ifelse(glmnet_preds > 0.5, 1, 0)
glmnet_accuracy <- mean(glmnet_class == as.numeric(as.character(test_data$result)))
cat("Regularised GLM (Lasso) Accuracy:", glmnet_accuracy, "\n")
Regularised GLM (Lasso) Accuracy: 0.78
Lasso reduces model complexity by penalising the absolute sum of coefficients, which helps in preventing overfitting and performing variable selection.
Random Forests build multiple decision trees and aggregate their predictions for improved accuracy.
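The fitting code is not shown in this extract; a sketch of how the accuracy below could be obtained (default randomForest settings; object names are illustrative):
library(randomForest)
set.seed(123)
# Train a random forest on the training set
rf_fit <- randomForest(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
                       data = train_data)
# Predict on the test set and compute classification accuracy
rf_preds <- predict(rf_fit, newdata = test_data)
rf_accuracy <- mean(rf_preds == test_data$result)
cat("Random Forest Accuracy:", rf_accuracy, "\n")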
Random Forest Accuracy: 0.8
SVM finds the optimal hyperplane that separates classes by maximising the margin between them.
SVM Accuracy: 0.7533333
SVM is effective in high-dimensional spaces. It transforms the data and finds a decision boundary that maximises the separation margin between classes, which helps improve classification performance.
library(caret)
# Train SVM model
library(e1071)
svm_model <- svm(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
data = train_data, probability = TRUE)
# Predict
svm_preds <- predict(svm_model, test_data, probability = TRUE)
# Confusion Matrix (predicted vs actual)
conf_mat <- confusionMatrix(svm_preds, test_data$result)
print(conf_mat)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 47 12
1 25 66
Accuracy : 0.7533
95% CI : (0.6764, 0.82)
No Information Rate : 0.52
P-Value [Acc > NIR] : 3.697e-09
Kappa : 0.5024
Mcnemar's Test P-Value : 0.04852
Sensitivity : 0.6528
Specificity : 0.8462
Pos Pred Value : 0.7966
Neg Pred Value : 0.7253
Prevalence : 0.4800
Detection Rate : 0.3133
Detection Prevalence : 0.3933
Balanced Accuracy : 0.7495
'Positive' Class : 0
library(ROCR)
# Obtain predicted probabilities for the "positive" class (assuming 2-class problem)
svm_probs <- attr(predict(svm_model, test_data, probability = TRUE), "probabilities")
# If positive class is second level in factor, e.g. 'Win'
predicted_probs <- svm_probs[,2]
# Create prediction
pred_obj <- prediction(predicted_probs, test_data$result)
# Performance object for TPR (true positive rate) vs FPR (false positive rate)
perf <- performance(pred_obj, "tpr", "fpr")
# Plot ROC
plot(perf, main = "SVM ROC Curve")
AUC: 0.8479345
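The AUC value reported above can be computed from the same ROCR prediction object; a minimal sketch:
# Area under the ROC curve from the ROCR prediction object
auc_perf <- performance(pred_obj, measure = "auc")
auc_value <- auc_perf@y.values[[1]]
cat("AUC:", auc_value, "\n")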
This shows how predictions change, on average, as each feature changes (while other features are held at typical values).

