Definition
Supervised learning involves training algorithms on labeled datasets; the model learns to predict outcomes from input-output examples.
Key Tasks
Primary tasks include classification (predicting discrete categories) and regression (predicting continuous values).
Training Data
Effective supervised learning depends on high-quality labeled data that accurately represent the relationships between inputs and outputs.
Algorithms
Common supervised learning algorithms include linear regression, logistic regression, support vector machines (SVMs), decision trees, and neural networks.
Model Evaluation
Models are evaluated using metrics such as accuracy, precision, recall, F1-score, Mean Squared Error (MSE), and ROC-AUC.
Bias-Variance Trade-off
Core challenge is balancing bias (errors from overly simplistic models) and variance (errors due to overly complex models).
Overfitting and Underfitting
Overfitting - a model learns the training data too closely and fails to generalise.
Underfitting - the model fails to capture key patterns in the data.
Model Selection and Hyperparameter Tuning
Cross-validation and hyperparameter optimisation (e.g., grid search, random search) are commonly used methods to enhance model performance and generalisation.
Ensemble Methods
Combining multiple models (e.g., bagging, boosting, stacking) can significantly improve predictive performance by reducing variance and improving stability.
Traditional statistics focuses on inference and significance testing.
Machine learning prioritises accurate prediction and data-driven validation.
ML techniques include regularisation and hyperparameter tuning.
Regularisation penalises model complexity.
This helps prevent overfitting in predictive models.
Adds penalties for overly complex solutions.
Maintains model interpretability and predictive reliability.
High dimensionality occurs when too many features lead to unstable models.
Can cause overfitting, reducing prediction accuracy.
Regularisation and feature selection help manage complexity.
Multicollinearity occurs when predictors are highly correlated.
Makes coefficient estimates unreliable.
Ridge and Lasso regression effectively manage multicollinearity.
Sports data often exhibits autocorrelation (sequential dependency).
Traditional methods struggle with time-dependent data.
ML techniques like recurrent neural networks explicitly model these dependencies.
A regularised linear model is a linear model fitted with an added penalty term.
The aim is to prevent overfitting by penalising complex models.
Ridge, Lasso, and Elastic Net are popular examples of these types of model.
Provide stable predictions for high-dimensional datasets.
‘Normal’ linear models minimise the sum of squared errors without constraints.
Regularised models (Ridge, Lasso, Elastic Net) add penalty terms to shrink coefficients.
As the number of features increases, regularisation helps prevent overfitting by controlling coefficient growth.
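In standard notation (a sketch, not drawn from the document's own code), ordinary least squares minimises only the residual sum of squares, whereas the regularised variants add a penalty controlled by \(\lambda\); the Elastic Net form below follows glmnet's parameterisation, with \(\alpha\) mixing the two penalties:
$$
\begin{aligned}
\text{OLS:} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 \\
\text{Ridge (L2):} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Lasso (L1):} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Elastic Net:} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right)
\end{aligned}
$$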
Ridge regression uses an L2 penalty to reduce coefficients without eliminating variables.
Retains all predictors, managing correlated variables effectively.
Enhances stability and generalisation of the model.
Shows how coefficients change as regularisation strength (lambda) increases.
Each line represents one feature’s coefficient path.
Demonstrates coefficient shrinkage, improving stability and reducing overfitting.
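A coefficient path plot of this kind can be produced with the glmnet package; the sketch below uses simulated data, and the object names (X, y, ridge_fit) are illustrative rather than taken from the original example.
library(glmnet)
set.seed(123)
# Simulated data: 100 observations, 10 features, outcome driven by the first two
X <- matrix(rnorm(100 * 10), ncol = 10)
y <- X[, 1] * 2 + X[, 2] * -1.5 + rnorm(100)
# Ridge regression: alpha = 0 gives the pure L2 penalty
ridge_fit <- glmnet(X, y, alpha = 0)
# Coefficient paths: each line is one feature's coefficient as lambda varies
plot(ridge_fit, xvar = "lambda", label = TRUE,
     main = "Ridge coefficient paths")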
Lasso regression applies an L1 penalty, shrinking some coefficients exactly to zero.
Performs automatic feature selection, simplifying the model.
Particularly useful with high-dimensional, redundant datasets.
Illustrates how coefficients shrink and some become exactly zero as regularisation (lambda) increases.
Clearly visualises automatic feature selection.
Helps identify important features by seeing when coefficients reach zero.
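The equivalent Lasso path (a sketch reusing the simulated X and y from the Ridge example; alpha = 1 selects the pure L1 penalty) makes the feature selection visible, with some coefficients reaching exactly zero:
# Lasso: alpha = 1 gives the pure L1 penalty (same X, y as the ridge sketch)
lasso_fit <- glmnet(X, y, alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE,
     main = "Lasso coefficient paths")
# Cross-validation to pick lambda; coefficients at lambda.min show which
# features survive (non-zero) and which are removed (exactly zero)
cv_lasso <- cv.glmnet(X, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")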
Elastic Net combines the Lasso (L1) and Ridge (L2) penalties.
Balances feature selection with coefficient stability.
Effective with highly correlated predictor variables.
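In glmnet, an alpha value strictly between 0 and 1 gives the Elastic Net; a minimal sketch, again reusing the simulated X and y from above:
# Elastic Net: alpha = 0.5 weights the L1 and L2 penalties equally
enet_fit <- cv.glmnet(X, y, alpha = 0.5)
coef(enet_fit, s = "lambda.min")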
GLMs extend linear regression to model diverse data types.
They introduce link functions to handle non-normal responses.
GLMs extend ordinary linear regression to allow for different types of response variables (e.g., binary, count data) by using a link function and a distribution from the exponential family (e.g., logistic regression for binary outcomes, Poisson regression for count data).
GLMs do not inherently penalise large numbers of predictors (unlike RLMs).
Model complexity in GLMs is usually managed via feature selection, hypothesis testing, or AIC/BIC model selection, rather than explicit penalties (as in RLMs).
Logistic regression is specifically designed for binary classification problems.
Predicts probability of events (e.g., injury likelihood).
ROC Curve displays model’s true positive rate vs false positive rate.
The Area Under Curve (AUC) measures overall predictive accuracy.
Higher AUC indicates better performance in distinguishing outcomes.
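As a small self-contained sketch (simulated data with an illustrative training-load predictor; the pROC package is assumed), a logistic regression and its ROC/AUC could be obtained like this:
library(pROC)
set.seed(123)
# Simulated binary outcome whose probability depends on one predictor
train_load <- rnorm(200, mean = 50, sd = 10)
injury <- rbinom(200, 1, plogis(-5 + 0.1 * train_load))
# Logistic regression models the log-odds of the event
logit_fit <- glm(injury ~ train_load, family = binomial)
# ROC curve and AUC computed from the fitted probabilities
roc_obj <- roc(injury, fitted(logit_fit))
plot(roc_obj, main = "Logistic Regression ROC")
auc(roc_obj)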

Poisson regression is ideal for count-based data (e.g., goals scored, tackles).
Assumes events occur independently at a constant average rate.
Useful for forecasting discrete outcomes (like match scores).

Plots actual versus predicted count values (e.g., goals scored).
Shows the quality of model predictions in real scenarios.
Ideally, points lie along the diagonal, indicating accurate predictions.
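A minimal Poisson GLM sketch with the actual-versus-predicted plot described above (the data are simulated and the goals framing is illustrative):
set.seed(123)
# Simulated count outcome (e.g., goals) depending on one covariate
df_pois <- data.frame(x = rnorm(100))
df_pois$goals <- rpois(100, lambda = exp(0.5 + 0.4 * df_pois$x))
# Poisson GLM with the canonical log link
pois_model <- glm(goals ~ x, family = poisson(link = "log"), data = df_pois)
df_pois$pred <- predict(pois_model, type = "response")
# Actual vs predicted counts; points near the diagonal indicate good predictions
plot(df_pois$pred, df_pois$goals, pch = 19,
     xlab = "Predicted count", ylab = "Actual count",
     main = "Poisson Regression: Actual vs Predicted")
abline(0, 1, col = "red")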
Gamma regression is suitable for continuous, positively skewed data.
Useful for modelling metrics like recovery times or energy use.
Handles increasing variance with increasing response values effectively.
set.seed(123)
df <- data.frame(x = rnorm(100))
# Generate gamma-distributed response, depending on x
df$y <- rgamma(100, shape = 2, rate = 1/(exp(0.5 + 0.2 * df$x)))
# Fit a Gamma GLM
gamma_model <- glm(y ~ x, family = Gamma(link = "log"), data = df)
df$pred <- predict(gamma_model, type = "response")
# Plot actual vs. predicted
plot(df$x, df$y, pch = 19, col = "blue",
     xlab = "x", ylab = "y", main = "Gamma Regression: Actual vs Predicted")
points(df$x, df$pred, col = "red", pch = 19)
legend("topleft", legend = c("Actual", "Predicted"),
       pch = 19, col = c("blue", "red"))
Hyperparameter tuning is crucial to optimise model performance.
Techniques include cross-validation, Bayesian optimisation, grid/random search.
Enhances generalisation and robustness of predictions.
Support Vector Machines (SVMs) are supervised learning models that find the optimal hyperplane to maximise the margin between classes in high-dimensional space, using kernel functions to handle non-linearly separable data.
Imagine you have apples and oranges scattered on a table, and you want to draw a straight line to separate them.
A Support Vector Machine finds the best possible line that keeps the apples on one side and the oranges on the other, making sure there’s as much space as possible between them.
If the fruits are all mixed up, SVM can bend the line to still separate them nicely.
Effective for capturing complex, non-linear relationships.
Uses decision boundaries to separate classes.
Ideal for complex prediction tasks beyond linear approaches.
‘Kernels’ map data into higher-dimensional spaces.
They enable linear separation of previously inseparable data.
Polynomial and RBF kernels are common and effective.
One type of kernel is the polynomial kernel.
This captures complex non-linear interactions between variables.
Adjusting polynomial degree (\(d\)) impacts flexibility and risk of overfitting.
Visualises non-linear decision boundary separating two classes.
Demonstrates effectiveness of polynomial kernels for complex data patterns.
Highlights how decision boundary captures intricate relationships between features.
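A sketch of a polynomial-kernel SVM and its decision boundary using the e1071 package (the data are simulated and the degree value is illustrative):
library(e1071)
set.seed(123)
# Two-class data whose true boundary is non-linear (a circle)
df_svm <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df_svm$class <- factor(ifelse(df_svm$x1^2 + df_svm$x2^2 > 1.5, "A", "B"))
# Polynomial kernel; 'degree' controls the flexibility of the boundary
svm_poly <- svm(class ~ x1 + x2, data = df_svm,
                kernel = "polynomial", degree = 3)
# Built-in plot of the fitted decision regions over the two features
plot(svm_poly, df_svm)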
The radial basis function (RBF) kernel is another form of kernel function.
Employs Gaussian transformations for flexible, dynamic modelling.
Suitable for intricate/detailed data.
Sensitive to parameter settings (gamma), requiring careful tuning.
Displays decision boundaries created using the Radial Basis Function kernel.
Illustrates flexibility in handling non-linear, complex data structures.
Helps visualise impact of kernel parameter tuning (gamma).
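The same simulated data with an RBF kernel (a sketch reusing df_svm from the polynomial example; the gamma value is illustrative):
# Radial basis function kernel; larger gamma gives a more flexible boundary
svm_rbf <- svm(class ~ x1 + x2, data = df_svm,
               kernel = "radial", gamma = 0.5)
plot(svm_rbf, df_svm)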
The kernel trick helps Support Vector Machines (SVMs) find decision boundaries for complex datasets without needing to directly transform the data into higher dimensions.
Imagine you’re trying to separate red and blue dots on a piece of paper, but they’re arranged in a way that no straight line can split them. For example, red dots might form a circle around blue dots.
One way to solve this would be to lift the paper into 3D space, like turning it into a dome, and then slice it with a flat plane.
If you drop the sliced paper back down, that straight cut looks like a curved boundary in 2D.
Decision trees are simple, interpretable models but can overfit.
Random Forests and Gradient Boosting enhance prediction robustness of decision trees.
Random forests use multiple decision trees.
Each tree is trained on a random subset of the data (bootstrapping).
Each split in a tree is based on a random subset of features.
Final predictions come from an ensemble vote.
The following code trains a random forest model:
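(The code block itself is not reproduced in this extract; the sketch below shows one way it could look, using simulated data with five features and two classes so that it matches the variable-importance plot and tree output that follow. The object names and exact values are illustrative.)
library(randomForest)
set.seed(123)
# Simulated classification data: five features (Feature.1 to Feature.5), two classes (A/B)
df_rf <- data.frame(matrix(rnorm(200 * 5), ncol = 5))
colnames(df_rf) <- paste0("Feature.", 1:5)
df_rf$Class <- factor(sample(c("A", "B"), 200, replace = TRUE))
# Train the forest and record variable importance
rf_model <- randomForest(Class ~ ., data = df_rf, ntree = 100, importance = TRUE)
# Ranked variable importance plot
varImpPlot(rf_model)
# Structure of the first tree in the forest
getTree(rf_model, k = 1, labelVar = TRUE)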
Displays ranked predictor variables based on their contribution to accuracy.
Helps identify key predictors influencing model outcomes.
Valuable for interpreting complex models and simplifying future analyses.
left daughter right daughter split var split point status prediction
1 2 3 Feature.4 -0.4625128 1 <NA>
2 4 5 Feature.5 0.1364994 1 <NA>
3 6 7 Feature.2 -1.5868402 1 <NA>
4 8 9 Feature.5 -0.4922513 1 <NA>
5 10 11 Feature.3 -1.5581840 1 <NA>
6 0 0 <NA> 0.0000000 -1 B
7 12 13 Feature.2 0.6691770 1 <NA>
8 0 0 <NA> 0.0000000 -1 B
9 14 15 Feature.4 -1.7206651 1 <NA>
10 0 0 <NA> 0.0000000 -1 A
11 0 0 <NA> 0.0000000 -1 B
12 16 17 Feature.3 -1.1501542 1 <NA>
13 18 19 Feature.1 -0.5229608 1 <NA>
14 20 21 Feature.5 -0.2887192 1 <NA>
15 0 0 <NA> 0.0000000 -1 A
16 22 23 Feature.1 0.8742761 1 <NA>
17 24 25 Feature.3 0.9827250 1 <NA>
18 26 27 Feature.5 -1.2995618 1 <NA>
19 0 0 <NA> 0.0000000 -1 A
20 0 0 <NA> 0.0000000 -1 A
21 0 0 <NA> 0.0000000 -1 B
22 0 0 <NA> 0.0000000 -1 B
23 0 0 <NA> 0.0000000 -1 A
24 28 29 Feature.2 0.1765166 1 <NA>
25 30 31 Feature.4 1.2443571 1 <NA>
26 0 0 <NA> 0.0000000 -1 B
27 0 0 <NA> 0.0000000 -1 A
28 32 33 Feature.4 1.8976717 1 <NA>
29 34 35 Feature.1 1.4439359 1 <NA>
30 0 0 <NA> 0.0000000 -1 B
31 0 0 <NA> 0.0000000 -1 A
32 36 37 Feature.1 -1.4124151 1 <NA>
33 0 0 <NA> 0.0000000 -1 B
34 38 39 Feature.3 0.3359271 1 <NA>
35 0 0 <NA> 0.0000000 -1 A
36 0 0 <NA> 0.0000000 -1 B
37 0 0 <NA> 0.0000000 -1 A
38 0 0 <NA> 0.0000000 -1 B
39 0 0 <NA> 0.0000000 -1 A
In a random forest, each tree is trained on a bootstrap sample (a random subset of the data, with replacement).
About one-third of the training data is left out for each tree.
These are called out-of-bag (OOB) samples.
These OOB samples act as a validation set, allowing us to estimate model performance without needing a separate test set.
The OOB error = average classification error on these OOB samples across all trees.
It provides an internal cross-validation, meaning you don’t need to set aside a test dataset.
It helps determine how many trees are needed before the model stabilises (i.e., when additional trees stop improving performance).
X-Axis: Number of Trees (ntree). Represents the number of trees in the random forest, starting at 1 and increasing to the total number specified (here ntree = 100).
Y-Axis: OOB Error Rate. The OOB error rate decreases as more trees are added and stabilises once additional trees no longer meaningfully improve the model. The black line is the overall OOB error, the red line is the error for class 1 (the first class in the dataset), and the green line is the error for class 2 (the second class).
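A plot of this kind comes directly from the fitted object; a minimal sketch, assuming the rf_model object from the earlier random forest sketch:
# OOB error (black) and per-class errors versus the number of trees
plot(rf_model, main = "Random Forest OOB Error")
legend("topright", legend = colnames(rf_model$err.rate), col = 1:3, lty = 1:3)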
library(ggplot2)
library(randomForest)
# Generate a 2-feature dataset
set.seed(123)
df <- data.frame(
Feature1 = rnorm(200),
Feature2 = rnorm(200),
Class = factor(sample(c("A", "B"), 200, replace=TRUE))
)
# Train a Random Forest with only 2 features
rf2 <- randomForest(Class ~ Feature1 + Feature2, data=df, ntree=100, mtry=1)
# Create a grid of points to predict across feature space
grid <- expand.grid(
Feature1 = seq(min(df$Feature1), max(df$Feature1), length=100),
Feature2 = seq(min(df$Feature2), max(df$Feature2), length=100)
)
# Predict class labels for the grid
grid$Prediction <- predict(rf2, newdata=grid)
# Plot decision boundary and original points
ggplot(df, aes(x=Feature1, y=Feature2)) +
  geom_tile(data=grid, aes(fill=Prediction), alpha=0.3) +
  geom_point(aes(color=Class), size=2) +
  labs(title="Random Forest Decision Boundary") +
  theme_minimal()
The figure is a 2D classification plot with three key visual elements:
Background Color (Decision Boundary)
The background is shaded according to the predicted class (A or B).
The random forest divides the feature space into regions where each class is most likely to be predicted.
This shows the decision boundary of the random forest model.
Points (Original Data)
The scatter points represent the actual training data.
Each point is coloured based on its true class label (Class A or Class B).
This allows us to see whether the predicted decision boundary aligns well with the actual data.
Shading (Prediction Confidence)
The background colour is determined by the predictions from the random forest model.
Areas where Feature1 and Feature2 lead to the same predicted class are filled with that class’s colour.
The sharper or more irregular the decision boundary, the more complex the model’s decision regions.
Gradient Boosting is an ensemble learning method that builds sequential decision trees, where each tree corrects the mistakes of the previous one.
Unlike Random Forests (which train trees in parallel), GBM trees are built iteratively, focusing on reducing the residual errors from previous trees.
library(gbm)
set.seed(123)
# Synthetic regression data
df <- data.frame(matrix(rnorm(500), ncol=5))
colnames(df) <- paste0("feature", 1:5)
df$y <- rnorm(100)
# Fit a Gradient Boosting Model
gbm_model <- gbm(
  formula = y ~ .,
  data = df,
  distribution = "gaussian",
  n.trees = 100,
  interaction.depth = 2,
  shrinkage = 0.1,
  verbose = FALSE
)
n.trees = 100 → Builds 100 boosting iterations (trees).
interaction.depth = 2 → Limits each tree to 2 levels deep (restricts complexity).
shrinkage = 0.1 → Learning rate controls how much each tree contributes (prevents overfitting).
The PDP shows how the predicted outcome (y) changes as we vary a single feature (feature1), averaging over all other features in the dataset.
X-axis (feature1): The range of values for feature1 in the dataset.
Y-axis (Partial Dependence Score): The average effect of feature1 on the predicted target y, while holding all other features constant.
Interpretation
If the plot shows an upward trend, feature1 has positive relationship with y (i.e., increasing feature1 increases predicted value).
If the plot shows a downward trend, feature1 has negative impact (i.e., increasing feature1 decreases the predicted value).
If the plot is flat, feature1 does not significantly influence predictions.
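The partial dependence plot for feature1 can be drawn straight from the fitted gbm object; a minimal sketch using the gbm_model defined above:
# Partial dependence of the prediction on feature1, averaging over the other features
plot(gbm_model, i.var = "feature1", n.trees = 100)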
XGBoost is a high-performance gradient boosting method.
Handles missing values and complex variable interactions.
Widely used for injury risk, performance forecasting in sport.
library(xgboost)
set.seed(123)
data(agaricus.train, package='xgboost')
bst <- xgboost(
  data = agaricus.train$data,
  label = agaricus.train$label,
  nrounds = 20,
  objective = "binary:logistic",
  verbose = 0
)
importance_matrix <- xgb.importance(model = bst)
xgb.plot.importance(importance_matrix, main = "XGBoost Feature Importance")
Shows variables ranked by importance to the model's predictions.
Essential for understanding what drives predictive decisions (e.g., injury risk).
Guides analysts in refining models and focusing data collection.
Hyperparameter tuning is essential to avoid overfitting and underperformance.
Parameters: learning rate, tree depth, number of trees.
Fine-tuning through cross-validation ensures optimal performance.
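One common approach is k-fold cross-validation with xgb.cv; the sketch below evaluates a single candidate configuration on the agaricus data used earlier (the parameter values are illustrative, not recommendations):
# 5-fold cross-validation for one candidate parameter setting
cv_results <- xgb.cv(
  data = agaricus.train$data,
  label = agaricus.train$label,
  nrounds = 50,
  nfold = 5,
  objective = "binary:logistic",
  max_depth = 3,   # tree depth
  eta = 0.1,       # learning rate
  verbose = 0
)
# Inspect the per-round evaluation log to choose the number of rounds
head(cv_results$evaluation_log)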
Offers insights into model decisions through feature importance.
SHAP values provide individual prediction explanations.
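As a sketch, xgboost can plot SHAP-style contributions for the most important features of the bst model fitted above:
# SHAP contribution plots for the three most important features
xgb.plot.shap(data = agaricus.train$data, model = bst, top_n = 3, n_col = 3)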
A partial dependence plot demonstrates how a trained XGBoost model's predictions change as one feature varies.
What it shows:
How the model's predicted probability varies as feature1 changes, with other features held at their average values.
Highlights which values of feature1 increase (or decrease) the chance of a positive class.
Robust model validation ensures reliability and real-world applicability.
Combines cross-validation, precision-recall, and ROC curves.
Balances accuracy, interpretability, and computational efficiency.
Provides unbiased estimates of model performance.
Separately optimises hyperparameters and validates predictions.
Reduces overfitting, essential for sport analytics decisions.
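One way to put this into practice is caret's train() with a cross-validation control object; a generic sketch (the iris data and the random forest method are placeholders, not part of the original example):
library(caret)
set.seed(123)
# 5-fold cross-validation as the resampling scheme
ctrl <- trainControl(method = "cv", number = 5)
# Tune a random forest over a small grid and report resampled accuracy
cv_fit <- train(Species ~ ., data = iris, method = "rf",
                trControl = ctrl, tuneLength = 3)
print(cv_fit)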
Bayesian optimisation offers an intelligent and efficient hyperparameter search.
Uses prior results to inform next parameter choices.
Optimal when computation is costly or dataset large and complex.
ROC curves and AUC evaluate classifier performance independently of any single threshold.
Useful in sport analytics for binary decisions (injury, win/loss).
Provides robust, objective performance comparisons.
Example output: Area under the curve: 0.498 (an AUC this close to 0.5 indicates performance no better than chance).
ROC curve shows trade-off between true positives and false positives.
AUC summarises classifier’s predictive capability across all thresholds.
Essential for evaluating model performance, particularly for binary outcomes like injury prediction.
Begin by creating a synthetic dataset simulating match outcomes between two teams. Features include team skills, home advantage, and match importance. A binary outcome (win/lose) is generated using a logistic function.
set.seed(123) # For reproducibility
n <- 500 # Number of matches
# Simulate team skill levels (values between 50 and 100)
teamA_skill <- round(runif(n, 50, 100),1)
teamB_skill <- round(runif(n, 50, 100),1)
# Simulate home advantage (0 or 1) and match importance (categorical)
home_advantage <- rbinom(n, 1, 0.5)
match_importance <- factor(sample(c("low", "medium", "high"), n, replace = TRUE))
# Map match importance to a numeric effect
importance_effect <- ifelse(match_importance == "low", 0,
ifelse(match_importance == "medium", 0.3, 0.6))
# Define a linear predictor that favors Team A based on skills, home advantage, and match importance
lin_pred <- (teamA_skill - teamB_skill) / 10 + 0.5 * home_advantage + importance_effect
prob <- 1 / (1 + exp(-lin_pred)) # Logistic transformation
# Generate binary outcome (1 = Team A wins, 0 = loses)
result <- rbinom(n, 1, prob)
# Create the data frame
sports_data <- data.frame(
  teamA_skill,
  teamB_skill,
  home_advantage = factor(home_advantage),
  match_importance,
  result = factor(result)
)
head(sports_data)
  teamA_skill teamB_skill home_advantage match_importance result
1 64.4 67.7 0 high 0
2 89.4 68.3 1 medium 1
3 70.4 64.4 0 medium 1
4 94.2 54.0 1 medium 1
5 97.0 68.3 1 high 1
6 52.3 58.9 0 medium 0
For model evaluation, we split the dataset into training (70%) and testing (30%) sets.
Splitting the data ensures that we train our models on one subset and test their performance on unseen data, helping us evaluate their generalisation.
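The split itself is not shown in this extract; a minimal sketch that produces the train_data and test_data objects used below (a simple random 70/30 split; the original may have used a different splitting function):
set.seed(123)
# Randomly assign 70% of matches to training, the rest to testing
train_idx <- sample(seq_len(nrow(sports_data)), size = 0.7 * nrow(sports_data))
train_data <- sports_data[train_idx, ]
test_data <- sports_data[-train_idx, ]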
Generalised Linear Models (GLMs) extend linear regression to accommodate response variables that have non-normal distributions, such as binary or count data.
GLMs achieve flexibility by using a link function to relate the mean of the response variable to a linear combination of predictor variables.
The GLM (logistic regression in this case) models the log-odds of winning based on the predictors.
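The fitting code is not reproduced here; a sketch consistent with the model call in the summary output below (object names such as glm_model are illustrative, and the accuracy calculation uses the 0.5 threshold described afterwards):
# Fit the logistic regression on the training data
glm_model <- glm(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
                 family = binomial, data = train_data)
summary(glm_model)
# Predict win probabilities on the test set and classify at a 0.5 threshold
glm_probs <- predict(glm_model, newdata = test_data, type = "response")
glm_class <- ifelse(glm_probs > 0.5, 1, 0)
glm_accuracy <- mean(glm_class == as.numeric(as.character(test_data$result)))
cat("GLM Accuracy:", glm_accuracy, "\n")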
Call:
glm(formula = result ~ teamA_skill + teamB_skill + home_advantage +
match_importance, family = binomial, data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.16348 0.97997 1.187 0.23512
teamA_skill 0.10117 0.01297 7.802 6.07e-15 ***
teamB_skill -0.10925 0.01293 -8.450 < 2e-16 ***
home_advantage1 0.77300 0.29104 2.656 0.00791 **
match_importancelow -0.51460 0.35557 -1.447 0.14783
match_importancemedium -0.82500 0.35909 -2.297 0.02159 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 479.66 on 349 degrees of freedom
Residual deviance: 301.69 on 344 degrees of freedom
AIC: 313.69
Number of Fisher Scoring iterations: 5
GLM Accuracy: 0.7866667
The GLM estimates coefficients for each predictor. A probability threshold (0.5) is used to classify outcomes. This provides a baseline model for binary classification.
Lasso applies L1 regularisation, shrinking less important coefficients to zero for better feature selection.
library(glmnet)
# Create model matrices with dummy variables for factors
x_train <- model.matrix(result ~ teamA_skill + teamB_skill + home_advantage + match_importance, train_data)[,-1]
y_train <- as.numeric(as.character(train_data$result))
x_test <- model.matrix(result ~ teamA_skill + teamB_skill + home_advantage + match_importance, test_data)[,-1]
# Cross-validation to select lambda (regularisation strength)
cv_glmnet <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
glmnet_preds <- predict(cv_glmnet, newx = x_test, type = "response", s = "lambda.min")
glmnet_class <- ifelse(glmnet_preds > 0.5, 1, 0)
glmnet_accuracy <- mean(glmnet_class == as.numeric(as.character(test_data$result)))
cat("Regularised GLM (Lasso) Accuracy:", glmnet_accuracy, "\n")
Regularised GLM (Lasso) Accuracy: 0.78
Lasso reduces model complexity by penalising the absolute sum of coefficients, which helps in preventing overfitting and performing variable selection.
Random Forests build multiple decision trees and aggregate their predictions for improved accuracy.
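The fitting code is not shown in this extract; a sketch of how the accuracy below could be obtained (default randomForest settings; object names are illustrative):
library(randomForest)
set.seed(123)
# Train a random forest on the training set
rf_fit <- randomForest(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
                       data = train_data)
# Predict on the test set and compute classification accuracy
rf_preds <- predict(rf_fit, newdata = test_data)
rf_accuracy <- mean(rf_preds == test_data$result)
cat("Random Forest Accuracy:", rf_accuracy, "\n")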
Random Forest Accuracy: 0.8
SVM finds the optimal hyperplane that separates classes by maximising the margin between them.
SVM Accuracy: 0.7533333
SVM is effective in high-dimensional spaces. It transforms the data and finds a decision boundary that maximises the separation margin between classes, which helps improve classification performance.
library(caret)
# Train SVM model
library(e1071)
svm_model <- svm(result ~ teamA_skill + teamB_skill + home_advantage + match_importance,
data = train_data, probability = TRUE)
# Predict
svm_preds <- predict(svm_model, test_data, probability = TRUE)
# Confusion Matrix (predicted vs actual)
conf_mat <- confusionMatrix(svm_preds, test_data$result)
print(conf_mat)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 47 12
1 25 66
Accuracy : 0.7533
95% CI : (0.6764, 0.82)
No Information Rate : 0.52
P-Value [Acc > NIR] : 3.697e-09
Kappa : 0.5024
Mcnemar's Test P-Value : 0.04852
Sensitivity : 0.6528
Specificity : 0.8462
Pos Pred Value : 0.7966
Neg Pred Value : 0.7253
Prevalence : 0.4800
Detection Rate : 0.3133
Detection Prevalence : 0.3933
Balanced Accuracy : 0.7495
'Positive' Class : 0
library(ROCR)
# Obtain predicted probabilities for the "positive" class (assuming 2-class problem)
svm_probs <- attr(predict(svm_model, test_data, probability = TRUE), "probabilities")
# If positive class is second level in factor, e.g. 'Win'
predicted_probs <- svm_probs[,2]
# Create prediction
pred_obj <- prediction(predicted_probs, test_data$result)
# Performance object for TPR (true positive rate) vs FPR (false positive rate)
perf <- performance(pred_obj, "tpr", "fpr")
# Plot ROC
plot(perf, main = "SVM ROC Curve")
AUC: 0.8479345
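The AUC value reported above can be computed from the same ROCR prediction object; a minimal sketch:
# Area under the ROC curve from the ROCR prediction object
auc_perf <- performance(pred_obj, measure = "auc")
auc_value <- auc_perf@y.values[[1]]
cat("AUC:", auc_value, "\n")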
This shows how predictions change, on average, as each feature changes (while other features are held at typical values).

