ARIMA Models

B1705, Week Eight

Introduction

Time series analysis

  • Time series analysis is the study of data points collected or recorded at successive time intervals.
  • Many real-world datasets follow temporal patterns.
  • Stationarity is critical: data should have a constant mean and variance over time.

Forecasting models

  • ARIMA (AutoRegressive Integrated Moving Average) is a widely used forecasting model for time series data.
  • Other forecasting models include State-Space models (e.g., Kalman Filters), but ARIMA remains standard due to its effectiveness.

Original data

From this series, we'd like to predict how many goals might be scored in future.

ARIMA Models

ARIMA Components

  • Autoregression (AR): Incorporates past values to predict future values.

  • Integration (I): Differencing is used to make a series stationary.

  • Moving Average (MA): Uses past error terms to refine predictions.

Autoregression (AR)

  • The Autoregressive (AR) part of ARIMA uses past values of the time series to predict future values.
  • Represented as AR(\(p\)), where \(p\) is the number of past observations used for forecasting.

Autoregression (AR)

Example: an AR(1) model predicts the current value from a weighted version of the previous value plus random error.
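
In standard notation, the AR(1) model can be written as \(y_t = c + \phi_1 y_{t-1} + \varepsilon_t\), where \(\phi_1\) weights the previous observation and \(\varepsilon_t\) is white noise.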

Visualisation

We can overlay the original series (blue) with its lag (red) to show the relationship between past and present values.

library(ggplot2)

# Create a lag-1 copy of the series (assumes ts_data is a ts object)
ts_lagged <- stats::lag(ts_data, -1)
ggplot() + 
  geom_line(aes(x = time(ts_data), y = ts_data), color = "blue") +
  geom_line(aes(x = time(ts_lagged), y = ts_lagged), color = "red", linetype = "dashed") +
  ggtitle("Original vs Lagged Data") + xlab("Time") + ylab("Goals Scored")

Integration (I) Component

  • ‘Integration’ refers to the number of times the data needs to be differenced to achieve stationarity.

  • Remember: differencing removes trends and stabilises the mean, making ARIMA applicable.

  • Represented as I(\(d\)), where \(d\) is the number of times differencing is applied.
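
Concretely, first-order differencing replaces each value with its change from the previous one: \(y'_t = y_t - y_{t-1}\). Second-order differencing (\(d = 2\)) applies the same operation to the already-differenced series.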

Check for stationarity

Code
library(ggplot2)
library(forecast)
library(tseries)


# Perform Augmented Dickey-Fuller (ADF) test
adf_result <- adf.test(ts_data, alternative = "stationary")

# Print test result
print(adf_result)

    Augmented Dickey-Fuller Test

data:  ts_data
Dickey-Fuller = -2.4407, Lag order = 4, p-value = 0.3935
alternative hypothesis: stationary

Note: If the ADF test fails to reject non-stationarity (p > 0.05), differencing is required.

Need to apply differencing

# Compute first-order differencing
ts_diff <- diff(ts_data)

# Plot differenced data
autoplot(ts_diff) + ggtitle("Integrated (I) Component - Differenced Data")

# Perform Augmented Dickey-Fuller (ADF) test
adf_result <- adf.test(ts_diff, alternative = "stationary")

# Print test result
print(adf_result)

    Augmented Dickey-Fuller Test

data:  ts_diff
Dickey-Fuller = -4.3961, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary

Additional Visualisation

Compare the original and differenced series to see how differencing removes trends.

Code
ggplot() +
  geom_line(aes(x = time(ts_data), y = ts_data), color = "blue") +
  geom_line(aes(x = time(ts_diff), y = ts_diff), color = "red", linetype = "dashed") +
  ggtitle("Original vs Differenced Data") + xlab("Time") + ylab("Goals Scored")

Moving Average (MA) Component

  • The Moving Average (MA) component uses past forecast errors to correct future forecasts.
  • Represented as MA(\(q\)), where \(q\) is the number of past errors included in the model.
  • Helps to smooth out short-term fluctuations.
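
For example, an MA(1) model can be written as \(y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1}\), where \(\theta_1\) weights the previous period's forecast error.
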
Code
# Example of residuals plot
fit <- auto.arima(ts_data)
res <- residuals(fit)
autoplot(res) + ggtitle("Moving Average (MA) Component - Residuals")

Note: If the ACF cuts off sharply after lag \(q\), an MA(\(q\)) component is suggested.

ACF vs PACF Visualisations

  • ACF measures correlation between observations at different lags, helping to identify the order of MA terms.
  • PACF measures the correlation between an observation and a given lag after removing the effects of the intervening lags, helping to identify AR terms.
# ACF and PACF plots
ggAcf(ts_diff) + ggtitle("ACF: Differenced Data")
ggPacf(ts_diff) + ggtitle("PACF: Differenced Data")

Additional Visualisation

Highlight significant lags in ACF/PACF plots.

# Dashed lines at ±0.2 approximate the 95% significance bounds (±1.96/sqrt(n))
ggAcf(ts_diff) + geom_hline(yintercept = c(-0.2, 0.2), linetype = "dashed", color = "red") + ggtitle("ACF with Significance Lines")
ggPacf(ts_diff) + geom_hline(yintercept = c(-0.2, 0.2), linetype = "dashed", color = "red") + ggtitle("PACF with Significance Lines")

Summary

  • ARIMA is a powerful tool for forecasting non-seasonal time series.
  • It consists of three components:
    • Autoregression (AR): Uses past values to predict future values.
    • Integration (I): Differencing is applied to make data stationary.
    • Moving Average (MA): Past errors are used to refine forecasts.

Summary

  • Model terms (\(p\), \(d\), \(q\)) are identified using ACF and PACF.
  • Residual analysis ensures the model is well-fitted.
  • ARIMA can be extended to SARIMA for seasonal data or ARIMAX for models with external (exogenous) regressors.
  • Automated model selection using auto.arima() simplifies forecasting.

Extended Example



Step 1: Visualising the Time Series

autoplot(ts_data) + ggtitle("Monthly Goals Scored Over Time")
  • Look for trends, seasonality, or patterns that indicate the need for differencing.
  • If the mean or variance changes over time, the series is not stationary.

Step 2: Checking for Stationarity (Dickey-Fuller Test)

adf.test(ts_data)

    Augmented Dickey-Fuller Test

data:  ts_data
Dickey-Fuller = -7.8448, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
  • If p-value > 0.05, the data is non-stationary, and differencing is required.
  • Here p = 0.01, so the test already indicates stationarity; we nonetheless apply first-order differencing to illustrate the workflow.

Step 3: Applying Differencing (Integration Component)

ts_diff <- diff(ts_data)
autoplot(ts_diff) + ggtitle("First-Order Differenced Time Series")

Re-test stationarity after differencing:

adf.test(ts_diff)

    Augmented Dickey-Fuller Test

data:  ts_diff
Dickey-Fuller = -9.9208, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
  • If p-value < 0.05, the data is now stationary. Otherwise, try second-order differencing.
ts_diff2 <- diff(ts_diff)
autoplot(ts_diff2) + ggtitle("Second-Order Differenced Time Series")

Step 4: Identifying AR and MA Terms (ACF & PACF)

  • Autocorrelation Function (ACF) helps determine the number of MA (\(q\)) terms.
  • Partial Autocorrelation Function (PACF) helps determine the number of AR (\(p\)) terms.
ggAcf(ts_diff) + ggtitle("ACF: Differenced Data")
ggPacf(ts_diff) + ggtitle("PACF: Differenced Data")
  • If ACF cuts off after lag \(q\), use MA(\(q\)).

  • If PACF cuts off after lag \(p\), use AR(\(p\)).

  • If both decay slowly, more differencing may be needed.

Step 5: Selecting the Best ARIMA Model

  • Use auto.arima() to automatically select \(p\), \(d\), and \(q\) based on AIC/BIC.
Code
fit <- auto.arima(ts_data)
summary(fit)
Series: ts_data 
ARIMA(1,0,0)(2,1,0)[12] with drift 

Coefficients:
         ar1     sar1     sar2   drift
      0.7055  -0.7818  -0.3557  0.2954
s.e.  0.0685   0.0907   0.0969  0.0137

sigma^2 = 1.067:  log likelihood = -159.08
AIC=328.15   AICc=328.74   BIC=341.56

Training set error measures:
                      ME      RMSE       MAE        MPE     MAPE      MASE
Training set -0.01375009 0.9614931 0.7330779 -0.0949486 2.885972 0.2105412
                     ACF1
Training set -0.001185935

Using AIC and BIC

  • AIC (Akaike Information Criterion): Trades off goodness of fit against model complexity; lower values indicate a better model.
  • BIC (Bayesian Information Criterion): Similar, but penalises complexity more heavily to guard against overfitting.
AIC(fit)
[1] 328.1503
BIC(fit)
[1] 341.561

Step 6: Checking Model Residuals

  • Residuals should resemble white noise (random, normally distributed errors).
res <- residuals(fit)
autoplot(res) + ggtitle("Residuals Over Time")
ggAcf(res) + ggtitle("ACF of Residuals")

Ljung-Box Test for autocorrelation:

Box.test(res, type = "Ljung-Box", lag = 10)

    Box-Ljung test

data:  res
X-squared = 7.3297, df = 10, p-value = 0.694

Note: If the p-value > 0.05, the residuals show no significant autocorrelation (consistent with a good model fit).

Step 7: Forecasting Future Values

  • Generate a 12-month forecast:
future_forecast <- forecast(fit, h = 12)
autoplot(future_forecast) + ggtitle("Forecasted Monthly Goals")
  • Interpretation: Shaded region represents prediction intervals.
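
To read the intervals numerically, the forecast object can be coerced to a data frame (using the forecast package's as.data.frame method):

# Point forecasts with 80% and 95% prediction intervals
head(as.data.frame(future_forecast))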

Step 8: Evaluating Model Performance

  • RMSE (Root Mean Squared Error): Measures the typical size of forecast errors, in the units of the data.
  • MAPE (Mean Absolute Percentage Error): Measures the average error as a percentage of the actual values.
library(Metrics)
fitted_vals <- fitted(fit)
actual_vals <- as.numeric(ts_data)
rmse(actual_vals, fitted_vals)
[1] 0.9614931
# Note: Metrics::mape returns a proportion (multiply by 100 for a percentage)
mape(actual_vals, fitted_vals)
[1] 0.02885972
  • Lower values indicate better model performance.

Step 9: Alternative ARIMA Approaches

Seasonal ARIMA (SARIMA)

  • Handles seasonal patterns by adding seasonal \((P, D, Q)_s\) terms, where \(s\) is the seasonal period (12 for monthly data).
sarima_fit <- auto.arima(ts_data, seasonal = TRUE)
summary(sarima_fit)
Series: ts_data 
ARIMA(1,0,0)(2,1,0)[12] with drift 

Coefficients:
         ar1     sar1     sar2   drift
      0.7055  -0.7818  -0.3557  0.2954
s.e.  0.0685   0.0907   0.0969  0.0137

sigma^2 = 1.067:  log likelihood = -159.08
AIC=328.15   AICc=328.74   BIC=341.56

Training set error measures:
                      ME      RMSE       MAE        MPE     MAPE      MASE
Training set -0.01375009 0.9614931 0.7330779 -0.0949486 2.885972 0.2105412
                     ACF1
Training set -0.001185935
sarima_forecast <- forecast(sarima_fit, h = 12)
autoplot(sarima_forecast) + ggtitle("Seasonal ARIMA Forecast")

ARIMAX: Incorporating External Variables

# Simulate an external regressor (for illustration only)
economic_index <- rnorm(length(ts_data), mean = 50, sd = 10)
xreg_fit <- auto.arima(ts_data, xreg = economic_index)
summary(xreg_fit)
Series: ts_data 
Regression with ARIMA(1,0,0)(2,1,0)[12] errors 

Coefficients:
         ar1     sar1     sar2   drift    xreg
      0.7040  -0.8353  -0.3823  0.2958  0.0109
s.e.  0.0688   0.0947   0.0989  0.0130  0.0064

sigma^2 = 1.039:  log likelihood = -157.69
AIC=327.38   AICc=328.21   BIC=343.47

Training set error measures:
                      ME      RMSE       MAE       MPE     MAPE      MASE
Training set -0.01321115 0.9442364 0.7126028 -0.089611 2.820851 0.2046607
                     ACF1
Training set -0.007536587
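
Note that forecasting from an ARIMAX model requires future values of the regressor. A minimal sketch, holding the index at its historical mean (a placeholder assumption, not a forecast of the index itself):

# Supply assumed future values of the external regressor
future_index <- rep(mean(economic_index), 12)
xreg_forecast <- forecast(xreg_fit, xreg = future_index)
autoplot(xreg_forecast) + ggtitle("ARIMAX Forecast (Regressor Held at Its Mean)")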

Summary

  • Determine Stationarity: Use ADF test and differencing.
  • Identify AR and MA terms: Use ACF/PACF plots.
  • Select the Best Model: Use AIC/BIC.
  • Check Residuals: Ensure no autocorrelation.
  • Forecast Future Values: Use forecast().
  • Evaluate Performance: RMSE & MAPE.
  • Extend ARIMA: Use SARIMA for seasonality or ARIMAX for exogenous variables.

Other Uses for TSA

Anomaly Detection

  • Used in fraud detection and financial modelling.
  • Red dots highlight unusual events in the dataset.
  • Shows how ARIMA can be used beyond forecasting, to detect unexpected behaviour (see the sketch below).
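
A minimal sketch of the idea, reusing the fitted model from earlier and flagging points whose residuals exceed two standard deviations (an illustrative threshold, not a formal test):

# Flag observations with unusually large model residuals
res <- residuals(fit)
anomaly_idx <- which(abs(res) > 2 * sd(res))

autoplot(ts_data) +
  geom_point(aes(x = time(ts_data)[anomaly_idx],
                 y = as.numeric(ts_data)[anomaly_idx]),
             color = "red") +
  ggtitle("ARIMA-Based Anomaly Detection")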

ARIMA-Based Scenario Analysis

  • This compares a baseline forecast (the existing ARIMA model projected forward) with an improved-performance scenario.

  • It simulates what would happen if the team increased their goal-scoring rate by, say, 10%.
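
A minimal sketch of the comparison, applying the 10% uplift directly to the point forecasts (a simplification for illustration):

# Baseline forecast from the fitted ARIMA model
baseline <- forecast(fit, h = 12)

# Improved scenario: scale point forecasts up by 10%
improved <- baseline$mean * 1.10

autoplot(baseline) +
  autolayer(improved, series = "Improved (+10%)") +
  ggtitle("Baseline vs Improved Performance Scenario")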

Dynamic Win Probability Forecasting

  • We’ll forecast the probability of a team winning based on past performance, treating goal differentials as a time series.

What is Goal Differential and why use it?

  • Goal differential = Goals Scored - Goals Conceded
  • This metric is a strong indicator of team strength over time.
  • Instead of just looking at goals scored, we model relative dominance in matches.
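
As a construction sketch (goals_scored and goals_conceded are hypothetical per-match vectors):

# Build a goal-differential series; frequency = 12 assumes monthly aggregation
goal_diff <- ts(goals_scored - goals_conceded, frequency = 12)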

Applying ARIMA to forecast goal differentials

  • We assume goal differential follows a time-dependent structure, meaning past performances affect future outcomes.

  • ARIMA learns patterns in performance trends (e.g., a team improving or declining over time).

  • Forecasting goal differentials helps predict if a team is expected to maintain, improve, or decline.
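
A minimal sketch, fitting to the goal_diff series constructed above:

# Fit and forecast the goal-differential series
gd_fit <- auto.arima(goal_diff)
gd_forecast <- forecast(gd_fit, h = 12)
autoplot(gd_forecast) + ggtitle("Forecasted Goal Differential")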

Converting goal differential forecasts into win probabilities

  • We can use a normal distribution (pnorm) to estimate win probability.

  • If goal differential is positive, the team is expected to win.

  • The higher the forecasted goal differential, the greater the win probability.
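
One way to make this concrete, assuming forecast errors are roughly normal with standard deviation taken from the fitted model:

# P(win) ≈ P(goal differential > 0) under a normal forecast distribution
sigma <- sqrt(gd_fit$sigma2)
win_prob <- pnorm(as.numeric(gd_forecast$mean) / sigma)
round(win_prob, 2)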

Outcome

  • Transforms raw ARIMA predictions into “actionable insights”.

  • Instead of just forecasting a number, we translate it into real-world decision-making (e.g., “How likely is my team to win?”).

  • A dynamic probability curve shows expected chances of winning in future matches.

  • The blue line represents projected probabilities, while red points highlight key changes.

You could…

  • Add opponent strength: e.g., include other teams’ performance as external regressors (ARIMAX model).
  • Predict more than wins: extend to forecasting expected points per match or final league position.