Stationarity and Differencing

B1705, Week Seven

Overview

Introduction

  • Time series analysis is a powerful tool for understanding data collected over time.
  • Many statistical models assume that the data are stationary.
  • A stationary time series has constant statistical properties over time.

  • In sports data, stationarity is critical for analysing performance trends, predicting future outcomes, and detecting anomalies.

    • For example, when analysing an athlete’s seasonal performance, ensuring stationarity can help make meaningful comparisons over different periods.

What is stationarity?

  • A time series is stationary if its statistical properties (mean, variance, and autocorrelation) remain constant over time.

  • Stationarity is essential because many statistical techniques assume that the underlying data-generating process doesn’t change over time.

Visual Representation

  • A non-stationary series often exhibits trends or changing variance.
  • A stationary series fluctuates around a constant mean.

Key characteristics of stationary series

  • Constant Mean: The average value remains stable.
  • Constant Variance: The dispersion around the mean does not change.
  • Autocorrelation Structure Does Not Change: The relationship between observations remains stable over time.
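
Formally, these conditions describe weak stationarity: for all times \(t\) and lags \(k\),

\[
E[Y_t] = \mu, \qquad \mathrm{Var}(Y_t) = \sigma^2, \qquad \mathrm{Cov}(Y_t, Y_{t+k}) = \gamma_k,
\]

where \(\mu\), \(\sigma^2\) and \(\gamma_k\) do not depend on \(t\).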

Why is stationarity important in sport?

  • Performance Metrics: Stationarity helps compare player performances across different seasons without biases due to trends or seasonality.
  • Betting Markets: Betting models assume stationarity to make reliable probability estimates.
  • Injury Recovery Analysis: Monitoring stationary performance metrics helps assess how an athlete is recovering from an injury over time.

Types of Stationarity

We identify various types of stationarity in time series data, including strict, trend, and difference stationarity.

Strict stationarity

  • The statistical properties of the process generating the time series do not depend on time at all.

  • All moments of the series, such as mean, variance, autocorrelation, etc., are constant over time and don’t depend on the specific time at which the series is observed.

Trend stationarity

  • A time series may contain a deterministic trend (linear or otherwise), but becomes stationary once this trend is removed.

  • Fluctuations around the trend do not depend on time, and their statistical properties remain constant.

  • Such a series can be made stationary through detrending, i.e., by subtracting the estimated trend from the data, as in the sketch below.
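
As a minimal sketch, assuming a simple linear trend, a series can be detrended in R by regressing on a time index and keeping the residuals:

set.seed(42)
t_index <- 1:100
ts_trend <- 0.5 * t_index + rnorm(100)   # simulated trend-stationary series
trend_fit <- lm(ts_trend ~ t_index)      # estimate the deterministic trend
ts_detrended <- residuals(trend_fit)     # series with the trend removed
plot.ts(ts_detrended, main = "Detrended Series")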

Difference stationarity

  • A more complex case: the series becomes stationary only once it has been differenced a certain number of times.

  • A common example is an integrated series of order one, \(I(1)\), which requires one round of differencing to achieve stationarity.
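
In symbols, first differencing replaces the series with

\[
\Delta Y_t = Y_t - Y_{t-1},
\]

and an \(I(1)\) series is non-stationary in levels but stationary after applying this difference once.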

Tests for Stationarity

  • To determine stationarity, statistical tests like the Augmented Dickey-Fuller (ADF) test can be applied.

  • The null hypothesis of the ADF test is that the series contains a unit root, i.e., is non-stationary.

  • If the test fails to reject the null hypothesis (p ≥ 0.05), the series is likely non-stationary.

  • If p < 0.05, we reject the null hypothesis and conclude that the series is stationary.
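
In outline, the ADF test fits a regression of the form

\[
\Delta Y_t = \alpha + \beta t + \gamma Y_{t-1} + \sum_{i=1}^{p} \delta_i \, \Delta Y_{t-i} + \varepsilon_t,
\]

and tests \(H_0: \gamma = 0\) (unit root, non-stationary) against \(H_1: \gamma < 0\) (stationary), as in the example output below.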


    Augmented Dickey-Fuller Test

data:  ts_stationary
Dickey-Fuller = -4.713, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary

Transformations for Achieving Stationarity

If a time series is non-stationary, various transformations can be applied:

- Log transformation: Stabilises variance.

- Differencing: Removes trends by subtracting consecutive observations.

These transformations are especially useful in sport, where performance metrics may show increasing trends due to improved training methods over time.

The role of differencing

Differencing is a key method for making a time series stationary.

  • First-order differencing removes linear trends.

  • Second-order differencing removes quadratic trends; seasonal differencing (subtracting the value from the same point in the previous season) is particularly useful in sports analytics when dealing with seasonal variations in team or player performance.
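
In symbols, second-order differencing applies the difference operator twice:

\[
\Delta^2 Y_t = \Delta Y_t - \Delta Y_{t-1} = Y_t - 2Y_{t-1} + Y_{t-2}.
\]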

Stationarity in Multivariate Time Series Data

Introduction

  • When analysing multiple time series together, such as the relationship between two teams’ performance over time, stationarity is crucial for meaningful comparisons and modelling.

Cointegration

  • Two or more non-stationary series can be combined in a way that results in a stationary series, indicating a long-term equilibrium relationship.

  • In sport, this can be useful in modelling competitive team rivalries or player comparisons.
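
As a simple illustration, if \(X_t\) and \(Y_t\) are both \(I(1)\) but some linear combination

\[
Z_t = Y_t - \beta X_t
\]

is stationary, then the two series are cointegrated and \(\beta\) describes their long-run equilibrium relationship.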

Johansen’s Test

Johansen’s test determines the number of cointegrating relationships in a multivariate system.


###################### 
# Johansen-Procedure # 
###################### 

Test type: maximal eigenvalue statistic (lambda max) , with linear trend 

Eigenvalues (lambda):
[1] 0.4043094 0.0155186

Values of teststatistic and critical values of test:

          test 10pct  5pct  1pct
r <= 1 |  1.53  6.50  8.18 11.65
r = 0  | 50.77 12.91 14.90 19.19

Eigenvectors, normalised to first column:
(These are the cointegration relations)

          ts_y1.l2    ts_y2.l2
ts_y1.l2  1.000000  1.00000000
ts_y2.l2 -1.036377 -0.06746412

Weights W:
(This is the loading matrix)

           ts_y1.l2    ts_y2.l2
ts_y1.d -0.06177683 -0.05586172
ts_y2.d  1.13444082 -0.06325464

Interpretation

  • r = 0 (No cointegration): The test statistic (50.77) is greater than the 5% critical value (14.90), so we reject the null hypothesis. This suggests that at least one cointegration relationship exists.

  • r ≤ 1 (At most one cointegration relationship): The test statistic (1.53) is less than the 5% critical value (8.18), so we fail to reject the null hypothesis. Together with the rejection at r = 0, this suggests there is exactly one cointegration relationship.

Vector Autoregression (VAR)

If no cointegration exists, a Vector Autoregression (VAR) model can be used to analyse interdependencies among multiple stationary time series.
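
A VAR of order \(p\) expresses each series as a linear function of the past \(p\) values of all series in the system:

\[
\mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + \cdots + A_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t,
\]

where \(\mathbf{y}_t\) is the vector of observations at time \(t\), the \(A_i\) are coefficient matrices and \(\boldsymbol{\varepsilon}_t\) is white noise. The output below is for a VAR(2) with a constant.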


VAR Estimation Results:
========================= 
Endogenous variables: ts_y1, ts_y2 
Deterministic variables: const 
Sample size: 98 
Log Likelihood: -331.281 
Roots of the characteristic polynomial:
0.9462 0.3214 0.3214 0.05576
Call:
VAR(y = data_var, p = 2, type = "const")


Estimation results for equation ts_y1: 
====================================== 
ts_y1 = ts_y1.l1 + ts_y2.l1 + ts_y1.l2 + ts_y2.l2 + const 

         Estimate Std. Error t value Pr(>|t|)    
ts_y1.l1  0.94911    0.11146   8.516 2.81e-13 ***
ts_y2.l1  0.01319    0.04908   0.269   0.7887    
ts_y1.l2 -0.06675    0.11925  -0.560   0.5770    
ts_y2.l2  0.05460    0.04893   1.116   0.2673    
const     0.23285    0.13914   1.673   0.0976 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.9254 on 93 degrees of freedom
Multiple R-Squared: 0.8528, Adjusted R-squared: 0.8465 
F-statistic: 134.7 on 4 and 93 DF,  p-value: < 2.2e-16 


Estimation results for equation ts_y2: 
====================================== 
ts_y2 = ts_y1.l1 + ts_y2.l1 + ts_y1.l2 + ts_y2.l2 + const 

         Estimate Std. Error t value Pr(>|t|)    
ts_y1.l1  1.11676    0.25635   4.356  3.4e-05 ***
ts_y2.l1 -0.12708    0.11288  -1.126    0.263    
ts_y1.l2 -0.04558    0.27429  -0.166    0.868    
ts_y2.l2 -0.04436    0.11253  -0.394    0.694    
const     0.09358    0.32004   0.292    0.771    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 2.129 on 93 degrees of freedom
Multiple R-Squared: 0.5069, Adjusted R-squared: 0.4857 
F-statistic: 23.91 on 4 and 93 DF,  p-value: 1.288e-13 



Covariance matrix of residuals:
       ts_y1  ts_y2
ts_y1 0.8564 0.7706
ts_y2 0.7706 4.5307

Correlation matrix of residuals:
       ts_y1  ts_y2
ts_y1 1.0000 0.3912
ts_y2 0.3912 1.0000

Conclusion

  • Stationarity is crucial for time series modelling.
  • Differencing and transformations help achieve stationarity.
  • Cointegration tests are necessary for multivariate analysis.
  • VAR models are used when series are not cointegrated.

Implementation in R

Stationarity?

Plotting a Time Series

library(tseries)
library(ggplot2)
library(forecast)

set.seed(123)
ts_data <- cumsum(rnorm(100))  # Simulated random walk
autoplot(ts(ts_data), main="Non-Stationary Time Series")

  • The plot shows a non-stationary time series with an upward trend.

  • The mean and variance are not constant over time.

Tests for Stationarity

Augmented Dickey-Fuller (ADF) Test

adf_test <- adf.test(ts_data)
print(adf_test)

    Augmented Dickey-Fuller Test

data:  ts_data
Dickey-Fuller = -1.8871, Lag order = 4, p-value = 0.6234
alternative hypothesis: stationary

  • If p-value < 0.05, the series is stationary.

  • If p-value ≥ 0.05, the series is non-stationary, requiring transformations.

Transformations for Achieving Stationarity

Log Transformation

ts_log <- log(ts_data - min(ts_data) + 1)

library(ggplot2)
library(forecast)
library(gridExtra)

set.seed(123)
ts_data <- cumsum(rnorm(100)) 
ts_log <- log(ts_data - min(ts_data) + 1)  # Log transformation

p1 <- autoplot(ts(ts_data)) + ggtitle("Original Time Series")
p2 <- autoplot(ts(ts_log)) + ggtitle("Log-Transformed Series")
grid.arrange(p1, p2, ncol = 2)

  • Log transformation stabilises variance.

  • The new series should have reduced heteroscedasticity.

The Role of Differencing

ts_diff <- diff(ts_data)
autoplot(ts(ts_diff), main="Differenced Time Series")

adf_test_diff <- adf.test(ts_diff)
print(adf_test_diff)

    Augmented Dickey-Fuller Test

data:  ts_diff
Dickey-Fuller = -4.5735, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary

  • If p-value < 0.05, the differenced series is now stationary.
  • If it is still non-stationary, consider second-order differencing, as in the sketch below.
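
As a minimal sketch, second-order differencing can be applied with the differences argument of diff(), reusing ts_data from above:

ts_diff2 <- diff(ts_data, differences = 2)   # difference the series twice
autoplot(ts(ts_diff2), main = "Second-Differenced Time Series")
adf.test(ts_diff2)                           # re-check stationarity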

Stationarity in Multivariate Time Series Data

Cointegration

library(urca)
ts_y1 <- cumsum(rnorm(100))  # Simulated series 1
ts_y2 <- cumsum(rnorm(100))  # Simulated series 2
coint_test <- ca.jo(cbind(ts_y1, ts_y2), type="trace", ecdet="none", K=2)
summary(coint_test)

###################### 
# Johansen-Procedure # 
###################### 

Test type: trace statistic , with linear trend 

Eigenvalues (lambda):
[1] 0.0798509295 0.0002690991

Values of teststatistic and critical values of test:

         test 10pct  5pct  1pct
r <= 1 | 0.03  6.50  8.18 11.65
r = 0  | 8.18 15.66 17.95 23.52

Eigenvectors, normalised to first column:
(These are the cointegration relations)

            ts_y1.l2  ts_y2.l2
ts_y1.l2  1.00000000  1.000000
ts_y2.l2 -0.08558116 -4.289539

Weights W:
(This is the loading matrix)

           ts_y1.l2      ts_y2.l2
ts_y1.d -0.09647098  0.0004160002
ts_y2.d -0.03281248 -0.0012974178

  • If the trace statistic exceeds the critical value, we reject the null hypothesis and conclude that a cointegration relationship exists.

  • Here the trace statistic for r = 0 (8.18) is below the 5% critical value (17.95), so we fail to reject the null: there is no evidence of cointegration between these two simulated random walks.

Johansen’s Test

johansen_test <- ca.jo(cbind(ts_y1, ts_y2), type="eigen", ecdet="none", K=2)
summary(johansen_test)

###################### 
# Johansen-Procedure # 
###################### 

Test type: maximal eigenvalue statistic (lambda max) , with linear trend 

Eigenvalues (lambda):
[1] 0.0798509295 0.0002690991

Values of teststatistic and critical values of test:

         test 10pct  5pct  1pct
r <= 1 | 0.03  6.50  8.18 11.65
r = 0  | 8.16 12.91 14.90 19.19

Eigenvectors, normalised to first column:
(These are the cointegration relations)

            ts_y1.l2  ts_y2.l2
ts_y1.l2  1.00000000  1.000000
ts_y2.l2 -0.08558116 -4.289539

Weights W:
(This is the loading matrix)

           ts_y1.l2      ts_y2.l2
ts_y1.d -0.09647098  0.0004160002
ts_y2.d -0.03281248 -0.0012974178

  • Look at the maximal eigenvalue statistic.

  • If the test statistic is greater than the critical value, we reject the null hypothesis and conclude that at least one cointegration relationship exists.

  • Here the statistic for r = 0 (8.16) is below the 5% critical value (14.90), so we fail to reject the null of no cointegration, consistent with the trace test above.

Vector Autoregression (VAR)

library(vars)
data_var <- cbind(ts_y1, ts_y2)
var_model <- VAR(data_var, p=2, type="const")
summary(var_model)

VAR Estimation Results:
========================= 
Endogenous variables: ts_y1, ts_y2 
Deterministic variables: const 
Sample size: 98 
Log Likelihood: -260.941 
Roots of the characteristic polynomial:
1.006 0.9141 0.1742 0.07368
Call:
VAR(y = data_var, p = 2, type = "const")


Estimation results for equation ts_y1: 
====================================== 
ts_y1 = ts_y1.l1 + ts_y2.l1 + ts_y1.l2 + ts_y2.l2 + const 

         Estimate Std. Error t value Pr(>|t|)    
ts_y1.l1  0.79251    0.10343   7.663 1.71e-11 ***
ts_y2.l1  0.08099    0.10382   0.780   0.4373    
ts_y1.l2  0.11143    0.09977   1.117   0.2669    
ts_y2.l2 -0.07452    0.10934  -0.681   0.4973    
const    -0.92240    0.35747  -2.580   0.0114 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.9468 on 93 degrees of freedom
Multiple R-Squared: 0.8797, Adjusted R-squared: 0.8745 
F-statistic:   170 on 4 and 93 DF,  p-value: < 2.2e-16 


Estimation results for equation ts_y2: 
====================================== 
ts_y2 = ts_y1.l1 + ts_y2.l1 + ts_y1.l2 + ts_y2.l2 + const 

         Estimate Std. Error t value Pr(>|t|)    
ts_y1.l1  0.09588    0.10208   0.939    0.350    
ts_y2.l1  1.02739    0.10247  10.026   <2e-16 ***
ts_y1.l2 -0.12999    0.09847  -1.320    0.190    
ts_y2.l2 -0.01901    0.10792  -0.176    0.861    
const    -0.20407    0.35283  -0.578    0.564    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.9345 on 93 degrees of freedom
Multiple R-Squared: 0.904,  Adjusted R-squared: 0.8998 
F-statistic: 218.8 on 4 and 93 DF,  p-value: < 2.2e-16 



Covariance matrix of residuals:
        ts_y1   ts_y2
ts_y1 0.89648 0.02794
ts_y2 0.02794 0.87335

Correlation matrix of residuals:
        ts_y1   ts_y2
ts_y1 1.00000 0.03158
ts_y2 0.03158 1.00000

  • The coefficients explain how each variable depends on its own past values and on the past values of the other variables in the system.

  • Model selection (choosing the lag order p) can be done using criteria such as AIC or BIC, as in the sketch below.
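
As a sketch, the vars package’s VARselect() function reports the lag order suggested by each criterion for the data_var object used above:

lag_selection <- VARselect(data_var, lag.max = 8, type = "const")
lag_selection$selection   # lag order chosen by AIC, HQ, SC (BIC) and FPE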