In sports data, stationarity is critical for analysing performance trends, predicting future outcomes, and detecting anomalies.
A time series is stationary if its statistical properties (mean, variance, and autocorrelation) remain constant over time.
Stationarity is essential because many statistical techniques assume that the underlying data-generating process doesn’t change over time.
We identify various types of stationarity in time series data, including strict, trend, and difference stationarity.
- Strict stationarity: the statistical properties of the process generating the time series do not depend on time at all. All moments of the series, such as the mean, variance, and autocorrelation, are constant over time and do not depend on the specific time at which the series is observed.
- Trend stationarity: the time series may have a deterministic trend (linear or otherwise), but becomes strictly stationary once this trend is removed. Fluctuations around the trend do not depend on time, and their statistical properties remain constant; such a series can be made stationary through detrending, i.e., by subtracting the estimated trend from the data.
- Difference stationarity: more complex, involving time series that become stationary once they are differenced a certain number of times. A common example is an integrated series of order one, \(I(1)\), which requires one differencing to achieve stationarity (see the sketch below).
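As a minimal illustration of difference stationarity (using freshly simulated data, not any series from this section), a random walk is \(I(1)\): non-stationary in levels but stationary after one differencing:
library(tseries)
set.seed(1)                 # hypothetical seed, purely for reproducibility
rw <- cumsum(rnorm(200))    # simulated random walk: an I(1) series
adf.test(rw)                # typically fails to reject the unit-root null (non-stationary)
adf.test(diff(rw))          # after one differencing, the test typically rejects (stationary)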
To determine stationarity, statistical tests such as the Augmented Dickey-Fuller (ADF) test can be applied.
The null hypothesis of the ADF test is that the series has a unit root (i.e., is non-stationary); if the test fails to reject this null hypothesis, the series is likely non-stationary.
In the output below, p < 0.05, so we reject the null hypothesis and conclude that the series is stationary.
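A minimal sketch of how this kind of output can be produced (assuming ts_stationary is an existing time-series object; tseries::adf.test is one common choice):
library(tseries)
adf.test(ts_stationary)   # H0: unit root (non-stationary); a small p-value suggests stationarity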
Augmented Dickey-Fuller Test
data: ts_stationary
Dickey-Fuller = -4.713, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
If a time series is non-stationary, various transformations can be applied:
- Log transformation: Stabilises variance.
- Differencing: Removes trends by subtracting consecutive observations.
These transformations are especially useful in sport, where performance metrics may show increasing trends due to improved training methods over time.
Differencing is a key method for making a time series stationary.
First-order differencing removes linear trends.
Second-order differencing removes quadratic trends. This is particularly useful in sports analytics when dealing with seasonal variations in team or player performance.
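A brief sketch of both transformations in R (ts_data is a placeholder for a non-stationary series, matching the name used in the example later in this section):
ts_log   <- log(ts_data - min(ts_data) + 1)   # log transform; the shift keeps all values positive
ts_diff1 <- diff(ts_log)                      # first-order differencing: removes a linear trend
ts_diff2 <- diff(ts_log, differences = 2)     # second-order differencing: removes a quadratic trend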
Cointegration occurs when two or more non-stationary series can be combined in a way that results in a stationary series, indicating a long-term equilibrium relationship.
In sport, this can be useful in modelling competitive team rivalries or player comparisons.
Johansen’s test determines the number of cointegrating relationships in a multivariate system.
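A sketch of how this kind of output can be produced (the exact call is not shown in the source; ts_y1 and ts_y2 are assumed to hold the two series, and type = "eigen" requests the maximal eigenvalue statistic):
library(urca)
coint_eigen <- ca.jo(cbind(ts_y1, ts_y2), type = "eigen", ecdet = "none", K = 2)
summary(coint_eigen)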
######################
# Johansen-Procedure #
######################
Test type: maximal eigenvalue statistic (lambda max) , with linear trend
Eigenvalues (lambda):
[1] 0.4043094 0.0155186
Values of teststatistic and critical values of test:
test 10pct 5pct 1pct
r <= 1 | 1.53 6.50 8.18 11.65
r = 0 | 50.77 12.91 14.90 19.19
Eigenvectors, normalised to first column:
(These are the cointegration relations)
ts_y1.l2 ts_y2.l2
ts_y1.l2 1.000000 1.00000000
ts_y2.l2 -1.036377 -0.06746412
Weights W:
(This is the loading matrix)
ts_y1.l2 ts_y2.l2
ts_y1.d -0.06177683 -0.05586172
ts_y2.d 1.13444082 -0.06325464
r = 0 (No cointegration): The test statistic (50.77) is greater than the 5% critical value (14.90), so we reject the null hypothesis. This suggests that at least one cointegration relationship exists.
r ≤ 1 (At most one cointegration relationship): The test statistic (1.53) is less than the 5% critical value (8.18), so we fail to reject the null hypothesis. Taken together with the previous result, this suggests that exactly one cointegration relationship exists.
If no cointegration exists, a Vector Autoregression (VAR) model can be used to analyse interdependencies among multiple stationary time series.
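A sketch of the corresponding model fit (the call is not shown at this point in the source, but the Call line in the output below suggests something along these lines):
library(vars)
data_var  <- cbind(ts_y1, ts_y2)
var_model <- VAR(data_var, p = 2, type = "const")   # VAR(2) with a constant term
summary(var_model)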
VAR Estimation Results:
=========================
Endogenous variables: ts_y1, ts_y2
Deterministic variables: const
Sample size: 98
Log Likelihood: -331.281
Roots of the characteristic polynomial:
0.9462 0.3214 0.3214 0.05576
Call:
VAR(y = data_var, p = 2, type = "const")
Estimation results for equation ts_y1:
======================================
ts_y1 = ts_y1.l1 + ts_y2.l1 + ts_y1.l2 + ts_y2.l2 + const
Estimate Std. Error t value Pr(>|t|)
ts_y1.l1 0.94911 0.11146 8.516 2.81e-13 ***
ts_y2.l1 0.01319 0.04908 0.269 0.7887
ts_y1.l2 -0.06675 0.11925 -0.560 0.5770
ts_y2.l2 0.05460 0.04893 1.116 0.2673
const 0.23285 0.13914 1.673 0.0976 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9254 on 93 degrees of freedom
Multiple R-Squared: 0.8528, Adjusted R-squared: 0.8465
F-statistic: 134.7 on 4 and 93 DF, p-value: < 2.2e-16
Estimation results for equation ts_y2:
======================================
ts_y2 = ts_y1.l1 + ts_y2.l1 + ts_y1.l2 + ts_y2.l2 + const
Estimate Std. Error t value Pr(>|t|)
ts_y1.l1 1.11676 0.25635 4.356 3.4e-05 ***
ts_y2.l1 -0.12708 0.11288 -1.126 0.263
ts_y1.l2 -0.04558 0.27429 -0.166 0.868
ts_y2.l2 -0.04436 0.11253 -0.394 0.694
const 0.09358 0.32004 0.292 0.771
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.129 on 93 degrees of freedom
Multiple R-Squared: 0.5069, Adjusted R-squared: 0.4857
F-statistic: 23.91 on 4 and 93 DF, p-value: 1.288e-13
Covariance matrix of residuals:
ts_y1 ts_y2
ts_y1 0.8564 0.7706
ts_y2 0.7706 4.5307
Correlation matrix of residuals:
ts_y1 ts_y2
ts_y1 1.0000 0.3912
ts_y2 0.3912 1.0000
The plot shows a non-stationary time series with an upward trend.
The mean and variance are not constant over time.
If p-value < 0.05, the series is stationary.
If p-value ≥ 0.05, the series is non-stationary, requiring transformations.
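A sketch of the corresponding steps (assuming ts_data holds the simulated non-stationary series under discussion):
library(tseries)
plot(ts_data, main = "Simulated series")   # visual check for a trend and changing variance
adf.test(ts_data)                          # expect p-value >= 0.05 here: fail to reject the unit-root null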
ts_log <- log(ts_data - min(ts_data) + 1)   # shift so all values are positive, then take logs
Log transformation stabilises variance.
The new series should have reduced heteroscedasticity.
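A possible follow-up (not shown in the source) is to difference the logged series and re-test for stationarity:
library(tseries)
ts_diff <- diff(ts_log)   # first-order differencing of the logged series
adf.test(ts_diff)         # expect p-value < 0.05 after differencing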
library(urca)
ts_y1 <- cumsum(rnorm(100)) # Simulated series 1
ts_y2 <- cumsum(rnorm(100)) # Simulated series 2
coint_test <- ca.jo(cbind(ts_y1, ts_y2), type="trace", ecdet="none", K=2)  # Johansen trace test with 2 lags
summary(coint_test)
######################
# Johansen-Procedure #
######################
Test type: trace statistic , with linear trend
Eigenvalues (lambda):
[1] 0.0798509295 0.0002690991
Values of teststatistic and critical values of test:
test 10pct 5pct 1pct
r <= 1 | 0.03 6.50 8.18 11.65
r = 0 | 8.18 15.66 17.95 23.52
Eigenvectors, normalised to first column:
(These are the cointegration relations)
ts_y1.l2 ts_y2.l2
ts_y1.l2 1.00000000 1.000000
ts_y2.l2 -0.08558116 -4.289539
Weights W:
(This is the loading matrix)
ts_y1.l2 ts_y2.l2
ts_y1.d -0.09647098 0.0004160002
ts_y2.d -0.03281248 -0.0012974178
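The maximal eigenvalue version of the test shown next was presumably obtained by rerunning ca.jo with type = "eigen" (an assumption; the call is not shown in the source):
coint_eigen <- ca.jo(cbind(ts_y1, ts_y2), type = "eigen", ecdet = "none", K = 2)
summary(coint_eigen)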
######################
# Johansen-Procedure #
######################
Test type: maximal eigenvalue statistic (lambda max) , with linear trend
Eigenvalues (lambda):
[1] 0.0798509295 0.0002690991
Values of teststatistic and critical values of test:
test 10pct 5pct 1pct
r <= 1 | 0.03 6.50 8.18 11.65
r = 0 | 8.16 12.91 14.90 19.19
Eigenvectors, normalised to first column:
(These are the cointegration relations)
ts_y1.l2 ts_y2.l2
ts_y1.l2 1.00000000 1.000000
ts_y2.l2 -0.08558116 -4.289539
Weights W:
(This is the loading matrix)
ts_y1.l2 ts_y2.l2
ts_y1.d -0.09647098 0.0004160002
ts_y2.d -0.03281248 -0.0012974178
Look at the maximal eigenvalue statistic.
If the test statistic is greater than the critical value, we reject the null hypothesis and conclude that at least one cointegration relationship exists.
Here, the statistic for r = 0 (8.16) is below the 5% critical value (14.90), so we fail to reject the null hypothesis: there is no evidence of cointegration between the two simulated series.
library(vars)
data_var <- cbind(ts_y1, ts_y2)
var_model <- VAR(data_var, p=2, type="const")  # VAR with 2 lags and a constant term
summary(var_model)
VAR Estimation Results:
=========================
Endogenous variables: ts_y1, ts_y2
Deterministic variables: const
Sample size: 98
Log Likelihood: -260.941
Roots of the characteristic polynomial:
1.006 0.9141 0.1742 0.07368
Call:
VAR(y = data_var, p = 2, type = "const")
Estimation results for equation ts_y1:
======================================
ts_y1 = ts_y1.l1 + ts_y2.l1 + ts_y1.l2 + ts_y2.l2 + const
Estimate Std. Error t value Pr(>|t|)
ts_y1.l1 0.79251 0.10343 7.663 1.71e-11 ***
ts_y2.l1 0.08099 0.10382 0.780 0.4373
ts_y1.l2 0.11143 0.09977 1.117 0.2669
ts_y2.l2 -0.07452 0.10934 -0.681 0.4973
const -0.92240 0.35747 -2.580 0.0114 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9468 on 93 degrees of freedom
Multiple R-Squared: 0.8797, Adjusted R-squared: 0.8745
F-statistic: 170 on 4 and 93 DF, p-value: < 2.2e-16
Estimation results for equation ts_y2:
======================================
ts_y2 = ts_y1.l1 + ts_y2.l1 + ts_y1.l2 + ts_y2.l2 + const
Estimate Std. Error t value Pr(>|t|)
ts_y1.l1 0.09588 0.10208 0.939 0.350
ts_y2.l1 1.02739 0.10247 10.026 <2e-16 ***
ts_y1.l2 -0.12999 0.09847 -1.320 0.190
ts_y2.l2 -0.01901 0.10792 -0.176 0.861
const -0.20407 0.35283 -0.578 0.564
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9345 on 93 degrees of freedom
Multiple R-Squared: 0.904, Adjusted R-squared: 0.8998
F-statistic: 218.8 on 4 and 93 DF, p-value: < 2.2e-16
Covariance matrix of residuals:
ts_y1 ts_y2
ts_y1 0.89648 0.02794
ts_y2 0.02794 0.87335
Correlation matrix of residuals:
ts_y1 ts_y2
ts_y1 1.00000 0.03158
ts_y2 0.03158 1.00000
The coefficients explain how each variable depends on its past values and other variables in the system.
Model selection (choosing p) can be done using criteria like AIC or BIC.
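For example, the vars package provides VARselect for comparing lag lengths by information criterion (a sketch, reusing the data_var object defined above):
library(vars)
VARselect(data_var, lag.max = 8, type = "const")   # reports AIC, HQ, SC (BIC) and FPE for each lag length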