library(nlme) # For GLS models
2 Generate a Dataset
We’ll start by creating a synthetic dataset to simulate a real-world sports scenario.
The dataset includes predictors (experience, training hours, team support, injuries, and motivation) and we want to use these predictors to estimate player performance.
```r
set.seed(123)
n <- 100 # Number of observations to create

player_experience <- runif(n, 1, 15)         # Years of experience, ranging from 1-15
training_hours <- rnorm(n, mean=10, sd=3)    # Weekly training hours
team_support <- sample(1:5, n, replace=TRUE) # Categorical variable: team support rating (1-5)
injuries <- rpois(n, lambda=2)               # Count of injuries
motivation <- rnorm(n, mean=7, sd=1.5)       # A motivation scale (1-10)

# Generate performance scores based on predictors with noise
performance <- 50 + 3*player_experience + 2.5*training_hours + 5*team_support -
  4*injuries + 2*motivation + rnorm(n, mean=0, sd=5)

# Create data frame
sports_data <- data.frame(player_experience, training_hours, team_support,
                          injuries, motivation, performance)
```
As always, we would begin by examining the relationships between different predictors (e.g., experience) and our outcome variable (performance). For example:
```r
library(ggplot2)

# Visualise correlation between experience and performance
ggplot(sports_data, aes(x=player_experience, y=performance)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE, color="blue") +
  labs(title="Player Experience vs Performance",
       x="Experience (years)", y="Performance")
```
3 Fit a Multiple Regression Model
However, the strength of multiple regression is that it allows us to model the relationship between a dependent variable (performance) and multiple independent variables.
3.1 Model code
Here’s the basic model we can use.
It shows that we are regressing performance (dependent variable) on five independent variables: player_experience, training_hours, team_support, injuries, and motivation.
```r
# Create the regression model
model <- lm(performance ~ player_experience + training_hours + team_support +
              injuries + motivation, data=sports_data)
summary(model) # Print a summary of the model
```
3.2 Interpreting the output
3.2.1 Residuals
At the top, we can see information about the residuals. Recall that residuals represent the difference between actual and predicted values.
The key percentiles (Min, 1Q, Median, 3Q, Max) give an idea of how the residuals are distributed.
If the median is close to zero, the residuals are symmetrically distributed.
Large minimum or maximum values indicate potential outliers.
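These quantiles can also be inspected directly from the fitted model (this assumes the `model` object created above):

```r
# Min, quartiles, and max of the residuals, matching the summary output
summary(residuals(model))
```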
3.2.2 Coefficients
Next, we see the model coefficients.
The ‘Estimate’ is the estimated impact of each predictor on performance.
For example, if the coefficient for player_experience is 2.92, then for every additional year of experience, performance increases by 2.92 points, holding all else constant.
The statistical significance (p-values, shown as Pr(>|t|)) indicates whether each predictor significantly affects performance (*** means p-value < 0.001, which is highly significant).
If a variable has a p-value > 0.05, it may not be a meaningful predictor.
In this case all predictors are statistically significant (p-values < 0.001), meaning they have a real impact on performance.
3.2.3 Residual standard error (RSE)
What is RSE?
Residual Standard Error (RSE) measures the average deviation of the observed values from the fitted regression line. It quantifies the typical size of the residuals.
In this case, an RSE of 5.24 means that, on average, the predicted performance values differ from the actual values by about 5.24 points.
While R² tells us how well the model explains variation, RSE gives an absolute measure of model fit in the original units of performance.
Guidelines for Interpretation
Lower RSE is better, indicating that the model’s predictions are closer to actual values. Compare RSE with the range of the response variable (performance):
If performance ranges from 40 to 100, an RSE of 5.24 is quite reasonable (~5-10% of the range).
If RSE is too high, the model may not be explaining enough variance, or there might be missing predictors.
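To apply these guidelines, the RSE and the range of the response can be extracted directly (assuming the `model` and `sports_data` objects from above):

```r
# Residual standard error of the fitted model
sigma(model)

# Range of the response variable, for context
range(sports_data$performance)
```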
How to Improve RSE?
Include additional relevant predictors.
Check and remove influential outliers.
Consider non-linear transformations if relationships are non-linear.
3.3 R-Squared and Adjusted R-Squared
These numbers tell us some important information about how effective our model is in predicting the outcome (in this case, performance).
R-squared (0.905) means that the model explains about 90% of the variance in performance.
The adjusted R-squared (0.899) adjusts for the number of predictors and is slightly lower but still high, indicating a strong model fit.
3.3.1 Interpretation
Higher values (close to 1) indicate that the model explains most of the variation in performance.
If R-squared were much lower (e.g., 0.3), it would indicate a weak model with limited explanatory power.
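Both values can be pulled out of the model summary directly (assuming the `model` object from above):

```r
# Extract R-squared and adjusted R-squared from the model summary
summary(model)$r.squared
summary(model)$adj.r.squared
```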
3.4 F-statistic
The F-statistic tests whether at least one predictor significantly improves the model.
A p-value < 2.2e-16 indicates the model is statistically significant overall.
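The F-statistic and its p-value can also be computed by hand from the summary object; a sketch, assuming the `model` object from above:

```r
# The F-statistic and its degrees of freedom (a named vector)
f <- summary(model)$fstatistic

# Overall model p-value: upper tail of the F distribution
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```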
3.5 Summary of model interpretation
All our predictors significantly impact performance.
Player experience, training hours, and team support have the strongest positive effects.
Injuries negatively affect performance.
Motivation contributes positively but less strongly than other factors.
The model explains 90% of the variance in performance, suggesting a good fit.
The overall model is statistically significant (p-value < 0.001).
4 Check Assumptions of Regression
Before interpreting the model, we should check key assumptions to ensure the validity of our findings.
4.1 Linearity (Residuals vs Fitted Plot)
Assumption: The relationship between independent variables and the dependent variable is linear.
Interpretation: The residuals vs fitted plot should show no clear pattern; a random scatter suggests linearity holds.
```r
plot(model, which=1)
```
4.2 Multicollinearity (Variance Inflation Factor)
In multiple regression, we assume that our independent variables are not too highly correlated with each other (multicollinearity).
A Variance Inflation Factor (VIF) greater than 10 indicates severe multicollinearity, which can distort coefficient estimates.
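VIFs are not part of the `summary()` output. One standard option (an assumption here, since the original code chunk is not shown) is the `vif()` function from the car package:

```r
library(car) # assuming the car package is installed

# Variance inflation factors for each predictor in the model
vif(model)   # values well below 10 suggest multicollinearity is not a concern
```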
5 Sidenote: Interaction Effects (Experience * Training Hours)
5.1 What are interaction effects?
Interaction effects occur when the effect of one predictor variable on the dependent variable depends on the value of another predictor. In other words, the relationship between player_experience and performance might be different depending on training_hours.
For example:
If training_hours is high, experience might have a stronger positive impact on performance.
If training_hours is low, experience might not contribute as much to performance.
If an interaction exists, it means that the combined effect of experience and training is not simply additive; one influences how the other works.
5.2 How do we test for interaction?
We can include an interaction term in our regression model using the * operator.
player_experience * training_hours expands to:
player_experience
training_hours
player_experience:training_hours (the interaction term)
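A sketch of how this model could be fitted (the original chunk is not shown, so the exact specification is an assumption; here only the two interacting predictors are included):

```r
# Fit a model with an interaction between experience and training hours
interaction_model <- lm(performance ~ player_experience * training_hours,
                        data = sports_data)
summary(interaction_model)
```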
5.3 Interpreting the output
The coefficient for player_experience (2.56) tells us the effect of experience when training_hours is zero.
The coefficient for training_hours (2.24) tells us the effect of training when experience is zero.
The interaction term (player_experience:training_hours = 0.04) is NOT statistically significant (p > 0.05), indicating that experience and training together do NOT impact performance in a non-additive way.
A positive interaction coefficient (0.04) means that higher experience amplifies the effect of training. However, this interaction effect is not significant.
5.4 Summary of interaction effects
If interaction effects exist, failing to account for them can mislead our interpretation. The effect of one predictor can change based on another predictor, leading to incorrect conclusions if ignored.
If the interaction term is significant, we must include it in the model. If it’s not significant, the simpler model without interaction should be preferred (as is the case here).
6 Model Estimation Techniques
Different estimation techniques are used to handle different statistical issues:
Ordinary Least Squares (OLS): The default method used in lm().
Generalised Least Squares (GLS): Used when errors are correlated.
Weighted Least Squares (WLS): Accounts for heteroscedasticity.
6.1 Ordinary Least Squares (OLS)
OLS is the default estimation technique within the ‘lm()’ function and should be used unless you suspect there are issues within your data.
6.2 Generalised Least Squares (GLS)
GLS handles situations where we have correlated residuals:
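The original GLS chunk is not shown, so the following is a sketch using `gls()` from the nlme package loaded earlier; the AR(1) correlation structure is an assumption chosen purely for illustration:

```r
library(nlme)

# Refit the model with GLS, allowing AR(1)-correlated residuals
gls_model <- gls(performance ~ player_experience + training_hours + team_support +
                   injuries + motivation,
                 data = sports_data,
                 correlation = corAR1()) # assumed correlation structure
summary(gls_model)
```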
In our example here, using different estimation techniques doesn’t change the message regarding the significance of our predictor variables, but it does change quantities such as the R² and other fit statistics.
7 Model Diagnostics
Model diagnostics help identify issues that could affect model validity.
For example:
Residual Analysis examines normality and homoscedasticity.
Influence Measures detect outliers that may impact results.
Specification Error Tests check if important variables are missing.
7.1 Residual Analysis (Normality and Homoscedasticity)
The Q-Q plot checks if residuals follow a normal distribution. If points fall roughly along a straight line, normality holds.
```r
par(mfrow=c(1,2))
plot(model, which=2) # Q-Q Plot
```
The Scale-Location plot checks for homoscedasticity (constant variance of residuals). If the red trend line is relatively flat, homoscedasticity holds.
```r
plot(model, which=3) # Scale-Location Plot
```
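Beyond the visual check, a formal test for heteroscedasticity (not part of the original tutorial; added here as a supplement) is the Breusch-Pagan test from the lmtest package:

```r
library(lmtest) # assuming the lmtest package is installed

# Breusch-Pagan test for heteroscedasticity in the residuals
bptest(model)   # a large p-value (> 0.05) is consistent with constant variance
```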
7.2 Influence Measures (Detecting Outliers)
Influence plots help identify observations that have a significant effect on the model.
If certain points are far from others, they may be outliers influencing regression results.
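The original influence-plot chunk is not shown; Cook’s distance, available in base R, is one standard way to flag influential observations:

```r
# Cook's distance for each observation
plot(model, which=4)

# A common rule of thumb flags points with Cook's distance above 4/n
cooks <- cooks.distance(model)
which(cooks > 4 / length(cooks))
```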