Multiple Regression - Demonstration

In this demonstration, I’ll walk-through a multiple regression analysis using appropriate R code.

1 Load Relevant Libraries

# Load required libraries
library(ggplot2)
library(car)

Loading required package: carData

library(MASS)
library(lmtest)

Loading required package: zoo


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

library(nlme)  # For GLS models

2 Generate a Dataset

We’ll start by creating a synthetic dataset to simulate a real-world sports scenario.

The dataset includes predictors (experience, training hours, team support, injuries, and motivation) and we want to use these predictors to estimate player performance.

set.seed(123)
n <- 100  # N observations to create
player_experience <- runif(n, 1, 15)  # Years of experience, ranging from 1-15
training_hours <- rnorm(n, mean=10, sd=3)  # Weekly training hours
team_support <- sample(1:5, n, replace=TRUE)  # Categorical variable: Team support rating (1-5)
injuries <- rpois(n, lambda=2)  # Count of injuries
motivation <- rnorm(n, mean=7, sd=1.5)  # A motivation scale (1-10)

# Generate performance scores based on predictors with noise
performance <- 50 + 3 * player_experience + 2.5 * training_hours + 
               5 * team_support - 4 * injuries + 2 * motivation + rnorm(n, mean=0, sd=5)

# Create data frame
sports_data <- data.frame(player_experience, training_hours, team_support, injuries, motivation, performance)

As always, we would begin by examining the relationships between different predictors (e.g., experience) and our outcome variable (performance). For example:

# Visualise correlation between experience and performance
ggplot(sports_data, aes(x=player_experience, y=performance)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE, color="blue") +
  labs(title="Player Experience vs Performance", x="Experience (years)", y="Performance")

`geom_smooth()` using formula = 'y ~ x'

3 Fit a Multiple Regression Model

However, the strength of multiple regression is that it allows us to model the relationship between a dependent variable (performance) and multiple independent variables.

3.1 Model code

Here’s the basic model we can use.

It shows that we are regressing performance (dependent variable) on five independent variables: player_experience, training_hours, team_support, injuries, and motivation.

model <- lm(performance ~ player_experience + training_hours + team_support + injuries + motivation, data=sports_data) # create the regression model
summary(model) # print a summary of the model


Call:
lm(formula = performance ~ player_experience + training_hours + 
    team_support + injuries + motivation, data = sports_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.4340  -3.2130   0.5119   3.3799  12.2918 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        53.6517     3.9952  13.429  < 2e-16 ***
player_experience   2.9232     0.1332  21.947  < 2e-16 ***
training_hours      2.5315     0.1868  13.551  < 2e-16 ***
team_support        4.8071     0.3446  13.948  < 2e-16 ***
injuries           -3.8626     0.3801 -10.162  < 2e-16 ***
motivation          1.5688     0.3588   4.372 3.18e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.239 on 94 degrees of freedom
Multiple R-squared:  0.9045,    Adjusted R-squared:  0.8994 
F-statistic: 178.1 on 5 and 94 DF,  p-value: < 2.2e-16

3.2 Model interpretation

How do we interpret this output?

3.2.1 Residuals

At the top, we can see information about the residuals. Recall that residuals represent the difference between actual and predicted values.

The key percentiles (Min, 1Q, Median, 3Q, Max) give an idea of how the residuals are distributed.
If the median is close to zero, the residuals are symmetrically distributed.
Large minimum or maximum values indicate potential outliers.

3.2.2 Coefficients

Next, we see the model coefficients.

The ‘Estimate’ is the estimated impact of each predictor on performance.

For example, if player_experience = 2.92, for every additional year of experience, performance increases by 2.92 points, holding all else constant.

The statistical significance (p-values, shown as Pr(>|t|) indicates whether each predictor significantly affects performance (*** means p-value < 0.001, which is highly significant).

If a variable has a p-value > 0.05, it may not be a meaningful predictor.

In this case all predictors are statistically significant (p-values < 0.001), meaning they have a real impact on performance.

3.2.3 Residual standard error (RSE)

What is RSE?

Residual Standard Error (RSE) measures the average deviation of the observed values from the fitted regression line. It quantifies the typical size of the residuals.

In this case, an RSE of 5.24 means that, on average, the predicted performance values differ from the actual values by about 5.24 points.

While R² tells us how well the model explains variation, RSE gives an absolute measure of model fit in the original units of performance.

Guidelines for Interpretation

Lower RSE is better, indicating that the model’s predictions are closer to actual values. Compare RSE with the range of response variable (performance):

If performance ranges from 40 to 100, an RSE of 5.24 is quite reasonable (~5-10% of the range).
If RSE is too high, the model may not be explaining enough variance, or there might be missing predictors.

How to Improve RSE?

Include additional relevant predictors.
Check and remove influential outliers.
Consider non-linear transformations if relationships are non-linear.

3.3 R-Squared and Adjusted R-Squared

These numbers tell us some important information about how effective our model is in predicting the outcome (in this case, performance).

R-squared (0.905) means that the model explains 90% of the variance in performance.

Thed adjusted R-squared (0.899) adjusts for the number of predictors and is slightly lower but still high, indicating a strong model fit.

3.3.1 Interpretation

Higher values (close to 1) indicate that the model explains most of the variation in performance.
If R-squared were much lower (e.g., 0.3), it would indicate a weak model with limited explanatory power.

3.4 F-statistic

The F-statistic tests whether at least one predictor significantly improves the model.

A p-value < 2.2e-16 indicates the model is statistically significant overall.

3.5 Summary of model interpretation

All our predictors significantly impact performance.
Player experience, training hours, and team support have the strongest positive effects.
Injuries negatively affect performance.
Motivation contributes positively but less strongly than other factors.
The model explains 90% of the variance in performance, suggesting a good fit.
The overall model is statistically significant (p-value < 0.001).

4 Check Assumptions of Regression

Before interpreting the model, we should check key assumptions to ensure the validity of our findings.

4.1 Linearity (Residuals vs Fitted Plot)

Assumption: The relationship between independent variables and the dependent variable is linear.

Interpretation: The residuals vs fitted plot should show no clear pattern; a random scatter suggests linearity holds.

plot(model, which=1)

4.2 Multicollinearity (Variance Inflation Factor)

In multiple regression, we assume that our independent variables are not too highly correlated with each other (multicollinearity).

A Variance Inflation Factor (VIF) greater than 10 indicates severe multicollinearity, which can distort coefficient estimates.

vif(model)

player_experience    training_hours      team_support          injuries 
         1.018647          1.054350          1.007815          1.013168 
       motivation 
         1.063537

5 Sidenote: Interaction Effects (Experience * Training Hours)

5.1 What are interaction effects?

Interaction effects occur when the effect of one predictor variable on the dependent variable depends on the value of another predictor. In other words, the relationship between player_experience and performance might be different depending on training_hours.

For example:

If training_hours is high, experience might have a stronger positive impact on performance.
If training_hours is low, experience might not contribute as much to performance.
If an interaction exists, it means that the combined effect of experience and training is not simply additive; one influences how the other works.

5.2 How do we test for interaction?

We can include an interaction term in our regression model using the * operator.

player_experience * training_hours expands to:

player_experience
training_hours
player_experience:training_hours (the interaction term)

model_interaction <- lm(performance ~ player_experience * training_hours + team_support + injuries + motivation, data=sports_data)
summary(model_interaction)


Call:
lm(formula = performance ~ player_experience * training_hours + 
    team_support + injuries + motivation, data = sports_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.389  -2.991   0.545   3.298  12.329 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      56.04021    5.11760  10.950  < 2e-16 ***
player_experience                 2.56224    0.49978   5.127 1.60e-06 ***
training_hours                    2.24372    0.42717   5.253 9.47e-07 ***
team_support                      4.80929    0.34546  13.921  < 2e-16 ***
injuries                         -3.83504    0.38274 -10.020  < 2e-16 ***
motivation                        1.62818    0.36828   4.421 2.66e-05 ***
player_experience:training_hours  0.03630    0.04843   0.750    0.455    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.252 on 93 degrees of freedom
Multiple R-squared:  0.9051,    Adjusted R-squared:  0.899 
F-statistic: 147.8 on 6 and 93 DF,  p-value: < 2.2e-16

5.3 Interpreting the output

The coefficient for player_experience (2.56) tells us the effect of experience when training_hours is zero.
The coefficient for training_hours (2.24) tells us the effect of training when experience is zero.
The interaction term (player_experience:training_hours= 0.04) is NOT statistically significant (p > 0.05), indicating that experience and training together do NOT impact performance in a non-additive way.

A positive interaction coefficient (0.04) means that higher experience amplifies the effect of training. However, this interaction effect is not significant.

5.4 Summary of interaction effects

If interaction effects exist, failing to account for them can mislead our interpretation. The effect of one predictor can change based on another predictor, leading to incorrect conclusions if ignored.

If the interaction term is significant, we must include it in the model. If it’s not significant, the simpler model without interaction should be preferred (as is the case here).

6 Model Estimation Techniques

Different estimation techniques are used to handle different statistical issues:

Ordinary Least Squares (OLS): The default method used in lm().
Generalised Least Squares (GLS): Used when errors are correlated.
Weighted Least Squares (WLS): Accounts for heteroscedasticity.

6.1 Ordinary Least Squares (OLS)

OLS is the default estimation technique within the ‘lm()’ function and should be used unless you suspect there are issues within your data.

6.2 Generalised Least Squares (GLS)

GLS handles situations where we have correlated residuals:

gls_model <- gls(performance ~ player_experience + training_hours + team_support + injuries + motivation, data=sports_data)
summary(gls_model)

Generalized least squares fit by REML
  Model: performance ~ player_experience + training_hours + team_support +      injuries + motivation 
  Data: sports_data 
       AIC      BIC    logLik
  626.8707 644.6737 -306.4353

Coefficients:
                     Value Std.Error    t-value p-value
(Intercept)       53.65168  3.995170  13.429137       0
player_experience  2.92325  0.133199  21.946545       0
training_hours     2.53151  0.186819  13.550607       0
team_support       4.80714  0.344641  13.948243       0
injuries          -3.86258  0.380089 -10.162306       0
motivation         1.56884  0.358829   4.372117       0

 Correlation: 
                  (Intr) plyr_x trnng_ tm_spp injurs
player_experience -0.373                            
training_hours    -0.593  0.054                     
team_support      -0.332  0.011  0.037              
injuries          -0.112 -0.027 -0.108  0.024       
motivation        -0.785  0.131  0.200  0.080 -0.033

Standardized residuals:
        Min          Q1         Med          Q3         Max 
-2.56408941 -0.61325485  0.09770125  0.64510595  2.34608687 

Residual standard error: 5.239269 
Degrees of freedom: 100 total; 94 residual

6.3 Weighted Least Squares (WLS)

WLS accounts for heteroscedasticity within the data:

weights <- 1 / abs(residuals(model))
wls_model <- lm(performance ~ player_experience + training_hours + team_support + injuries + motivation, weights=weights, data=sports_data)
summary(wls_model)


Call:
lm(formula = performance ~ player_experience + training_hours + 
    team_support + injuries + motivation, data = sports_data, 
    weights = weights)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-3.6709 -1.7755  0.4871  1.8013  3.4592 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       53.38771    1.70789   31.26   <2e-16 ***
player_experience  2.91290    0.05629   51.75   <2e-16 ***
training_hours     2.56368    0.09347   27.43   <2e-16 ***
team_support       4.83638    0.15930   30.36   <2e-16 ***
injuries          -3.97195    0.21423  -18.54   <2e-16 ***
motivation         1.59119    0.13546   11.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.065 on 94 degrees of freedom
Multiple R-squared:  0.9805,    Adjusted R-squared:  0.9795 
F-statistic: 945.5 on 5 and 94 DF,  p-value: < 2.2e-16

In our example here, using different estimation techniques doesn’t change the message regarding the significance of our predictor variables, but it does change the R2 etc.

7 Model Diagnostics

Model diagnostics help identify issues that could affect model validity.

For example:

Residual Analysis examines normality and homoscedasticity.
Influence Measures detect outliers that may impact results.
Specification Error Tests check if important variables are missing.

7.1 Residual Analysis (Normality and Homoscedasticity)

The Q-Q plot checks if residuals follow a normal distribution. If points fall roughly along a straight line, normality holds.

par(mfrow=c(1,2))
plot(model, which=2)  # Q-Q Plot

The Scale-Location plot checks for homoscedasticity (constant variance of residuals). If the red trend line is relatively flat, homoscedasticity holds.

plot(model, which=3)  # Scale-Location Plot

7.2 Influence Measures (Detecting Outliers)

Influence plots help identify observations that have a significant effect on the model.

If certain points are far from others, they may be outliers influencing regression results.

influencePlot(model)

      StudRes        Hat       CookD
20  0.5224792 0.12501668 0.006551291
22 -2.0548211 0.09011153 0.067383106
46 -2.5304695 0.06558727 0.070836800
86 -2.6909048 0.03175097 0.037110601
89 -0.1662877 0.19130100 0.001101578

7.3 Specification Error Test (RESET Test)

The RESET test checks if important variables are missing or if the functional form is incorrect.

A significant p-value suggests potential misspecification issues. In this case, we appear to be OK!

resettest(model)


    RESET test

data:  model
RESET = 1.1621, df1 = 2, df2 = 92, p-value = 0.3174