Logistic Regression - Demonstration

In this demonstration, I’ll walk through a logistic regression analysis using appropriate R code.

1 Load Relevant Libraries

# Load required libraries
library(ggplot2)
library(caret)
Loading required package: lattice
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
library(ROCR)

2 Generate a Dataset

We’ll start by creating a synthetic dataset to simulate a real-world scenario.

The dataset includes predictors (experience, training hours, team support, injuries, and motivation), and we want to predict whether a player passes or fails based on these predictors.

Note that this outcome is binary; it is therefore appropriate to use logistic rather than linear regression.

rm(list=ls())

set.seed(123)
n <- 200  # Number of observations

# Predictors
player_experience <- runif(n, 1, 15)
training_hours <- rnorm(n, mean = 10, sd = 3)
team_support <- sample(1:5, n, replace=TRUE)
injuries <- rpois(n, lambda=2)
motivation <- rnorm(n, mean = 7, sd = 1.5)

# We aim for a roughly balanced pass/fail distribution while
# maintaining significant effects for some predictors.

# Adjusted intercept and coefficients
# The chosen values aim for an average logit ~ 0
# at typical predictor values, giving ~50% pass/fail.
logit_prob <- -8 + 0.3 * player_experience + 0.25 * training_hours + 
              0.5 * team_support - 0.25 * injuries + 0.3 * motivation

prob_pass <- 1 / (1 + exp(-logit_prob))

# Binary outcome: pass (1) or fail (0)
pass_fail <- rbinom(n, 1, prob_pass)

# Create data frame
sports_data <- data.frame(
  player_experience, training_hours, team_support, 
  injuries, motivation, pass_fail = factor(pass_fail)
)

# Label the factor levels
levels(sports_data$pass_fail) <- c("Fail", "Pass")

3 Exploratory Data Analysis

Before modeling, we always want to explore our data.

For example, I’m interested in looking at the number of fails vs. the number of passes:

table(sports_data$pass_fail)

Fail Pass 
  99  101 

We can also visualise the relationship between predictors and the outcome.

# Violin plot with overlaid boxplot and jittered points
ggplot(sports_data, aes(x=pass_fail, y=player_experience, fill=pass_fail)) +
  geom_violin(trim=FALSE, alpha=0.5) +  
  geom_boxplot(width=0.1, outlier.shape=NA) + 
  geom_jitter(width=0.1, alpha=0.3, color="black") +  
  labs(title="Example: Player Experience vs Pass/Fail", x="Pass/Fail", y="Experience (years)") +
  theme_minimal() +
  theme(legend.position="none")

ggplot(sports_data, aes(x=pass_fail, y=motivation, fill=pass_fail)) +
  geom_violin(trim=FALSE, alpha=0.5) +  
  geom_boxplot(width=0.1, outlier.shape=NA) + 
  geom_jitter(width=0.1, alpha=0.3, color="black") +  
  labs(title="Example: Player Motivation vs Pass/Fail", x="Pass/Fail", y="Motivation") +
  theme_minimal() +
  theme(legend.position="none")

4 Fit a Logistic Regression Model

Logistic regression models the probability that a binary outcome equals 1 (pass).

The following code models the probability of a player passing (binary outcome pass_fail) based on five predictor variables: player_experience, training_hours, team_support, injuries, and motivation.

The family = binomial argument specifies logistic regression, which estimates how each predictor affects the log-odds of passing.

# Fit logistic regression model
logit_model <- glm(pass_fail ~ player_experience + training_hours + team_support + injuries + motivation, 
                   data = sports_data, family = binomial)
summary(logit_model)

Call:
glm(formula = pass_fail ~ player_experience + training_hours + 
    team_support + injuries + motivation, family = binomial, 
    data = sports_data)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -8.85233    1.54291  -5.737 9.61e-09 ***
player_experience  0.32130    0.05466   5.878 4.16e-09 ***
training_hours     0.29553    0.06870   4.302 1.69e-05 ***
team_support       0.37943    0.12975   2.924  0.00345 ** 
injuries          -0.26856    0.13758  -1.952  0.05094 .  
motivation         0.38471    0.12333   3.119  0.00181 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 277.24  on 199  degrees of freedom
Residual deviance: 197.23  on 194  degrees of freedom
AIC: 209.23

Number of Fisher Scoring iterations: 5

5 Model Interpretation

5.1 Coefficients

The Estimate represents the change in the log-odds of passing for a one-unit increase in the predictor.

  • A positive value (e.g., 0.32130 for player_experience) increases the likelihood of passing.

  • A negative value (e.g., -0.26856 for injuries) decreases the likelihood of passing.

The Std. Error measures the variability of the coefficient estimates.

  • Smaller values indicate more precise estimates.

The z value is the test statistic for each coefficient, calculated as Estimate / Std. Error.

  • Larger absolute values indicate stronger evidence that the coefficient differs from zero.

Pr(>|z|) (p-value) indicates the statistical significance of each predictor.
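
Because the coefficients are on the log-odds scale, it can help to translate them into a probability for a concrete case. Here is a minimal sketch that computes the predicted probability of passing by hand for one hypothetical player (the predictor values below are invented for illustration):

# Convert log-odds to a probability for one hypothetical player
b <- coef(logit_model)
logit <- b["(Intercept)"] + b["player_experience"] * 8 + b["training_hours"] * 10 +
         b["team_support"] * 3 + b["injuries"] * 2 + b["motivation"] * 7
plogis(logit)  # equivalent to 1 / (1 + exp(-logit))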

5.2 Model Fit Statistics

  • Null Deviance (277.24) measures the model fit with no predictors (just the intercept).

  • Residual Deviance (197.23) measures model fit after including predictors. A lower value compared to the null deviance suggests that the model improves prediction.

  • AIC (209.23): The Akaike Information Criterion, balancing model fit and complexity. Remember that lower AIC values indicate a better model.
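
The drop from the null deviance to the residual deviance can be tested formally with a likelihood-ratio (chi-squared) test. A minimal sketch, comparing our model against an intercept-only model:

# Intercept-only (null) model for comparison
null_model <- glm(pass_fail ~ 1, data = sports_data, family = binomial)

# Likelihood-ratio test: is the drop in deviance (277.24 to 197.23) significant?
anova(null_model, logit_model, test = "Chisq")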

5.3 Iterations

  • Fisher Scoring iterations (5): the number of iterations the fitting algorithm needed to converge on the maximum-likelihood estimates.

5.4 Odds Ratios

Odds ratios help us interpret the coefficients more intuitively. They aren’t produced automatically, but we can calculate them as follows:

exp(coef(logit_model))  # Convert log-odds to odds ratios
      (Intercept) player_experience    training_hours      team_support 
     0.0001430481      1.3789192101      1.3438325098      1.4614569113 
         injuries        motivation 
     0.7644831381      1.4691824445 

These values are the exponentiated coefficients from the logistic regression model, representing odds ratios.

Odds Ratio > 1: Increases the odds of passing. In our example:

  • player_experience = 1.38: For each additional year of experience, the odds of passing increase by 38%, holding other factors constant.

  • motivation = 1.47: Higher motivation increases the odds of passing by 47% per unit increase.

Odds Ratio < 1: Decreases the odds of passing. In our example:

  • injuries = 0.76: More injuries reduce the odds of passing. Specifically, each additional injury reduces the odds by about 23.5% (1 − 0.765).

(Intercept) = 0.00014: Represents the odds of passing when all predictors are zero. This is not directly interpretable here, since zero isn't a meaningful value for most predictors (can you have zero motivation?).
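
It is also common to report confidence intervals alongside the odds ratios. A quick sketch using Wald intervals from confint.default() (profile-likelihood intervals via confint() are an alternative):

# Odds ratios with 95% Wald confidence intervals
round(exp(cbind(OR = coef(logit_model), confint.default(logit_model))), 3)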

6 Model Diagnostics

To check our model, we can use it to create predicted outcomes, and compare these predictions with what actually happened.

6.1 Predicted Probabilities

This code calculates the predicted probability of passing for each player using the model we created earlier, then plots the distribution of those probabilities.

sports_data$predicted_prob <- predict(logit_model, type = "response")

# Plot predicted probabilities
ggplot(sports_data, aes(x=predicted_prob, fill=pass_fail)) +
  geom_histogram(binwidth=0.05, position="dodge") +
  labs(title="Predicted Probability Distribution", x="Predicted Probability", fill="Pass/Fail")

What this does

  • This calculates the predicted probabilities of passing for each observation in the dataset based on the fitted logistic regression model (logit_model).

  • The predict() function generates predictions from the model:

    • logit_model: The logistic regression model we’ve trained.

    • type = "response": This tells R to return the probabilities (ranging from 0 to 1) rather than the raw log-odds.

How it works

  • For each player in sports_data, the model uses the values of predictors (player_experience, training_hours, etc.) to calculate the probability of the player passing.

  • These probabilities are stored in a new column called predicted_prob in the sports_data dataframe.

Interpreting the Output

  • A value close to 1 means a high probability of passing.

  • A value close to 0 means a high probability of failing.

  • For example:

    • predicted_prob = 0.85 → 85% chance of passing.

    • predicted_prob = 0.30 → 30% chance of passing (70% chance of failing).
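
The same model can also score players it has never seen. A small sketch using a hypothetical new player (the predictor values are invented for illustration):

# Predicted probability of passing for a hypothetical new player
new_player <- data.frame(player_experience = 10, training_hours = 12,
                         team_support = 4, injuries = 1, motivation = 8)
predict(logit_model, newdata = new_player, type = "response")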

6.2 ROC Curve and AUC

  • The ROC curve shows how well the model separates passes from fails across all possible classification thresholds.

  • The AUC (Area Under the Curve) summarises that discriminative ability in a single number.

roc_curve <- roc(sports_data$pass_fail, sports_data$predicted_prob)
Setting levels: control = Fail, case = Pass
Setting direction: controls < cases
plot(roc_curve, main="ROC Curve")

auc(roc_curve)
Area under the curve: 0.8409

In this code, we calculate the Receiver Operating Characteristic (ROC) curve using the roc() function from the pROC package.

  • Inputs:

    • sports_data$pass_fail: The actual outcomes (Pass/Fail).

    • sports_data$predicted_prob: The predicted probabilities of passing from the logistic regression model.

  • Purpose: The ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at different threshold levels, helping evaluate the model’s performance.

Then we plot the ROC curve.

  • Interpretation:

    • The closer the curve is to the top-left corner, the better the model’s performance (indicating high sensitivity and specificity).

    • A diagonal line from (0,0) to (1,1) represents a model with no discriminatory power (random guessing).

Finally, we calculate the AUC.

  • The Area Under the Curve (AUC), which quantifies the overall performance of the model.

  • Interpretation of AUC:

    • AUC = 1.0: Perfect model.

    • AUC > 0.9: Excellent model.

    • AUC between 0.7 and 0.9: Good model.

    • AUC around 0.5: No better than random guessing.
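
If we need a single classification threshold rather than a ranking, pROC can suggest one. A brief sketch using coords(), which by default picks the threshold maximising Youden's J (sensitivity + specificity − 1):

# Threshold that maximises sensitivity + specificity
coords(roc_curve, "best", ret = c("threshold", "sensitivity", "specificity"))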

7 Model Evaluation

7.1 Confusion Matrix

We’ll encounter confusion matrices a lot during this module. Here’s an example of how they can be used to evaluate the output of a logistic regression:

# Classify each player as Pass or Fail using a 0.5 probability threshold
predicted_class <- ifelse(sports_data$predicted_prob > 0.5, "Pass", "Fail")

# Convert to factors with the same levels
predicted_class <- factor(predicted_class, levels = c("Fail", "Pass"))

# Confusion Matrix
confusionMatrix(predicted_class, sports_data$pass_fail)
Confusion Matrix and Statistics

          Reference
Prediction Fail Pass
      Fail   74   21
      Pass   25   80
                                          
               Accuracy : 0.77            
                 95% CI : (0.7054, 0.8264)
    No Information Rate : 0.505           
    P-Value [Acc > NIR] : 1.201e-14       
                                          
                  Kappa : 0.5398          
                                          
 Mcnemar's Test P-Value : 0.6583          
                                          
            Sensitivity : 0.7475          
            Specificity : 0.7921          
         Pos Pred Value : 0.7789          
         Neg Pred Value : 0.7619          
             Prevalence : 0.4950          
         Detection Rate : 0.3700          
   Detection Prevalence : 0.4750          
      Balanced Accuracy : 0.7698          
                                          
       'Positive' Class : Fail            
                                          

7.1.1 Confusion Matrix

Caret reports Fail as the 'Positive' class here (it defaults to the first factor level), so the cells read as follows:

  • True Positives (TP = 74): players correctly predicted to Fail.

  • False Positives (FP = 21): players predicted to Fail who actually Passed.

  • False Negatives (FN = 25): players predicted to Pass who actually Failed.

  • True Negatives (TN = 80): players correctly predicted to Pass.
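
If we would rather have the class-specific metrics reported with Pass as the positive class, caret's confusionMatrix() accepts a positive argument:

# Report sensitivity, precision, etc. with "Pass" as the positive class
confusionMatrix(predicted_class, sports_data$pass_fail, positive = "Pass")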

7.1.2 Model Performance Metrics

Accuracy (0.77) gives us the proportion of correctly classified cases:

  • Indicates that 77% of predictions were correct.

95% CI (0.7054, 0.8264)

  • Confidence interval showing that the true accuracy likely falls between 70.5% and 82.6%.

No Information Rate (0.505)

  • Accuracy you’d get by always predicting the majority class (baseline performance).

P-Value [Acc > NIR] (1.201e-14)

  • Tests if the model’s accuracy is significantly better than random guessing.

  • In this case, a very small p-value indicates high statistical significance.

7.1.3 Agreement Metrics

Kappa (0.5398) measures agreement between predictions and actual outcomes, adjusted for chance.

  • 0.5–0.6: Suggests moderate agreement.

  • Note that values closer to 1 indicate strong agreement.

McNemar’s Test (p = 0.6583):

  • Tests for bias in classification errors.

  • A non-significant p-value (> 0.05) suggests no significant bias in error distribution.

7.1.4 Class-Specific Metrics

Sensitivity (0.7475): (Recall/True Positive Rate)

  • Proportion of actual Fails correctly predicted:

  • Model correctly identifies ~75% of actual Fails.

Specificity (0.7921): (True Negative Rate)

  • Proportion of actual Passes correctly predicted:

  • Model correctly identifies ~79% of actual Passes.

Positive Predictive Value (0.7789): (Precision)

  • Probability that predicted Fails are actually Fails.

Negative Predictive Value (0.7619)

  • Probability that predicted Passes are actually Passes.
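
These metrics can be reproduced directly from the confusion matrix counts, which is a useful sanity check. A short sketch (keeping Fail as the positive class, to match the caret output):

# Recompute the headline metrics by hand from the 2x2 table
tab  <- table(Predicted = predicted_class, Actual = sports_data$pass_fail)
sens <- tab["Fail", "Fail"] / sum(tab[, "Fail"])  # 74 / 99
spec <- tab["Pass", "Pass"] / sum(tab[, "Pass"])  # 80 / 101
acc  <- sum(diag(tab)) / sum(tab)                 # 154 / 200
c(sensitivity = sens, specificity = spec, accuracy = acc)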

7.1.5 Prevalence and Detection Rates

Prevalence (0.4950)

  • Proportion of the dataset that belongs to the Fail class.

Detection Rate (0.3700)

  • Proportion of the entire dataset that was correctly classified as Fail.

Detection Prevalence (0.4750)

  • Proportion of observations predicted as Fail by the model.

Balanced Accuracy (0.7698):

  • Average of sensitivity and specificity; useful for imbalanced datasets.

7.1.6 Interpretation Summary

  • 77% Accuracy: Well above the no-information rate of 50.5%, suggesting solid predictive performance.

  • Balanced Sensitivity & Specificity: Our model seems to handle both classes (Pass/Fail) well.

  • Kappa (0.54): There is moderate agreement beyond chance.

  • Our model gives significant improvement over random guessing (p < 0.001).

7.2 Other plots

There are a number of different plot types we can use to extend our understanding of the logistic regression model.

Some examples are:

7.2.1 Calibration Plot (Reliability Curve)

This shows how well the predicted probabilities align with actual outcomes.

# Hosmer-Lemeshow goodness-of-fit test, followed by a binned calibration plot
library(ResourceSelection)
ResourceSelection 0.3-6      2023-06-27
hoslem.test(as.numeric(sports_data$pass_fail) - 1, sports_data$predicted_prob)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  as.numeric(sports_data$pass_fail) - 1, sports_data$predicted_prob
X-squared = 7.0686, df = 8, p-value = 0.5293
# Create bins for calibration
sports_data$prob_bin <- cut(sports_data$predicted_prob, breaks=seq(0,1,by=0.1), include.lowest=TRUE)

# Plot observed vs predicted probabilities
ggplot(sports_data, aes(x=prob_bin, fill=pass_fail)) +
  geom_bar(position="fill") +
  labs(title="Calibration Plot: Predicted vs Observed Outcomes",
       x="Predicted Probability Bins", y="Proportion of Pass/Fail") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=45, hjust=1))

Interpretation:

  • The Hosmer–Lemeshow test is non-significant (p = 0.53), so there is no evidence that the predicted probabilities systematically misfit the observed outcomes.

  • If the model is well-calibrated, the proportion of actual “Pass” outcomes should increase consistently across the predicted-probability bins.
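
A more direct calibration check is to compare the mean predicted probability with the observed pass rate within each bin; points close to the diagonal indicate good calibration. A sketch using base aggregate():

# Observed pass rate vs mean predicted probability within each bin
sports_data$pass_numeric <- as.numeric(sports_data$pass_fail == "Pass")
calib <- aggregate(cbind(observed = pass_numeric, predicted = predicted_prob) ~ prob_bin,
                   data = sports_data, FUN = mean)

ggplot(calib, aes(x = predicted, y = observed)) +
  geom_point(size = 2) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(title = "Calibration: Observed vs Mean Predicted Probability per Bin",
       x = "Mean predicted probability", y = "Observed pass proportion") +
  theme_minimal()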

7.2.2 Density Plot of Predicted Probabilities

This helps see the separation between “Pass” and “Fail” predictions.

# Density plot for visualising separation of predicted probabilities
ggplot(sports_data, aes(x=predicted_prob, fill=pass_fail)) +
  geom_density(alpha=0.5) +
  labs(title="Density Plot of Predicted Probabilities",
       x="Predicted Probability", fill="Actual Outcome") +
  theme_minimal()

Interpretation:

  • Good Model: Distinct peaks for “Pass” and “Fail” with minimal overlap.

  • Poor Model: High overlap between the two distributions.

7.2.3 Precision-Recall Curve

Useful when dealing with imbalanced data, highlighting performance on positive predictions.

Unlike the ROC curve, which focuses on the trade-off between true positive and false positive rates, the PR curve focuses on precision and recall.

# Precision-Recall Curve
pred <- prediction(sports_data$predicted_prob, sports_data$pass_fail)
pr <- performance(pred, measure = "prec", x.measure = "rec")

# Plotting Precision-Recall curve
plot(pr, main="Precision-Recall Curve", col="blue", lwd=2)
abline(h=0.5, col="red", lty=2)  # Baseline precision ~ prevalence of the positive class (about 0.5 here)

Interpretation:

  • The curve should stay well above the baseline (here about 0.5, the prevalence of the “Pass” class) for good performance; its area (AUC-PR) can also be reported, as sketched below.
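
The area under the precision-recall curve (AUC-PR) gives a single-number summary of this plot. A sketch, assuming a reasonably recent version of ROCR (the "aucpr" measure was added in ROCR 1.0-11):

# Area under the precision-recall curve
performance(pred, measure = "aucpr")@y.values[[1]]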

8 Interaction Effects

We can also explore interaction effects.

logit_interaction_model <- glm(pass_fail ~ player_experience * training_hours + team_support + injuries + motivation, 
                               data = sports_data, family = binomial)
summary(logit_interaction_model)

Call:
glm(formula = pass_fail ~ player_experience * training_hours + 
    team_support + injuries + motivation, family = binomial, 
    data = sports_data)

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                      -6.95181    2.00170  -3.473 0.000515 ***
player_experience                 0.05275    0.19951   0.264 0.791492    
training_hours                    0.10716    0.15054   0.712 0.476548    
team_support                      0.38320    0.13078   2.930 0.003388 ** 
injuries                         -0.25784    0.13902  -1.855 0.063648 .  
motivation                        0.38244    0.12263   3.119 0.001816 ** 
player_experience:training_hours  0.02734    0.01998   1.368 0.171168    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 277.24  on 199  degrees of freedom
Residual deviance: 195.27  on 193  degrees of freedom
AIC: 209.27

Number of Fisher Scoring iterations: 5

This model includes an interaction term between player_experience and training_hours, written as player_experience * training_hours in the formula. It helps identify whether the effect of one variable on the outcome depends on the level of the other.

8.1 Interpretation

8.1.1 Main Effects

  • Team Support (p = 0.003) and Motivation (p = 0.0018) are significant predictors, meaning they have a strong, independent effect on the likelihood of passing.
  • Injuries are marginally significant (p ≈ 0.06), suggesting a potential negative effect on passing.

8.1.2 Interaction Effect (player_experience * training_hours)

  • Estimate = 0.02734: Suggests that the positive effect of experience on the log-odds of passing becomes slightly stronger at higher levels of training hours (and vice versa).
  • p-value = 0.171: This is NOT statistically significant (p > 0.05), meaning there’s no strong evidence that the effect of experience on passing depends on training hours (or vice versa).
  • Practical Meaning: The relationship between experience and passing doesn’t change much at different levels of training hours.

8.1.3 Key Takeaways

  • The interaction between experience and training hours is not significant, suggesting that their combined effect is close to additive (no meaningful synergy).

  • The main effects of team support and motivation are important drivers of passing success.

  • We might consider removing the interaction term to simplify the model, as it doesn't improve fit significantly (see the likelihood-ratio comparison below).
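
One way to check this formally is a likelihood-ratio test comparing the interaction model with the original main-effects model; a non-significant result supports dropping the interaction term. A short sketch:

# Does adding the interaction term significantly improve fit?
anova(logit_model, logit_interaction_model, test = "Chisq")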