Introduction

Multiple regression allows us to relate a numerical response variable to one or more numerical or categorical predictors.

We can use multiple regression models to understand relationships, assess differences, and make predictions.

But what about a situation where the response of interest is categorical and binary?

Today’s Data: A Night to Remember

On April 15, 1912, the famous ocean liner Titanic sank in the North Atlantic after striking an iceberg on its maiden voyage. The dataset titanic.csv contains the survival status and other attributes of individuals aboard the Titanic.

We are interested in investigating the variables that contribute to passenger survival. Do women and children really come first?

Data and Packages

Today we’re using a variety of packages we’ve used before.

Let’s load our data and then look at it.

library(tidyverse)   # read_csv, glimpse, ggplot
library(tidymodels)  # logistic_reg, tidy, glance, augment

titanic <- read_csv("~/titanic.csv")
glimpse(titanic)
## Rows: 887
## Columns: 6
## $ pclass   <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ name     <chr> "Mr. Owen Harris Braund", "Mrs. John Bradley (Florence Briggs…
## $ sex      <chr> "male", "female", "female", "female", "male", "male", "male",…
## $ age      <dbl> 22, 38, 26, 35, 35, 27, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…

Exploratory Data Analysis

The linear model with multiple predictors

\[ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]

\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]

Denote by \(p\) the probability of survival and consider the model below.

\[ p = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon\]

Can you see any problems with this approach?

Linear Regression?

lm_survival <- lm(survived ~ age + sex, data = titanic)
tidy(lm_survival)
## # A tibble: 3 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.752     0.0356      21.1   2.88e-80
## 2 age         -0.000343  0.000979    -0.350 7.26e- 1
## 3 sexmale     -0.551     0.0289     -19.1   3.50e-68

Visualizing the Model

Diagnostics

This isn’t helpful! We need to develop a new tool.

Preliminaries

Odds are sometimes expressed as X : Y and read "X to Y."

Odds are the ratio of successes to failures; values larger than 1 favor a success and values smaller than 1 favor a failure.

If \(P(A) = 1/2\), what are the odds of \(A\)?

If \(P(B) = 1/3\) what are the odds of \(B\)?

An odds ratio is a ratio of odds.
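As a quick numerical check of these definitions, we can compute odds directly in R (the `odds` helper below is just for illustration):

```r
# odds = p / (1 - p): the ratio of success probability to failure probability
odds <- function(p) p / (1 - p)

odds(1/2)              # P(A) = 1/2 gives odds of 1, i.e. 1 : 1
odds(1/3)              # P(B) = 1/3 gives odds of 0.5, i.e. 1 : 2
odds(1/2) / odds(1/3)  # an odds ratio: the odds of A are 2 times the odds of B
```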

More Preliminaries

\[\text{logit}(p) = \text{log}\left(\frac{p}{1-p}\right)\]

The logit takes a value of \(p\) between 0 and 1 and outputs a value between \(-\infty\) and \(\infty\).

The inverse logit (logistic) takes a value between \(-\infty\) and \(\infty\) and outputs a value between 0 and 1.

\[\text{inverse logit}(x) = \frac{e^x}{1+e^x} = \frac{1}{1 + e^{-x}}\]

There is a one-to-one relationship between probabilities and log-odds. If we create a model using the log-odds we can “work backwards” using the logistic function to obtain probabilities between 0 and 1.
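The round trip between probabilities and log-odds is easy to verify numerically; base R also provides these functions as `qlogis()` and `plogis()`. A minimal sketch:

```r
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))  # equivalently 1 / (1 + exp(-x))

logit(0.5)             # 0: even odds map to a log-odds of 0
inv_logit(0)           # 0.5
inv_logit(logit(0.9))  # 0.9 -- the inverse logit undoes the logit
```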

Logistic Regression Model

\[\text{log}\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}\]

Use the inverse logit to find the expression for \(p\).

\[p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}\]

We can use the logistic regression model to obtain predicted probabilities of success for a binary response variable.

Logistic Regression Model

We can fit the model either in the tidymodels framework or directly with the glm function.

fit_1 <- logistic_reg() %>%
  set_engine("glm") %>%  # the glm engine uses family = binomial by default
  fit(as.factor(survived) ~ sex + age, data = titanic)
  
fit_1 %>%
  tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1
logit_mod <- glm(survived ~ sex + age, data = titanic, family = "binomial")
tidy(logit_mod)
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1

And use augment to find predicted log-odds.

pred_log_odds <- augment(logit_mod)
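By default, `augment()` on a `glm` object returns `.fitted` on the link (log-odds) scale; passing `type.predict = "response"` returns probabilities instead. A self-contained sketch with simulated data (the data `d` and model `m` here are hypothetical, used only to keep the example runnable without the Titanic file):

```r
library(broom)

# simulated binary response, for illustration only
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(-0.5 + 1.2 * d$x))
m <- glm(y ~ x, data = d, family = "binomial")

range(augment(m)$.fitted)                             # log-odds: any real number
range(augment(m, type.predict = "response")$.fitted)  # probabilities: between 0 and 1
```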

The Estimated Logistic Regression Model

tidy(logit_mod)
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1

\[\text{log}\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50~sex - 0.00206~age\] \[\hat{p} = \frac{e^{1.11 - 2.50~sex - 0.00206~age}}{1+e^{1.11 - 2.50~sex - 0.00206~age}}\]

Here \(sex\) is an indicator variable: 1 for male passengers and 0 for female passengers.

Interpreting coefficients

\[\text{log}\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50~sex - 0.00206~age\]

Holding sex constant, for every additional year of age, we expect the log-odds of survival to decrease by approximately 0.002.

Holding age constant, we expect males to have a log-odds of survival that is 2.50 lower than that of females.

Interpreting coefficients

\[\frac{\hat{p}}{1-\hat{p}} = e^{1.11 - 2.50~sex - 0.00206~age}\]

tidy(logit_mod) %>%
  mutate(estimate = exp(estimate))  # exponentiate estimates to the odds scale
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   3.05     0.208       5.34  9.05e- 8
## 2 sexmale       0.0821   0.168     -14.9   3.24e-50
## 3 age           0.998    0.00586    -0.351 7.25e- 1

Holding sex constant, for every one year increase in age, the odds of survival is expected to be multiplied by \(e^{-0.00206} = 0.998\).

Holding age constant, the odds of survival for males is \(e^{-2.50} = 0.082\) times the odds of survival for females.
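These multiplicative interpretations come straight from exponentiating the fitted coefficients (the values below are the slide's estimates, rounded):

```r
# exponentiating a log-odds coefficient gives the multiplicative change in the odds
beta_age <- -0.00206  # estimated coefficient for age
beta_sex <- -2.50     # estimated coefficient for sexmale

exp(beta_age)  # ~0.998: each extra year of age multiplies the odds of survival by 0.998
exp(beta_sex)  # ~0.082: male odds of survival are about 8% of female odds
```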

Predicted Probabilities

tibble(
  age = rep(0:80, times = 2),
  sex = rep(c("male", "female"), each = 81)
  ) %>% 
  augment(logit_mod, newdata = .) %>% 
  mutate(p = exp(.fitted) / (1 + exp(.fitted))) %>% 
  ggplot(aes(x = age, y = p, color = sex)) +
  geom_line() + 
  scale_color_viridis_d()

Specific probabilities

tibble(
  age = 31,
  sex = "male") %>% 
  augment(logit_mod, newdata = .) %>%
  mutate(p = exp(.fitted) / (1 + exp(.fitted))) %>%
  pull("p")
## [1] 0.1900059

Model Fit

glance(logit_mod)
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1         1183.     886  -458.  922.  936.     916.         884   887
Weaknesses

  • Logistic regression has assumptions: independence and linearity in the log-odds (some other methods require fewer assumptions)
  • If the predictors are correlated, coefficient estimates may be unreliable

Strengths

  • Coefficients can be transformed into odds ratios or predicted probabilities for interpretation
  • Handles numerical and categorical predictors
  • Can quantify uncertainty around a prediction
  • Can extend to more than 2 categories (multinomial regression)

Practice Problems

  1. Please add fare to the model. Interpret the coefficients for your variables using odds ratios.

  2. What is the predicted probability of survival for a 40-year-old man who paid 100 pounds? What if the fare increased to 500 pounds?

  3. Set age as being equal to its mean value. Then, create a predicted probability plot showing the effect of fare price for men and women. Describe what you see.

Sources