Learning goals
- Gain proficiency with multiple linear regression
- Review what a dummy variable is
To begin, we’ll work with two datasets: one with information on the prices of sports cars, and a new set of Pokemon data.
library(tidyverse)
library(tidymodels)

sports_car_prices <- read_csv("sportscars.csv")
pokemon <- read_csv("pokemon150.csv")
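Before modeling, it’s worth taking a quick look at each dataset. A minimal sketch (assuming both CSV files are in the working directory):

# Peek at the columns and types of each dataset
glimpse(sports_car_prices)
glimpse(pokemon)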
The linear model with one predictor
- Previously, we were interested in \(\beta_0\) (the population parameter for the intercept) and \(\beta_1\) (the population parameter for the slope) in the following population model:
\[ y = \beta_0 + \beta_1~x + \epsilon \]
- Sample model that we use to estimate the population model:
\[ \hat{y} = b_0 + b_1~x \]
An example
The file sportscars.csv contains prices for Porsche and Jaguar cars for sale on cars.com.
car: car make (Jaguar or Porsche)
price: price in USD
age: age of the car in years
mileage: previous miles driven
The linear model with a single predictor
prices_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age, data = sports_car_prices)
tidy(prices_model)
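The fitted model can also be used for prediction. A minimal sketch (the age of 10 years is a hypothetical input, just for illustration):

# Predicted price for a 10-year-old car
predict(prices_model, new_data = tibble(age = 10))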
But is age the only variable that predicts price?
The linear model with multiple predictors
\[ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]
- Sample model that we use to estimate the population model:
\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]
Let’s add a variable.
Multiple Regression
m_main <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age + car, data = sports_car_prices)

m_main %>%
  tidy() %>%
  select(term, estimate)
Linear model:
\[ \widehat{price} = 44310 - 2487~age + 21648~carPorsche \]
Jaguar: \[ \widehat{price} = 44310 - 2487~age + 21648 \times 0 \\ = 44310 - 2487~age \]
Porsche: \[ \widehat{price} = 44310 - 2487~age + 21648 \times 1 \\ = 65958 - 2487~age \]
- The rate of change in price as the age of the car increases does not depend on the make of the car (same slopes).
- Porsches are consistently more expensive than Jaguars (different intercepts).
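One way to see the same-slopes, different-intercepts structure is to plot the fitted lines by make. A sketch, assuming a parsnip version that provides augment() for fitted models:

# Overlay the two parallel fitted lines on the scatterplot
augment(m_main, new_data = sports_car_prices) %>%
  ggplot(aes(x = age, color = car)) +
  geom_point(aes(y = price)) +
  geom_line(aes(y = .pred))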
Main effects, numerical and categorical predictors
m_main_coefs <- m_main %>%
  tidy() %>%
  select(term, estimate)

m_main_coefs
All else held constant, for each additional year of a car’s age, the price of the car is predicted to decrease, on average, by $2,487.
All else held constant, Porsches are predicted, on average, to have a price that is $21,648 greater than Jaguars.
Jaguars that have an age of 0 are predicted, on average, to have a price of $44,310.
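These interpretations can be sanity-checked with predictions. A minimal sketch comparing a Jaguar and a Porsche at the same hypothetical age; the difference between the two predictions should equal the carPorsche coefficient:

# Predicted prices for a 5-year-old Jaguar and a 5-year-old Porsche
predict(m_main, new_data = tibble(age = 5, car = c("Jaguar", "Porsche")))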
\(R^2\) and adjusted \(R^2\)
The strength of the fit of a linear model is commonly evaluated using \(R^2\).
It tells us what percentage of the variability in the response variable is explained by the model. The remainder of the variability is unexplained.
Please recall:
- We can write \(R^2\) using the following ratio of sums of squares:
\[ R^2 = 1 - \left( \frac{SS_{Error}}{SS_{Total}} \right) \]
where \(SS_{Error}\) is the sum of squared residuals and \(SS_{Total}\) is the total variability in the response variable.
Adjusted \(R^2\)
\[ R^2_{adj} = 1 - \left( \frac{SS_{Error}}{SS_{Total}} \times \frac{n - 1}{n - k - 1} \right), \]
where \(n\) is the number of observations and \(k\) is the number of predictors in the model.
Adjusted \(R^2\) does not increase when a new variable provides no new information or is completely unrelated to the response, and it can even decrease.
This makes adjusted \(R^2\) a preferable metric for model selection in multiple regression models.
Let’s find the \(R^2\) and adjusted \(R^2\) for the m_main model we built.
glance(m_main)$r.squared
## [1] 0.6071375
glance(m_main)$adj.r.squared
## [1] 0.5933529
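As a check on the formula above, adjusted \(R^2\) can be computed by hand from the residuals. A sketch, relying on the fact that parsnip stores the underlying lm object in m_main$fit; this should reproduce the adj.r.squared value above:

n <- nrow(sports_car_prices)                  # number of observations
k <- 2                                        # predictors: age and the car dummy
ss_error <- sum(residuals(m_main$fit)^2)      # sum of squared residuals
ss_total <- sum((sports_car_prices$price - mean(sports_car_prices$price))^2)
1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)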
Exercises
Now, let’s do some exercises with the Pokemon data.
Exercise 1)
Are height and weight correlated with a Pokemon’s hit points? Run a bivariate linear regression model for each of these. Do you find statistically significant results?
linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ height_m, data = pokemon) %>%
  tidy()

linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ weight_kg, data = pokemon) %>%
  tidy()
In each model, these predictors are found to be statistically significant.
Exercise 2)
Other variables may be correlated with hp, e.g. a Pokemon’s legendary status. Do legendary Pokemon have higher hp than non-legendary Pokemon? Compare mean hp between groups to support your answer.
pokemon %>%
  group_by(is_legendary) %>%
  summarize(mean_hp = mean(hp), n = n())
Write down a model to predict a Pokemon’s hit points based on its height, weight, and legendary status (use \(x\), \(y\), \(\beta\) notation). Define each variable.
\(y\): hit points (hp)
\(x_1\): height (height_m)
\(x_2\): weight (weight_kg)
\(x_3\): legendary status (is_legendary)

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
Exercise 3)
Use tidymodels syntax to build a linear model with all three variables and estimate each \(\beta\). Then, find and interpret the adjusted \(R^2\).
pokemon_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ height_m + weight_kg + is_legendary, data = pokemon)

pokemon_model %>%
  tidy()
glance(pokemon_model)$r.squared
## [1] 0.3439075
glance(pokemon_model)$adj.r.squared
## [1] 0.3304261
Interpret the meaning of your estimates and write a brief description below.
Intercept: the intercept tells us the approximate hp of a non-legendary Pokemon with a height of 0 m and a weight of 0 kg. A height and weight of zero are meaningless in context.
Slopes: all else held constant,
- the height_m coefficient tells us that a Pokemon that is 1 m taller is predicted to have, on average, 5.1 more hp.
- the weight_kg coefficient tells us that a Pokemon is predicted to have, on average, 0.08 more hp per additional kg of weight.
- the is_legendary coefficient tells us that a legendary Pokemon is predicted to have, on average, 5.3 more hp than a non-legendary Pokemon.
Adjusted \(R^2\): about 33% of the variation in Pokemon hit points is explained by these three variables, after accounting for the adjusted \(R^2\) penalty on the number of predictors.
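To connect this to the earlier point about adjusted \(R^2\) and model selection, we can compare adjusted \(R^2\) across models. A sketch that refits the height-only model from Exercise 1 (m_height is a name introduced here for illustration):

m_height <- linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ height_m, data = pokemon)

# Adjusted R^2 penalizes extra predictors, so this is a fairer comparison
glance(m_height)$adj.r.squared
glance(pokemon_model)$adj.r.squared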
Exercise 4)
Some think that certain Pokemon types have higher hp than others.
First, explore whether there is a different mean_hp by type_1.
pokemon %>%
  group_by(type_1) %>%
  summarize(mean_hp = mean(hp), n = n())
Then, please construct a linear model in R to determine the effect of Pokemon type on hp.
linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ type_1, data = pokemon) %>%
  tidy()
Interpret the meaning of your estimates and write a brief description below. Are any types missing?
The intercept tells us the approximate baseline hp of a bug-type Pokemon (bug is the baseline by default because it is the first level alphabetically, so bug does not appear as its own coefficient).
The type_1 coefficients tell us the average difference in hp between Pokemon of a given type and the baseline (bug) type.
Because the category type_1 has many levels, it is encoded as several dummy variables: each type becomes a yes/no indicator for each Pokemon.
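To see the dummy encoding directly, we can inspect the design matrix that R builds from the formula. A minimal sketch using base R’s model.matrix():

# Each type_1 level (except the baseline, bug) gets its own 0/1 column
model.matrix(hp ~ type_1, data = pokemon) %>%
  head()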
Given the output of your code, write down the linear model (use \(x\), \(y\), \(\beta\) notation). Define each variable.
\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_{17} x_{17} + \epsilon\)
\(x_1\) through \(x_{17}\) are indicator (dummy) variables telling us whether or not a Pokemon is a dark type, and so on, all the way through water type.