library(tidyverse)
library(tidymodels)
library(scatterplot3d)


To begin, we’ll work with two datasets: one with information on the prices of sports cars, and a new set of Pokemon data.

sports_car_prices <- read_csv("sportscars.csv")
pokemon <- read_csv("pokemon150.csv")

The linear model with one predictor

  • Previously, we were interested in \(\beta_0\) (the population parameter for the intercept) and \(\beta_1\) (the population parameter for the slope) in the following model:

\[ y = \beta_0 + \beta_1~x + \epsilon \]

  • Unfortunately, we can’t observe these population parameters directly

  • So we use sample statistics to estimate them:

\[ \hat{y} = b_0 + b_1~x \]


An example

The file sportscars.csv contains prices for Porsche and Jaguar cars for sale on cars.com.

  • car: car make (Jaguar or Porsche)

  • price: price in USD

  • age: age of the car in years

  • mileage: previous miles driven

The linear model with a single predictor

prices_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age, data = sports_car_prices)
tidy(prices_model)

But is age the only variable that predicts price?

The linear model with multiple predictors

\[ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]

  • Sample model that we use to estimate the population model:

\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]

Let’s add a variable.

Multiple Regression

m_main <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age + car, data = sports_car_prices)
m_main %>%
  tidy() %>%
  select(term, estimate)

Linear model:

\[ \widehat{price} = 44310 - 2487~age + 21648~carPorsche \]

  • Plug in 0 for carPorsche to get the linear model for Jaguars.

  • Plug in 1 for carPorsche to get the linear model for Porsches.

  • Jaguar:

\[ \widehat{price} = 44310 - 2487~age + 21648 \times 0 \\ = 44310 - 2487~age \]

  • Porsche:

\[ \widehat{price} = 44310 - 2487~age + 21648 \times 1 \\ = 65958 - 2487~age \]

  • The rate of change in price as the age of the car increases does not depend on the make of the car (same slope)

  • Porsches are consistently more expensive than Jaguars (different intercepts), as the sketch below illustrates
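
Since m_main and the data are already in memory, we can sanity-check the plug-in arithmetic and picture the parallel lines. This is a minimal sketch: the age of 5 years is an arbitrary illustrative choice, and predict() on a tidymodels fit returns its predictions in a .pred column.

# predicted price of a 5-year-old car of each make
new_cars <- tibble(age = 5, car = c("Jaguar", "Porsche"))
predict(m_main, new_data = new_cars)

# plot one fitted line per make: same slope, different intercepts
sports_car_prices %>%
  mutate(.pred = predict(m_main, new_data = sports_car_prices)$.pred) %>%
  ggplot(aes(x = age, y = price, color = car)) +
  geom_point() +
  geom_line(aes(y = .pred))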

Main effects, numerical and categorical predictors

# store the coefficient estimates for later reference
m_main_coefs <- m_main %>%
  tidy() %>%
  select(term, estimate)
m_main_coefs

  • All else held constant, for each additional year of a car’s age, the price of the car is predicted to decrease, on average, by $2,487.

  • All else held constant, Porsches are predicted, on average, to have a price that is $21,648 greater than Jaguars.

  • Jaguars that have an age of 0 are predicted, on average, to have a price of $44,310.

Adjusted R-Squared

Please recall:

  • We can write explained variation using the following ratio of sums of squares:

\[ R^2 = 1 - \left( \frac{SS_{Error}}{SS_{Total}} \right) \]

where \(SS_{Error}\) is the sum of squared residuals and \(SS_{Total}\) is the total sum of squares of the response variable.

Adjusted \(R^2\)

\[ R^2_{adj} = 1 - \left( \frac{SS_{Error}}{SS_{Total}} \times \frac{n - 1}{n - k - 1} \right), \]

where \(n\) is the number of observations and \(k\) is the number of predictors in the model.

  • Adjusted \(R^2\) does not increase when a new variable adds no information; if the variable is unrelated to the response, adjusted \(R^2\) can even decrease (the sketch below demonstrates this with a pure-noise predictor).

  • This makes adjusted \(R^2\) a preferable metric for model selection in multiple regression models.
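
To see the penalty in action, here is a quick sketch that refits m_main with a pure-noise predictor tacked on. The noise column is synthetic, created only for this illustration; \(R^2\) can only go up when a predictor is added, but adjusted \(R^2\) will typically go down here.

# add a predictor that is pure random noise, then refit
set.seed(1234)
m_noise <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age + car + noise,
      data = sports_car_prices %>% mutate(noise = rnorm(n())))
glance(m_noise)$adj.r.squared  # compare with the adjusted R^2 of m_main below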

Let’s find the \(R^2\) and adjusted \(R^2\) for the m_main model we built.

glance(m_main)$r.squared
## [1] 0.6071375
glance(m_main)$adj.r.squared
## [1] 0.5933529
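
As a quick check, we can reproduce the adjusted value by hand from the formula above, using the fact that \(SS_{Error}/SS_{Total} = 1 - R^2\). This sketch assumes the data have no missing values, so nrow() gives \(n\); the model has \(k = 2\) predictors (age and car).

r2 <- glance(m_main)$r.squared
n <- nrow(sports_car_prices)  # number of observations
k <- 2                        # predictors: age and car
1 - (1 - r2) * (n - 1) / (n - k - 1)  # should match adj.r.squared above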

Exercises

Now, let’s do some exercises with the Pokemon data.

Exercise 1)

Are height and weight correlated with a Pokemon’s hit points? Run a bivariate linear regression model for each of these. Do you find statistically significant results?

linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ height_m, data = pokemon) %>%
  tidy()

linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ weight_kg, data = pokemon) %>%
  tidy()

In each model, these predictors are found to be statistically significant.

Exercise 2)

Other variables may be correlated with hp, e.g. a pokemon’s legendary status.

Do legendary pokemon have higher hp than non-legendary pokemon? Compare mean hp between groups to support your answer.

pokemon %>%
  group_by(is_legendary) %>%
  summarize(mean_hp = mean(hp), n = n())

Write down a model to predict a Pokemon’s hit points based on its height, weight, and legendary status (use \(x\), \(y\), \(\beta\) notation). Define each variable.

  • \(y\): hit points (hp)

  • \(x_1\): height (height_m)

  • \(x_2\): weight (weight_kg)

  • \(x_3\): legendary status (is_legendary)

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)

Exercise 3)

Use tidymodel syntax to build a linear model with all three variables and estimate each \(\beta\). Then, find and interpret the adjusted \(R^2\).

# fit a multiple regression of hp on all three predictors
pokemon_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ height_m + weight_kg + is_legendary, data = pokemon)
pokemon_model %>%
  tidy()
glance(pokemon_model)$r.squared
## [1] 0.3439075
glance(pokemon_model)$adj.r.squared
## [1] 0.3304261

Interpret the meaning of your estimates and write a brief description below.

Intercept: The intercept tells us the predicted hp of a Pokemon with a height of 0 m, a weight of 0 kg, and non-legendary status. This is not meaningful in context.

Slopes:

All else held constant:

  • the height_m coefficient tells us that a Pokemon that is 1 m taller is predicted, on average, to have 5.1 more hp.

  • the weight_kg coefficient tells us that each additional kilogram of weight is associated, on average, with an extra 0.08 hp.

  • the is_legendary coefficient tells us that legendary Pokemon are predicted, on average, to have 5.3 more hp than non-legendary Pokemon.

Adjusted \(R^2\): About 33% of the variation in Pokemon hit points is explained by these three variables, after the adjusted \(R^2\) penalty for the number of predictors is applied.

Exercise 4)

Some think that certain pokemon types have higher hp than others.

First, explore whether there is a difference in mean hp by type_1.

pokemon %>%
  group_by(type_1) %>%
  summarize(mean_hp = mean(hp), n = n())

Then construct a linear model in R to estimate the effect of Pokemon type on hp.

# regress hp on primary type (type_1)
linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ type_1, data = pokemon) %>%
  tidy()

Interpret the meaning of your estimates and write a brief description below. Are any types missing?

  • the intercept tells us the average hp of the baseline type, bug (used as the reference level by default because it comes first alphabetically).

  • each type_1 coefficient tells us the average difference in hp between that type and the baseline (bug) type.

Because the category type_1 has many levels, it is encoded as a set of dummy (indicator) variables: every type except the baseline gets its own 0/1 variable recording whether a Pokemon is of that type.
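
A quick way to see this encoding is base R’s model.matrix(), which prints the 0/1 columns that lm() builds behind the scenes; the baseline level (bug) gets no column of its own.

# first few rows of the dummy-variable encoding for type_1
model.matrix(hp ~ type_1, data = pokemon) %>%
  head()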

Given the output of your code, write down the linear model (use \(x\), \(y\), \(\beta\) notation). Define each variable.

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_{17} x_{17} + \epsilon\)

\(x_1\) through \(x_{17}\) are indicator (0/1) variables telling us whether or not a Pokemon is a dark type, and so on, all the way through water type.
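
To confirm which types are present, and that bug is the alphabetically first level absorbed into the intercept, we can list the distinct levels; a type that never appears in the data would have no dummy variable at all.

# list the primary types that actually appear in the data
pokemon %>%
  distinct(type_1) %>%
  arrange(type_1)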