Learning goals
- Gain proficiency with multiple linear regression
- Review what a dummy variable is
To begin, we’ll work with two datasets: one with information on the prices of sports cars, and a new set of Pokemon data.
library(tidyverse)
library(tidymodels)

sports_car_prices <- read_csv("sportscars.csv")
pokemon <- read_csv("pokemon150.csv")
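Before modeling, it’s worth taking a quick look at each dataset. A minimal sketch (assuming both CSV files are in the working directory):

# Peek at the columns and types of each dataset
glimpse(sports_car_prices)
glimpse(pokemon)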
The linear model with one predictor
- Previously, we were interested in \(\beta_0\) (the population parameter for the intercept) and \(\beta_1\) (the population parameter for the slope) in the following population model:
\[ y = \beta_0 + \beta_1~x + \epsilon \]
- Sample model that we use to estimate the population model:
\[ \hat{y} = b_0 + b_1~x \]
An example
The file sportscars.csv contains prices for Porsche and Jaguar cars for sale on cars.com.
car: car make (Jaguar or Porsche)
price: price in USD
age: age of the car in years
mileage: previous miles driven
The linear model with a single predictor
prices_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age, data = sports_car_prices)
tidy(prices_model)
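The fitted model can also be used for prediction. A minimal sketch (the age of 10 years is a hypothetical input, just for illustration):

# Predicted price for a 10-year-old car
predict(prices_model, new_data = tibble(age = 10))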
But is age the only variable that predicts price?
The linear model with multiple predictors
\[ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]
- Sample model that we use to estimate the population model:
\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]
Let’s add a variable.
Multiple Regression
m_main <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age + car, data = sports_car_prices)

m_main %>%
  tidy() %>%
  select(term, estimate)
Linear model:
\[ \widehat{price} = 44310 - 2487~age + 21648~carPorsche \]
Jaguar: \[ \widehat{price} = 44310 - 2487~age + 21648 \times 0 \\ = 44310 - 2487~age \]
Porsche: \[ \widehat{price} = 44310 - 2487~age + 21648 \times 1 \\ = 65958 - 2487~age \]
- The rate of change in price as the age of the car increases does not depend on the make of the car (same slopes).
- Porsches are consistently more expensive than Jaguars (different intercepts).
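One way to see the same-slopes, different-intercepts structure is to plot the fitted lines by make. A sketch, assuming a parsnip version that provides augment() for fitted models:

# Overlay the two parallel fitted lines on the scatterplot
augment(m_main, new_data = sports_car_prices) %>%
  ggplot(aes(x = age, color = car)) +
  geom_point(aes(y = price)) +
  geom_line(aes(y = .pred))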
Main effects, numerical and categorical predictors
m_main_coefs <- m_main %>%
  tidy() %>%
  select(term, estimate)

m_main_coefs
All else held constant, for each additional year of a car’s age, the price of the car is predicted to decrease, on average, by $2,487.
All else held constant, Porsches are predicted, on average, to have a price that is $21,648 greater than Jaguars.
Jaguars that have an age of 0 are predicted, on average, to have a price of $44,310.
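These interpretations can be sanity-checked with predictions. A minimal sketch comparing a Jaguar and a Porsche at the same hypothetical age; the difference between the two predictions should equal the carPorsche coefficient:

# Predicted prices for a 5-year-old Jaguar and a 5-year-old Porsche
predict(m_main, new_data = tibble(age = 5, car = c("Jaguar", "Porsche")))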
\(R^2\) and adjusted \(R^2\)
The strength of the fit of a linear model is commonly evaluated using \(R^2\).
It tells us what percentage of the variability in the response variable is explained by the model. The remainder of the variability is unexplained.
Please recall:
- We can write \(R^2\) using the following ratio of sums of squares:
\[ R^2 = 1 - \left( \frac{SS_{Error}}{SS_{Total}} \right) \]
where \(SS_{Error}\) is the sum of squared residuals and \(SS_{Total}\) is the total variability in the response variable.
Adjusted \(R^2\)
\[ R^2_{adj} = 1 - \left( \frac{SS_{Error}}{SS_{Total}} \times \frac{n - 1}{n - k - 1} \right), \]
where \(n\) is the number of observations and \(k\) is the number of predictors in the model.
Adjusted \(R^2\) does not increase when a new variable provides no new information or is completely unrelated to the response, and it can even decrease.
This makes adjusted \(R^2\) a preferable metric for model selection in multiple regression models.
Let’s find the \(R^2\) and adjusted \(R^2\) for the m_main model we built.
glance(m_main)$r.squared
## [1] 0.6071375
glance(m_main)$adj.r.squared
## [1] 0.5933529
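As a check on the formula above, adjusted \(R^2\) can be computed by hand from the residuals. A sketch, relying on the fact that parsnip stores the underlying lm object in m_main$fit; this should reproduce the adj.r.squared value above:

n <- nrow(sports_car_prices)                  # number of observations
k <- 2                                        # predictors: age and the car dummy
ss_error <- sum(residuals(m_main$fit)^2)      # sum of squared residuals
ss_total <- sum((sports_car_prices$price - mean(sports_car_prices$price))^2)
1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)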
Exercises
Now, let’s do some exercises with the Pokemon data.
Exercise 1)
Are height and weight correlated with a Pokemon’s hit points? Run a bivariate linear regression model for each of these. Do you find statistically significant results?
linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ height_m, data = pokemon) %>%
  tidy()

linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ weight_kg, data = pokemon) %>%
  tidy()
In each model, these predictors are found to be statistically significant.
Exercise 2)
Other variables may be correlated with hp, e.g. a Pokemon’s legendary status. Do legendary Pokemon have higher hp than non-legendary Pokemon? Compare mean hp between groups to support your answer.
pokemon %>%
  group_by(is_legendary) %>%
  summarize(mean_hp = mean(hp), n = n())
Write down a model to predict a Pokemon’s hit points based on its height, weight, and legendary status (use \(x\), \(y\), \(\beta\) notation). Define each variable.
\(y\): hit points (hp)
\(x_1\): height (height_m)
\(x_2\): weight (weight_kg)
\(x_3\): legendary status (is_legendary)

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
Exercise 3)
Use tidymodels syntax to build a linear model with all three variables and estimate each \(\beta\). Then, find and interpret the adjusted \(R^2\).
pokemon_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ height_m + weight_kg + is_legendary, data = pokemon)

pokemon_model %>%
  tidy()
glance(pokemon_model)$r.squared
## [1] 0.3439075
glance(pokemon_model)$adj.r.squared
## [1] 0.3304261
Interpret the meaning of your estimates and write a brief description below.
Intercept: the intercept tells us the approximate hp of a non-legendary Pokemon with a height of 0 m and a weight of 0 kg. A height and weight of zero are meaningless in context.
Slopes: all else held constant,
- the height_m coefficient tells us that a Pokemon that is 1 m taller is predicted to have, on average, 5.1 more hp.
- the weight_kg coefficient tells us that a Pokemon is predicted to have, on average, 0.08 more hp per additional kg of weight.
- the is_legendary coefficient tells us that a legendary Pokemon is predicted to have, on average, 5.3 more hp than a non-legendary Pokemon.
Adjusted \(R^2\): about 33% of the variation in Pokemon hit points is explained by these three variables, after accounting for the adjusted \(R^2\) penalty on the number of predictors.
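To connect this to the earlier point about adjusted \(R^2\) and model selection, we can compare adjusted \(R^2\) across models. A sketch that refits the height-only model from Exercise 1 (m_height is a name introduced here for illustration):

m_height <- linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ height_m, data = pokemon)

# Adjusted R^2 penalizes extra predictors, so this is a fairer comparison
glance(m_height)$adj.r.squared
glance(pokemon_model)$adj.r.squared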
Exercise 4)
Some think that certain Pokemon types have higher hp than others.
First, explore whether there is a different mean_hp by type_1.
pokemon %>%
  group_by(type_1) %>%
  summarize(mean_hp = mean(hp), n = n())
Then, please construct a linear model in R to determine the effect of Pokemon type on hp.
linear_reg() %>%
  set_engine("lm") %>%
  fit(hp ~ type_1, data = pokemon) %>%
  tidy()
Interpret the meaning of your estimates and write a brief description below. Are any types missing?
The intercept tells us the approximate baseline hp of a bug-type Pokemon (bug is the baseline by default because it is the first level alphabetically, so bug does not appear as its own coefficient).
The type_1 coefficients tell us the average difference in hp between Pokemon of a given type and the baseline (bug) type.
Because the category type_1 has many levels, it is encoded as several dummy variables: each type becomes a yes/no indicator for each Pokemon.
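To see the dummy encoding directly, we can inspect the design matrix that R builds from the formula. A minimal sketch using base R’s model.matrix():

# Each type_1 level (except the baseline, bug) gets its own 0/1 column
model.matrix(hp ~ type_1, data = pokemon) %>%
  head()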
Given the output of your code, write down the linear model (use \(x\), \(y\), \(\beta\) notation). Define each variable.
\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_{17} x_{17} + \epsilon\)
\(x_1\) through \(x_{17}\) are indicator (dummy) variables telling us whether or not a Pokemon is a dark type, and so on, all the way through water type.