Main ideas
- Review and expand upon concepts from our first three regression classes.
- Learn how to carry out and interpret multiple linear regressions.
- Learn how to assess the conditions for inference in regression.
We’ll use the tidyverse, broom, fivethirtyeight, and viridis packages again, as well as the car package when we calculate variance inflation factors (VIFs) to examine whether our models have multicollinearity.
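As a reminder, the setup chunk might look something like the sketch below. Note that linear_reg(), set_engine(), and fit() used throughout these notes come from parsnip, so we assume tidymodels (or at least parsnip) is also loaded, even though it isn’t listed above.
library(tidyverse)       # ggplot2, dplyr, readr, ...
library(broom)           # tidy(), glance(), augment()
library(fivethirtyeight) # candy_rankings data
library(viridis)         # scale_color_viridis()
library(car)             # vif()
library(tidymodels)      # linear_reg(), set_engine(), fit() -- assumed, see note above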
Response Variable: Variable whose behavior or variation you are trying to understand, on the y-axis. Also called the dependent variable.
Explanatory Variable: Other variables that you want to use to explain the variation in the response, on the x-axis. Also called independent variables, predictors, or features.
Predicted value: Output of the model function
Residuals: Show how far each case is from its predicted value
\[ y = \beta_0 + \beta_1~x + \epsilon \]
Unfortunately, we can’t observe these population parameters, so we use sample statistics to estimate them:
\[ \hat{y} = b_0 + b_1~x \]
The regression line minimizes the sum of squared residuals.
Residuals: \(e_i = y_i - \hat{y}_i\),
The regression line minimizes \(\sum_{i = 1}^n e_i^2\).
Equivalently, minimizing \(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\)
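As a quick illustration of this criterion, here is a small sketch with made-up toy numbers (not the car data): it computes the sum of squared residuals for a candidate intercept and slope, which is exactly the quantity the regression line minimizes.
# Toy data (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Sum of squared residuals for a candidate intercept b0 and slope b1
ssr <- function(b0, b1) {
  sum((y - (b0 + b1 * x))^2)
}

ssr(0, 2)        # one candidate line
ssr(0.5, 1.9)    # another candidate line
coef(lm(y ~ x))  # the (b0, b1) pair that minimizes the sum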
sports_car_prices <- read.csv("sportscars.csv")
The file sportscars.csv contains prices for Porsche and Jaguar cars for sale on cars.com.
- car: car make (Jaguar or Porsche)
- price: price in USD
- age: age of the car in years
- mileage: previous miles driven
prices_model <- linear_reg() %>%
set_engine("lm") %>%
fit(price ~ age, data = sports_car_prices)
tidy(prices_model)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 53246. 3322. 16.0 5.70e-23
## 2 age -2149. 466. -4.62 2.22e- 5
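As a quick sanity check on these estimates, we could plug a hypothetical age into the fitted model; 10 years below is just an illustrative value.
# Predicted price for a hypothetical 10-year-old car:
# roughly 53246 - 2149 * 10, which predict() computes for us
predict(prices_model, new_data = tibble(age = 10))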
But is age the only variable that predicts price?
\[ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]
\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]
Let’s add a variable.
Does the relationship between price and age depend on type of car?
[Figure: price vs. age by car make, with linear fits from geom_smooth(method = "lm")]
m_main <- linear_reg() %>%
set_engine("lm") %>%
fit(price ~ age + car, data = sports_car_prices)
m_main %>%
tidy() %>%
select(term, estimate)
## # A tibble: 3 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 44310.
## 2 age -2487.
## 3 carPorsche 21648.
Linear model:
\[ \widehat{price} = 44310 - 2487~age + 21648~carPorsche \]
Plug in 0 for carPorsche to get the linear model for Jaguars.
Plug in 1 for carPorsche to get the linear model for Porsches.
Jaguar:
\[ \widehat{price} = 44310 - 2487~age + 21648 \times 0 = 44310 - 2487~age \]
Porsche:
\[ \widehat{price} = 44310 - 2487~age + 21648 \times 1 = 65958 - 2487~age \]
The rate of change in price as the age of the car increases does not depend on the make of the car (same slopes).
Porsches are consistently more expensive than Jaguars (different intercepts).
m_main_coefs <- m_main %>%
tidy() %>%
select(term, estimate)
m_main_coefs
## # A tibble: 3 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 44310.
## 2 age -2487.
## 3 carPorsche 21648.
All else held constant, for each additional year of a car’s age, the price of the car is predicted to decrease, on average, by $2,487.
All else held constant, Porsches are predicted, on average, to have a price that is $21,648 greater than Jaguars.
Jaguars that have an age of 0 are predicted, on average, to have a price of $44,310.
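One way to see where these numbers come from is to rebuild the two intercepts directly from m_main_coefs; this is just a sketch of the plug-in arithmetic above.
# Rebuild the make-specific intercepts from the saved coefficients
b <- m_main_coefs$estimate  # (Intercept), age, carPorsche, in that order

b[1]         # Jaguar intercept: 44310
b[1] + b[3]  # Porsche intercept: 44310 + 21648 = 65958
b[2]         # common slope for both makes: -2487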
Question: Why is our linear regression model different from what we got from geom_smooth(method = "lm")?
car is the only variable in our model that affects the intercept.
The model we specified assumes Jaguars and Porsches have the same slope and different intercepts.
What is the most appropriate model for these data?
Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
This means that the relationship between an explanatory variable and the response depends on another explanatory variable.
We can accomplish this by adding an interaction variable. This is the product of two explanatory variables (also sometimes called an interaction term).
ggplot(data = sports_car_prices,
aes(x = age, y = price, color = car)) +
scale_color_viridis(discrete = TRUE, option = "D", name = "Type of Car") +
labs(x = "Age (years)", y = "Price (USD)", color = "Car Make") +
geom_point() +
geom_smooth(method="lm", se = FALSE)
m_int <- linear_reg() %>%
set_engine("lm") %>%
fit(price ~ age + car + age * car,
data = sports_car_prices)
m_int %>%
tidy() %>%
select(term, estimate)
## # A tibble: 4 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 56988.
## 2 age -5040.
## 3 carPorsche 6387.
## 4 age:carPorsche 2969.
\[\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche\]
Plug in 0 for carPorsche to get the linear model for Jaguars.
Plug in 1 for carPorsche to get the linear model for Porsches.
Jaguar:
\[\begin{align}\widehat{price} &= 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche \\ &= 56988 - 5040~age + 6387 \times 0 + 2969~age \times 0\\ &= 56988 - 5040~age\end{align}\]
Porsche:
\[\begin{align}\widehat{price} &= 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche \\ &= 56988 - 5040~age + 6387 \times 1 + 2969~age \times 1\\ &= 63375 - 2071~age\end{align}\]
Jaguar: \[\widehat{price} = 56988 - 5040~age\]
Porsche: \[\widehat{price} = 63375 - 2071~age\]
Rate of change in price as the age of the car increases depends on the make of the car (different slopes).
Porsches are consistently more expensive than Jaguars (different intercepts).
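We can do the same plug-in arithmetic with the interaction model’s coefficients; here is a sketch that rebuilds both make-specific intercepts and slopes.
# Rebuild the Jaguar and Porsche lines from the interaction model
b_int <- tidy(m_int)$estimate  # (Intercept), age, carPorsche, age:carPorsche

c(intercept = b_int[1], slope = b_int[2])                        # Jaguar
c(intercept = b_int[1] + b_int[3], slope = b_int[2] + b_int[4])  # Porsche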
- With more complex (e.g., third-order) interactions, interpretation becomes trickier: slopes are conditional on the values of the other explanatory variables.
- Can you include them? Yes.
- Should you? Probably not if you want to interpret these interactions in the context of the data.
You don’t want the predictors in a multiple regression model to be too highly correlated with each other. When they are, you have multicollinearity. One way to diagnose multicollinearity is with variance inflation factors (VIFs). There’s no universal cutoff, but a VIF of 10 is sometimes used as a rule of thumb.
Let’s see if we have multicollinearity in our first model. (Please do not run linear regressions this way on the exam; I’m fitting with lm() here only to demonstrate vif() from the car package.)
m_main_2 <- lm(price ~ age + car, data = sports_car_prices)
tibble(vif(m_main_2))
## # A tibble: 2 × 1
## `vif(m_main_2)`
## <dbl>
## 1 1.02
## 2 1.02
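To see what these numbers measure, here is a sketch of the VIF for age computed by hand: regress age on the other predictor and plug the resulting R-squared into 1 / (1 - R^2). It should match the value reported by vif() above.
# VIF for age "by hand": 1 / (1 - R^2) from regressing age on the other predictor
r2_age <- summary(lm(age ~ car, data = sports_car_prices))$r.squared
1 / (1 - r2_age)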
Now, let’s check the interaction model.
m_int_2 <- lm(price ~ age + car + age * car,
data = sports_car_prices)
tibble(vif(m_int_2))
## # A tibble: 3 × 1
## `vif(m_int_2)`
## <dbl>
## 1 7.27
## 2 3.85
## 3 11.2
Notice the VIFs here are higher. This is to be expected with an interaction model.
Question: Why do you think VIFs will be higher in interaction models?
Yes: the interaction term is built from the other variables, so it is likely to be correlated with them. This is normal.
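A quick way to check this ourselves is to look at the correlation between the interaction column and its components; the sketch below assumes the car column is coded as "Jaguar"/"Porsche", so car == "Porsche" reproduces the 0/1 indicator.
# The interaction column is built from age and the carPorsche indicator,
# so it tends to be correlated with both of them
sports_car_prices %>%
  mutate(carPorsche = as.numeric(car == "Porsche"),
         interaction = age * carPorsche) %>%
  summarise(cor_with_age = cor(interaction, age),
            cor_with_carPorsche = cor(interaction, carPorsche))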
Let’s return to the candy rankings data:
Notice below that our p-value for the sugar percentile variable is REALLY CLOSE to 0.05. While arbitrary, this threshold is important in statistics and is sometimes used as the cutoff to determine whether results are publishable.
candy_rankings <- candy_rankings %>%
mutate(sugarpercent100 = sugarpercent * 100)
candy <- linear_reg() %>%
set_engine("lm") %>%
fit(winpercent ~ sugarpercent100 + chocolate, data = candy_rankings)
tidy(candy)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 38.3 2.55 15.0 3.31e-25
## 2 sugarpercent100 0.0857 0.0435 1.97 5.25e- 2
## 3 chocolateTRUE 18.3 2.47 7.40 1.06e-10
FiveThirtyEight has an interactive page showing how you can p-hack your way to publication.
How could we p-hack our way to a significant result for the sugar variable?
Try adding the variable for whether the candy is a hard candy.
candy <- linear_reg() %>%
set_engine("lm") %>%
fit(winpercent ~ sugarpercent100 + chocolate + hard, data = candy_rankings)
tidy(candy)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 39.3 2.64 14.9 6.05e-25
## 2 sugarpercent100 0.0944 0.0437 2.16 3.36e- 2
## 3 chocolateTRUE 16.9 2.63 6.44 7.98e- 9
## 4 hardTRUE -4.98 3.41 -1.46 1.48e- 1
We now have statistically significant results for our main variable of interest! But there are some data ethics issues with this approach.
Question: Why is p-hacking problematic?
This can result in reporting statistically significant results that don’t reflect a real effect, increasing the probability of a Type I error, among other concerns.
mileage <- linear_reg() %>%
set_engine("lm") %>%
fit(price ~ age + mileage, data = sports_car_prices)
tidy(mileage)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62950. 3176. 19.8 3.03e-27
## 2 age 516. 601. 0.859 3.94e- 1
## 3 mileage -0.695 0.122 -5.68 4.75e- 7
Slopes: All else equal, a one year increase in age predicts, on average, a $516 increase in price. This is not a statistically significant predictor.
All else equal, a one mile increase in mileage predicts, on average, a $0.695 decrease in price. This is a statistically significant predictor. Because a one mile change is so small, this is a variable where you might prefer a one standard deviation increase interpretation (see the sketch after these interpretations).
Intercept: A car with 0 miles that is 0 years old has a predicted price of $62,950.
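Here is a sketch of that one-standard-deviation interpretation: multiply the mileage slope by the standard deviation of mileage to get the predicted price change associated with a one-SD increase in mileage, holding age constant.
# Predicted price change for a one standard deviation increase in mileage
sd_mileage <- sd(sports_car_prices$mileage)
slope_mileage <- tidy(mileage) %>%
  filter(term == "mileage") %>%
  pull(estimate)

sd_mileage                  # one SD of mileage, in miles
slope_mileage * sd_mileage  # predicted change in price (USD)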
glance(mileage) %>%
pull(adj.r.squared)
## [1] 0.5167014
Accounting for the penalty for including more variables, about 51.7% of the variability in price can be explained by the predictors in the model.
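For comparison, a sketch of pulling the same quantity from the age-only model fit earlier; adjusted R-squared is what lets us compare models with different numbers of predictors.
# Adjusted R-squared for the two models fit on these data
glance(prices_model) %>% pull(adj.r.squared)  # price ~ age
glance(mileage) %>% pull(adj.r.squared)       # price ~ age + mileage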
mileage2 <- lm(price ~ age + mileage, data = sports_car_prices)
tibble(vif(mileage2))
## # A tibble: 2 × 1
## `vif(mileage2)`
## <dbl>
## 1 2.56
## 2 2.56
Now, please turn to the dataset in nycairquality.csv. This file contains daily air quality measurements in New York from May to September 1973, collected by the New York State Department of Conservation and the National Weather Service (Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983), Graphical Methods for Data Analysis. Belmont, CA: Wadsworth).
- Ozone: ozone (ppb)
- Solar.R: solar radiation (langleys)
- Wind: wind (mph)
- Temp: temperature (degrees F)
airquality <- read_csv("airquality.csv")
m_air <- linear_reg() %>%
set_engine("lm") %>%
fit(Ozone ~ Solar.R + Wind + Temp, data = airquality)
tidy(m_air)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -64.3 23.1 -2.79 0.00623
## 2 Solar.R 0.0598 0.0232 2.58 0.0112
## 3 Wind -3.33 0.654 -5.09 0.00000152
## 4 Temp 1.65 0.254 6.52 0.00000000242
Slopes: All else equal, a one langley increase in radiation predicts, on average, a 0.06 ppb increase in ozone.
All else equal, a one mph increase in wind speed predicts, on average, a 3.33 ppb decrease in ozone.
All else equal, a one degree Fahrenheit increase in temperature predicts, on average, a 1.65 ppb increase in ozone.
Intercept: If all variables were set equal to zero, we would predict an ozone level of -64.3 ppb. This is an extrapolation far outside the data and isn’t physically meaningful, since ozone concentrations can’t be negative.
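To drive home why the intercept is an extrapolation, compare a prediction at plausible summer values (the particular numbers below are hypothetical) with the all-zeros case, which just returns the intercept.
# Prediction at hypothetical but plausible values vs. the all-zeros extrapolation
predict(m_air, new_data = tibble(Solar.R = 200, Wind = 10, Temp = 80))
predict(m_air, new_data = tibble(Solar.R = 0, Wind = 0, Temp = 0))  # the intercept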