Coming up

Main ideas

Packages

We’ll use the tidyverse, tidymodels, broom, fivethirtyeight, and viridis packages again, as well as the car package when we calculate variance inflation factors (VIFs) to examine whether our models have multicollinearity.

Please recall

The linear model with a single predictor

\[ y = \beta_0 + \beta_1~x + \epsilon \]

\[ \hat{y} = b_0 + b_1~x \]

Least squares regression

The regression line minimizes the sum of squared residuals.

Data: A Closer Look

sports_car_prices <- read.csv("sportscars.csv")

The file sportscars.csv contains prices for Porsche and Jaguar cars for sale on cars.com.

car: car make (Jaguar or Porsche)

price: price in USD

age: age of the car in years

mileage: previous miles driven

The linear model with a single predictor

prices_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age, data = sports_car_prices)
tidy(prices_model)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   53246.     3322.     16.0  5.70e-23
## 2 age           -2149.      466.     -4.62 2.22e- 5
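
As a quick check of the least squares idea above, here is a small sketch (assuming prices_model and sports_car_prices from above; this is not part of the original output): compute the fitted values with predict(), then the residuals, then the sum of squared residuals that the regression line minimizes.

sports_car_prices %>%
  bind_cols(predict(prices_model, new_data = sports_car_prices)) %>%  # adds .pred
  mutate(residual = price - .pred) %>%
  summarize(sum_sq_resid = sum(residual^2))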

But is age the only variable that predicts price?

The linear model with multiple predictors

\[ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]

\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]

Let’s add a variable.

Price vs. age and type of car

Does the relationship between price and age depend on type of car?

[Figure: scatterplot of price vs. age, colored by type of car, with a separate linear fit for each group]

Modeling with main effects

m_main <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age + car, data = sports_car_prices)
m_main %>%
  tidy() %>%
  select(term, estimate)
## # A tibble: 3 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   44310.
## 2 age           -2487.
## 3 carPorsche    21648.

Linear model:

\[ \widehat{price} = 44310 - 2487~age + 21648~carPorsche \]

For a Jaguar (the baseline level, carPorsche = 0):

\[\begin{align}\widehat{price} &= 44310 - 2487~age + 21648 \times 0 \\ &= 44310 - 2487~age\end{align}\]

For a Porsche (carPorsche = 1):

\[\begin{align}\widehat{price} &= 44310 - 2487~age + 21648 \times 1 \\ &= 65958 - 2487~age\end{align}\]

The two lines have the same slope and differ only in their intercepts.

Interpretation of main effects

All else equal, each additional year of age predicts, on average, a $2,487 decrease in price.

All else equal, a Porsche is predicted to cost, on average, $21,648 more than a Jaguar of the same age.

Intercept: A 0-year-old Jaguar has a predicted price of $44,310.
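
To make the parallel-lines idea concrete, here is a small sketch (assuming m_main from above; the age of 10 is just an illustrative value): the two predicted prices differ by the carPorsche coefficient, about $21,648, at any age.

# Predicted prices for two 10-year-old cars, one of each make
new_cars <- tibble(age = 10, car = c("Jaguar", "Porsche"))
predict(m_main, new_data = new_cars)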

Main effects, numerical and categorical predictors

m_main_coefs <- m_main %>%
  tidy() %>%
  select(term, estimate)
m_main_coefs
## # A tibble: 3 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   44310.
## 2 age           -2487.
## 3 carPorsche    21648.
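
One reason to save the coefficients as a tibble is that you can then pull out individual estimates with ordinary dplyr verbs; a small sketch (assuming m_main_coefs from above):

# Pull a single coefficient, e.g. the Porsche offset, for reuse in text or calculations
m_main_coefs %>%
  filter(term == "carPorsche") %>%
  pull(estimate)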

What went wrong?

Question: Why is our linear regression model different from what we got from geom_smooth(method = "lm")?

Interacting explanatory variables

Modeling with interaction effects

ggplot(data = sports_car_prices, 
       aes(x = age, y = price, color = car)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  scale_color_viridis(discrete = TRUE, option = "D", name = "Type of Car") + 
  labs(x = "Age (years)", y = "Price (USD)")
## `geom_smooth()` using formula 'y ~ x'

m_int <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age + car + age * car, 
            data = sports_car_prices) 
m_int %>%
  tidy() %>%
  select(term, estimate)
## # A tibble: 4 × 2
##   term           estimate
##   <chr>             <dbl>
## 1 (Intercept)      56988.
## 2 age              -5040.
## 3 carPorsche        6387.
## 4 age:carPorsche    2969.

\[\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche\]

Interpretation of interaction effects

\[\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche\]

For a Jaguar (carPorsche = 0):

\[\begin{align}\widehat{price} &= 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche \\ &= 56988 - 5040~age + 6387 \times 0 + 2969~age \times 0\\ &= 56988 - 5040~age\end{align}\]

For a Porsche (carPorsche = 1):

\[\begin{align}\widehat{price} &= 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche \\ &= 56988 - 5040~age + 6387 \times 1 + 2969~age \times 1\\ &= 63375 - 2071~age\end{align}\]

Interpretation of interaction effects

Jaguar: \[\widehat{price} = 56988 - 5040~age\]

Porsche: \[\widehat{price} = 63375 - 2071~age\]

The slopes differ: predicted price drops by about $5,040 per year of age for Jaguars but only about $2,071 per year for Porsches. That difference in slopes, $2,969, is the interaction coefficient.
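
You can recover the Porsche line directly from the coefficient estimates; a small sketch (assuming m_int from above):

# Porsche line = baseline (Jaguar) coefficients plus the carPorsche terms
int_coefs <- tidy(m_int) %>% select(term, estimate)
porsche_intercept <- int_coefs$estimate[int_coefs$term == "(Intercept)"] +
  int_coefs$estimate[int_coefs$term == "carPorsche"]    # roughly 56988 + 6387 = 63375
porsche_slope <- int_coefs$estimate[int_coefs$term == "age"] +
  int_coefs$estimate[int_coefs$term == "age:carPorsche"] # roughly -5040 + 2969 = -2071
c(porsche_intercept, porsche_slope)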

Continuous by continuous interactions
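
The same data also let us sketch a continuous-by-continuous interaction, though this model is not one fit in the notes: interact the two numeric predictors, age and mileage.

# Illustrative only: a continuous-by-continuous interaction between age and mileage.
# The age:mileage term describes how the mileage slope changes as age increases
# (and vice versa).
m_cont_int <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age * mileage, data = sports_car_prices)
tidy(m_cont_int)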

Third order interactions
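
Similarly, a third order (three-way) interaction could be sketched with all three predictors, though again this is not a model fit in the notes, and three-way terms are usually hard to interpret.

# Illustrative only: age * car * mileage expands to all main effects, all two-way
# interactions, and the three-way interaction term.
m_three_way <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age * car * mileage, data = sports_car_prices)
tidy(m_three_way)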

Multicollinearity

You don’t want the predictors in a multiple regression model to be too correlated with each other. When they are, you have multicollinearity. One way to diagnose multicollinearity is with variance inflation factors (VIFs). There is no universal cutoff, but a VIF of 10 is sometimes used.

Let’s see if we have multicollinearity in our first model. (Please do not run linear regressions this way on the exam; I’m doing it this way to demonstrate the car package’s vif() function.)

m_main_2 <- lm(price ~ age + car, data = sports_car_prices)
tibble(vif(m_main_2))
## # A tibble: 2 × 1
##   `vif(m_main_2)`
##             <dbl>
## 1            1.02
## 2            1.02
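
Wrapping vif() in tibble() drops the predictor names, so it can be hard to tell which VIF goes with which term. A small alternative sketch (same model, using enframe() from the tibble package):

# Keep the term names alongside their VIFs
enframe(vif(m_main_2), name = "term", value = "vif")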

Now, let’s check the interactive model.

m_int_2 <- lm(price ~ age + car + age * car, 
            data = sports_car_prices)
tibble(vif(m_int_2))
## # A tibble: 3 × 1
##   `vif(m_int_2)`
##            <dbl>
## 1           7.27
## 2           3.85
## 3          11.2

Notice the VIFs here are higher. This is to be expected with an interactive model.

Question: Why do you think VIFs will be higher in interactive models?

Yes: this is because the interaction term is constructed from the other predictors, so it is naturally correlated with them. This is normal for interactive models.
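
A quick check of this claim with the sports car data (a sketch, not part of the original notes): build the interaction column by hand and correlate it with age.

# The interaction column is constructed from age itself, so the two are strongly correlated
sports_car_prices %>%
  mutate(porsche = as.numeric(car == "Porsche"),
         age_x_porsche = age * porsche) %>%
  summarize(cor_age_interaction = cor(age, age_x_porsche))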

P-Hacking

Let’s return to the candy rankings data:

Notice below that our p-value for the sugar percentile variable is just barely above 0.05 (about 0.053). While arbitrary, this threshold is important in statistics and is sometimes used as the cutoff to determine whether results are publishable.

candy_rankings <- candy_rankings %>%
  mutate(sugarpercent100 = sugarpercent * 100)
candy <- linear_reg() %>%
  set_engine("lm") %>%
  fit(winpercent ~ sugarpercent100 + chocolate, data = candy_rankings) 
tidy(candy)
## # A tibble: 3 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      38.3       2.55       15.0  3.31e-25
## 2 sugarpercent100   0.0857    0.0435      1.97 5.25e- 2
## 3 chocolateTRUE    18.3       2.47        7.40 1.06e-10

FiveThirtyEight has an interactive page showing how you can p-hack your way to publication.

How could we p-hack our way to a significant result for the sugar variable?

Try adding the variable for whether the candy is a hard candy.

candy <- linear_reg() %>%
  set_engine("lm") %>%
  fit(winpercent ~ sugarpercent100 + chocolate + hard, data = candy_rankings) 
tidy(candy)
## # A tibble: 4 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      39.3       2.64       14.9  6.05e-25
## 2 sugarpercent100   0.0944    0.0437      2.16 3.36e- 2
## 3 chocolateTRUE    16.9       2.63        6.44 7.98e- 9
## 4 hardTRUE         -4.98      3.41       -1.46 1.48e- 1

We now have statistically significant results for our main variable of interest! But there are some data ethics issues with this approach.

Question: Why is p-hacking problematic?

P-hacking can lead you to report statistically significant results for effects that are not actually there, increasing the probability of a Type I error, among other concerns.
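
A small simulation sketch of the Type I error point (simulated noise, not the candy data): if we test 20 pure-noise predictors against a pure-noise outcome at the 0.05 level, some will look "significant" by chance alone.

set.seed(123)                 # for reproducibility
n <- 100
y <- rnorm(n)                 # outcome is pure noise
p_values <- map_dbl(1:20, ~ {
  x <- rnorm(n)               # each candidate predictor is pure noise
  tidy(lm(y ~ x)) %>%
    filter(term == "x") %>%
    pull(p.value)
})
sum(p_values < 0.05)          # how many "significant" results arise by chance?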

Practice

  1. Run and interpret a multiple regression with both age and mileage as predictors. Are both of these statistically significant predictors of the price of a car?
mileage <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ age + mileage, data = sports_car_prices)
tidy(mileage)
## # A tibble: 3 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62950.     3176.       19.8   3.03e-27
## 2 age           516.      601.        0.859 3.94e- 1
## 3 mileage        -0.695     0.122    -5.68  4.75e- 7

Slopes: All else equal, a one year increase in age predicts, on average, a $516 increase in price. This is not a statistically significant predictor.

All else equal, a one-mile increase in mileage predicts, on average, a $0.695 decrease in price. This is a statistically significant predictor. Because a one-mile change is so small, this is a variable where you might want to use a one standard deviation increase interpretation (see the sketch below).

Intercept: A car with 0 miles that is 0 years old has a predicted price of $62,950.
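
A sketch of the one standard deviation interpretation suggested above (assuming sports_car_prices and the fitted mileage model from this exercise):

sd_mileage <- sd(sports_car_prices$mileage)
mileage_slope <- tidy(mileage) %>%
  filter(term == "mileage") %>%
  pull(estimate)
# Predicted change in price for a one-SD increase in mileage, all else equal
sd_mileage * mileage_slope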

  2. (To Review) Find and interpret the adjusted \(R^2\) for this model.
glance(mileage) %>% 
  pull(adj.r.squared)
## [1] 0.5167014

Accounting for the penalty for including more variables, about 51.7% of the variability in price can be explained by the predictors in the model.
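
For comparison, a quick sketch (assuming prices_model from earlier) pulls the adjusted \(R^2\) of the age-only model, to see how much adding mileage improves the fit.

glance(prices_model) %>% 
  pull(adj.r.squared)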

  3. Examine the extent to which there is multicollinearity in this model.
mileage2 <- lm(price ~ age + mileage, data = sports_car_prices)
tibble(vif(mileage2))
## # A tibble: 2 × 1
##   `vif(mileage2)`
##             <dbl>
## 1            2.56
## 2            2.56

Both VIFs are about 2.56, well below the common cutoff of 10, so multicollinearity is not a serious concern in this model.

Now, please turn to the dataset in nycairquality.csv. This file contains daily air quality measurements in New York from May to September 1973, collected by the New York State Department of Conservation and the National Weather Service (Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Belmont, CA: Wadsworth).

airquality <- read_csv("airquality.csv")

  4. Please run and interpret a model with ozone in parts per billion as the response variable and solar radiation, wind, and temperature as the explanatory variables.
m_air <-  linear_reg() %>%
  set_engine("lm") %>%
  fit(Ozone ~ Solar.R + Wind + Temp, data = airquality)
tidy(m_air)
## # A tibble: 4 × 5
##   term        estimate std.error statistic       p.value
##   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept) -64.3      23.1        -2.79 0.00623      
## 2 Solar.R       0.0598    0.0232      2.58 0.0112       
## 3 Wind         -3.33      0.654      -5.09 0.00000152   
## 4 Temp          1.65      0.254       6.52 0.00000000242

Slopes: All else equal, a one langley increase in radiation predicts, on average, a 0.06 ppb increase in ozone.

All else equal, a one mph increase in wind speed translates, on average, to a 3.33 ppb decrease in ozone.

All else equal, a one degree Fahrenheit increase in temperature results, on average, in a 1.65 ppb increase in ozone.

Intercept: If all of the explanatory variables were equal to zero, the predicted ozone level would be -64.3 ppb. This value is not meaningful on its own (ozone concentrations cannot be negative); it is an extrapolation outside the range of the data.