library(tidyverse)
library(tidymodels)
manhattan <- read_csv("manhattan.csv")

Learning goals

Estimation

Point estimate

A point estimate is a single value computed from the sample data to serve as the “best guess”, or estimate, for the population parameter.

Suppose we were interested in the population mean. What would be natural point estimate to use?

You would use the first option below, the sample mean.

Quantity Parameter Statistic
Mean \(\mu\) \(\bar{x}\)
Variance \(\sigma^2\) \(s^2\)
Standard deviation \(\sigma\) \(s\)
Median \(M\) \(\tilde{x}\)
Proportion \(p\) \(\hat{p}\)

What is the downside to using point estimates?

Confidence intervals

A plausible range of values for the population parameter is an interval estimate. One type of interval estimate is known as a confidence interval.

  • If we report a point estimate, we probably won’t hit the exact population parameter.

  • If we report a range of plausible values, we have a good chance at capturing the parameter.

Rent in Manhattan

On a given day in 2018, twenty one-bedroom apartments were randomly selected on Craigslist Manhattan from apartments listed as “by owner”. The data are in the manhattan data frame. We will use this sample to conduct inference on the typical rent of 1 bedroom apartments in Manhattan.

Variability of sample statistics

  • In order to construct a confidence interval we need to quantify the variability of our sample statistic.

  • For example, if we want to construct a confidence interval for a population mean, we need to come up with a plausible range of values around our observed sample mean.

  • This range will depend on how precise and how accurate our sample mean is as an estimate of the population mean.

  • Quantifying this requires a measurement of how much we would expect the sample mean to vary from sample to sample.

Suppose you randomly sample 50 students and 5 of them are left handed. If you were to take another random sample of 50 students, how many would you expect to be left handed? Would you be surprised if only 3 of them were left handed? Would you be surprised if 40 of them were left handed?

Quantifying the variability of a sample statistic

We can quantify the variability of sample statistics using

  1. simulation: via bootstrapping (today);

  2. theory: via Central Limit Theorem (later in the course).

Bootstrapping

Bootstrapping scheme

  1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.

  2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed from the bootstrap samples.

  3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics.

  4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution.

Part 1: Drawing a bootstrap sample

Let’s start by using bootstrapping to estimate the mean rent of one-bedroom apartments in Manhattan.

Exercises

Part 1

What is the point estimate of the typical rent? Do you think this is the exact average rent for an apartment?

manhattan %>%
  summarize(mean_rent = mean(rent))

Part 2: Bootstrap confidence interval

We will use the infer package, included as part of tidymodels to calculate a 95% confidence interval for the mean rent of one-bedroom apartments in Manhattan.

We start by setting a seed to sure our analysis is reproducible. We’ll use 101121 to set our seed today but you can use any value you want on assignments unless we specify otherwise.

set.seed(101121)

Generating the bootstrap distribution

We can use R to take many bootstrap samples and generate a bootstrap distribution.

You can uncomment the lines and fill in the blanks to create the bootstrap distribution of sample means and save the results in the data frame boot_dist.

We will 100 reps for the in-class activity. (You will use about 15,000 reps for assignments outside of class.)

boot_dist <- manhattan #%>%
  #specify(______) %>%
  #generate(______) %>%
  #calculate(______)
boot_dist <- manhattan %>% 
  specify(response = rent) %>% 
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "mean")
  • How many rows are in boot_dist?
  • What does each row represent?
  • What are the variables in boot_dist? What do they mean?

Visualize the bootstrap distribution

Visualize the bootstrap distribution using a histogram. Describe the shape, center, and spread of this distribution.

# add code
boot_dist %>% 
  ggplot(aes(x = stat)) +
  geom_histogram(binwidth = 50)

Calculate the confidence interval

Uncomment the lines and fill in the blanks to construct the 95% bootstrap confidence interval for the mean rent of one-bedroom apartments in Manhattan.

#___ %>%
#  summarise(lower = quantile(______),
  #          upper = quantile(______))
boot_dist %>%
  summarise(
    lower = quantile(stat, 0.025),
    upper = quantile(stat, 0.975)
  )

Interpret the interval

Write the interpretation for the interval calculated above.

We are 95% confident that the true mean value of a one bedroom apartment in Manhattan in 2018 is between 2294.63 dollars and 2946.56 dollars.

Part 3: Changing the confidence level

#calculate a 90% confidence interval
boot_dist %>%
  summarise(
    lower = quantile(stat, 0.05),
    upper = quantile(stat, 0.95)
  )
#calculate a 99% confidence interval
boot_dist %>%
  summarise(
    lower = quantile(stat, 0.005),
    upper = quantile(stat, 0.995)
  ) 

A 90% interval is more precise, but a 99% interval is more accurate than 95% (i.e., more likely to include the true value).

Part 4: Additional practice- on your own or in groups

Next, use bootstrapping to estimate the median rent for one-bedroom apartments in Manhattan.

## add code
set.seed(101121)
boot_dist_median <- manhattan %>% 
  specify(response = rent) %>% 
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "median")
## add code
boot_dist_median %>%
  summarise(
    lower = quantile(stat, 0.04),
    upper = quantile(stat, 0.96)
  )

For next time:

The infer package is tough to learn (but once you do, you can do lots with it)! Here are two resources that I think you will find useful: