library(tidyverse)
library(tidymodels)
manhattan <- read_csv("manhattan.csv")
We will use the infer package to obtain a bootstrap distribution.

A point estimate is a single value computed from the sample data to serve as the “best guess”, or estimate, for the population parameter.
Suppose we were interested in the population mean. What would be a natural point estimate to use?
You would use the first option below, the sample mean.
Quantity | Parameter | Statistic |
---|---|---|
Mean | \(\mu\) | \(\bar{x}\) |
Variance | \(\sigma^2\) | \(s^2\) |
Standard deviation | \(\sigma\) | \(s\) |
Median | \(M\) | \(\tilde{x}\) |
Proportion | \(p\) | \(\hat{p}\) |
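Each statistic in the table can be computed directly in R. A minimal sketch, using a small made-up numeric vector in place of a real sample (any numeric column, such as the rent variable, works the same way):

```r
# Small made-up sample standing in for a numeric variable like rent
x <- c(2100, 2500, 1950, 3200, 2800)

mean(x)    # sample mean, the point estimate of the population mean
var(x)     # sample variance, the point estimate of sigma^2
sd(x)      # sample standard deviation, the point estimate of sigma
median(x)  # sample median, the point estimate of the population median
```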
What is the downside to using point estimates?
A plausible range of values for the population parameter is an interval estimate. One type of interval estimate is known as a confidence interval.
If we report a point estimate, we probably won’t hit the exact population parameter.
If we report a range of plausible values, we have a good chance at capturing the parameter.
On a given day in 2018, twenty one-bedroom apartments were randomly selected on Craigslist Manhattan from apartments listed as “by owner”. The data are in the manhattan
data frame. We will use this sample to conduct inference on the typical rent of 1 bedroom apartments in Manhattan.
In order to construct a confidence interval we need to quantify the variability of our sample statistic.
For example, if we want to construct a confidence interval for a population mean, we need to come up with a plausible range of values around our observed sample mean.
This range will depend on how precise and how accurate our sample mean is as an estimate of the population mean.
Quantifying this requires a measurement of how much we would expect the sample mean to vary from sample to sample.
Suppose you randomly sample 50 students and 5 of them are left handed. If you were to take another random sample of 50 students, how many would you expect to be left handed? Would you be surprised if only 3 of them were left handed? Would you be surprised if 40 of them were left handed?
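One way to build intuition here is to simulate the scenario. A sketch, assuming (a made-up value for illustration) that the true proportion of left-handed students is 10%:

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Number of left-handed students in each of 1000 samples of size 50,
# assuming the true proportion is 0.10
lefties <- rbinom(1000, size = 50, prob = 0.10)

table(lefties)       # counts near 5 are common, so 3 is unsurprising
mean(lefties >= 40)  # 40 out of 50 essentially never happens
```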
We can quantify the variability of sample statistics using
simulation: via bootstrapping (today);
theory: via Central Limit Theorem (later in the course).
The term bootstrapping comes from the phrase “pulling oneself up by one’s bootstraps”, to help oneself without the aid of others.
In this case, we are estimating a population parameter using data from only the given sample.
This notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference; it is not limited to bootstrapping.
Here is a cool animation of the bootstrapping process.
1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.
2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed from the bootstrap sample.
3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics.
4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution.
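Before turning to infer, the steps above can be sketched by hand in base R. This uses a small made-up rent vector in place of the real sample:

```r
set.seed(42)  # arbitrary seed for reproducibility
rents <- c(2100, 2500, 1950, 3200, 2800)  # made-up stand-in sample

# Steps 1-3: resample with replacement, compute the mean, repeat
boot_means <- replicate(1000, {
  resample <- sample(rents, size = length(rents), replace = TRUE)
  mean(resample)
})

# Step 4: the middle 95% of the bootstrap distribution
quantile(boot_means, c(0.025, 0.975))
```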
Let’s start by using bootstrapping to estimate the mean rent of one-bedroom apartments in Manhattan.
What is the point estimate of the typical rent? Do you think this is the exact average rent for an apartment?
manhattan %>%
summarize(mean_rent = mean(rent))
We will use the infer
package, included as part of tidymodels
to calculate a 95% confidence interval for the mean rent of one-bedroom apartments in Manhattan.
We start by setting a seed to ensure our analysis is reproducible. We'll use 101121 as our seed today, but you can use any value you want on assignments unless we specify otherwise.
set.seed(101121)
We can use R to take many bootstrap samples and generate a bootstrap distribution.
You can uncomment the lines and fill in the blanks to create the bootstrap distribution of sample means and save the results in the data frame boot_dist.
We will use 100 reps for the in-class activity. (You will use about 15,000 reps for assignments outside of class.)
boot_dist <- manhattan #%>%
#specify(______) %>%
#generate(______) %>%
#calculate(______)
boot_dist <- manhattan %>%
specify(response = rent) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean")
boot_dist
What are the observations in boot_dist? What do they mean?

Visualize the bootstrap distribution using a histogram. Describe the shape, center, and spread of this distribution.
# add code
boot_dist %>%
ggplot(aes(x = stat)) +
geom_histogram(binwidth = 50)
Uncomment the lines and fill in the blanks to construct the 95% bootstrap confidence interval for the mean rent of one-bedroom apartments in Manhattan.
#___ %>%
# summarise(lower = quantile(______),
# upper = quantile(______))
boot_dist %>%
summarise(
lower = quantile(stat, 0.025),
upper = quantile(stat, 0.975)
)
Write the interpretation for the interval calculated above.
We are 95% confident that the true mean rent of one-bedroom apartments in Manhattan in 2018 is between $2,294.63 and $2,946.56.
#calculate a 90% confidence interval
boot_dist %>%
summarise(
lower = quantile(stat, 0.05),
upper = quantile(stat, 0.95)
)
#calculate a 99% confidence interval
boot_dist %>%
summarise(
lower = quantile(stat, 0.005),
upper = quantile(stat, 0.995)
)
A 90% interval is narrower and thus more precise, but a 99% interval is wider and more likely to capture the true value of the parameter than a 95% interval.
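You can see this trade-off numerically by comparing interval widths at different confidence levels on the same bootstrap distribution. A sketch using simulated draws as a stand-in for boot_dist$stat:

```r
set.seed(7)  # arbitrary seed for reproducibility
stat <- rnorm(10000, mean = 2600, sd = 160)  # stand-in for boot_dist$stat

# Width of the middle `level` proportion of the distribution
interval_width <- function(level) {
  alpha <- (1 - level) / 2
  unname(diff(quantile(stat, c(alpha, 1 - alpha))))
}

interval_width(0.90)  # narrowest: most precise
interval_width(0.95)
interval_width(0.99)  # widest: most likely to capture the parameter
```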
Next, use bootstrapping to estimate the median rent for one-bedroom apartments in Manhattan.
Save the results in the data frame boot_dist_median. Why have I set a seed here again?
## add code
set.seed(101121)
boot_dist_median <- manhattan %>%
specify(response = rent) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "median")
## add code
boot_dist_median %>%
summarise(
lower = quantile(stat, 0.025),
upper = quantile(stat, 0.975)
)
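The same percentile logic carries over to the median. A base-R sketch with a made-up rent vector, for comparison with the infer pipeline above:

```r
set.seed(101121)  # same seed as above, for reproducibility
rents <- c(2100, 2500, 1950, 3200, 2800, 2300, 2750)  # made-up stand-in

# Bootstrap the median instead of the mean
boot_medians <- replicate(1000, median(sample(rents, replace = TRUE)))

# Middle 95% of the bootstrapped medians
quantile(boot_medians, c(0.025, 0.975))
```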
For next time:
The infer package is tough to learn (but once you do, you can do lots with it)! Here are two resources that I think you will find useful: