library(tidyverse)
library(tidymodels)
manhattan <- read_csv("manhattan.csv")
Population: a group of individuals or objects we are interested in studying
Parameter: a numerical quantity derived from the population (almost always unknown)
Sample: a subset of our population of interest
Statistic: a numerical quantity derived from a sample
Common population parameters of interest and their corresponding sample statistic:
Quantity | Parameter | Statistic |
---|---|---|
Mean | \(\mu\) | \(\bar{x}\) |
Variance | \(\sigma^2\) | \(s^2\) |
Standard deviation | \(\sigma\) | \(s\) |
Median | \(M\) | \(\tilde{x}\) |
Proportion | \(p\) | \(\hat{p}\) |
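As a quick illustration of the statistic column, the sketch below computes each sample statistic in base R. The vector `x` and the cutoff of 10 used for the proportion are made-up values for illustration only, not data from these notes.

```r
# Hypothetical sample, for illustration only
x <- c(4, 8, 15, 16, 23, 42)

x_bar   <- mean(x)       # sample mean
s2      <- var(x)        # sample variance (divides by n - 1)
s       <- sd(x)         # sample standard deviation
x_tilde <- median(x)     # sample median
p_hat   <- mean(x > 10)  # sample proportion of values above 10

c(mean = x_bar, variance = s2, sd = s, median = x_tilde, proportion = p_hat)
```

Each of these is a statistic because it is computed from the sample; the corresponding parameter would require the whole population.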
Statistical inference is the process of using sample data to make conclusions about the underlying population the sample came from.
Estimation: estimating an unknown parameter based on values from the sample at hand
Testing: evaluating whether our observed sample provides evidence for or against some claim about the population
We will now move to testing hypotheses.
Statistical hypothesis testing is the procedure that assesses evidence provided by the data in favor of or against some claim about the population (often about a population parameter or potential associations).
Example:
The state of North Carolina claims that students in 8th grade spent, on average, 200 minutes on Zoom each day in Spring 2021. What do you make of this statement? How would you evaluate the veracity of the claim?
1. Start with two hypotheses about the population: the null hypothesis and the alternative hypothesis.
2. Choose a (representative) sample, collect data, and analyze the data.
3. Figure out how likely it is to see data like what we observed, IF the null hypothesis were in fact true.
4. If our data would have been extremely unlikely if the null claim were true, then we reject it and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim.
The null hypothesis (often denoted \(H_0\)) states that “nothing unusual is happening” or “there is no relationship,” etc.
On the other hand, the alternative hypothesis (often denoted \(H_1\) or \(H_A\)) states the opposite: that there is some sort of relationship (usually this is what we want to check or really think is happening).
In statistical hypothesis testing we always first assume that the null hypothesis is true and then see whether we reject or fail to reject this claim.
The null and alternative hypotheses are defined for parameters, not statistics.
What will our null and alternative hypotheses be for this example?
Expressed in symbols:
\[H_0: \mu = 200\]
\[H_1: \mu \neq 200\]
where \(\mu\) is the true population mean time spent on Zoom per day by 8th grade North Carolina students.
With these two hypotheses, we now take our sample and summarize the data.
zoom_time <- c(299, 192, 196, 218, 194, 250, 183, 218, 207,
               209, 191, 189, 244, 233, 208, 216, 178, 209,
               201, 173, 186, 209, 188, 231, 195, 200, 190,
               199, 226, 238)
mean(zoom_time)
## [1] 209
The choice of summary statistic calculated depends on the type of data. In our example, we use the sample mean: \(\bar{x} = 209\).
Do you think this is enough evidence to conclude that the mean time is not 200 minutes?
Next, we calculate the probability of getting data like ours, or more extreme, if \(H_0\) were in fact actually true.
This is a conditional probability: given that \(H_0\) is true (i.e., if \(\mu\) were actually 200), what would be the probability of observing a sample mean at least as extreme as \(\bar{x} = 209\)? This probability is known as the p-value.
We reject the null hypothesis if this conditional probability is small enough.
If it is very unlikely to observe our data (or more extreme) if \(H_0\) were actually true, then that might give us enough evidence to suggest that it is actually false (and that \(H_1\) is true).
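One way to write this probability down is sketched below. It assumes the two-sided alternative \(H_1: \mu \neq 200\) and writes \(\bar{X}\) for the sample mean viewed as a random quantity (notation introduced here for illustration, not used elsewhere in these notes):

\[\mbox{p-value} = P\left(|\bar{X} - 200| \ge |209 - 200| \;\middle|\; \mu = 200\right) = P\left(|\bar{X} - 200| \ge 9 \;\middle|\; \mu = 200\right)\]

The conditioning on \(\mu = 200\) is what makes this a statement about the data under \(H_0\), not a statement about \(H_0\) itself.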
What is “small enough”?
We often use a numeric cutpoint, the significance level \(\alpha\), which is chosen before conducting the analysis.
Many analyses use \(\alpha = 0.05\). This means that if \(H_0\) were in fact true, we would expect to make the wrong decision only 5% of the time.
Case 1: \(\mbox{p-value} \ge \alpha\):
If the p-value is \(\alpha\) or greater, we say the results are not statistically significant and we fail to reject \(H_0\).
Importantly, we never “accept” the null hypothesis – we performed the analysis assuming that \(H_0\) was true to begin with and assessed the probability of seeing our observed data or more extreme under this assumption.
Case 2: \(\mbox{p-value} < \alpha\)
If the p-value is less than \(\alpha\), we say the results are statistically significant. In this case, we would make the decision to reject the null hypothesis.
Similarly, we never “accept” the alternative hypothesis.
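The two cases can be captured in a few lines of R. This is only a sketch: `decide()` is a hypothetical helper written for this illustration, and the p-values passed to it are made up.

```r
# Hypothetical helper: turn a p-value and a significance level into a decision
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) {
    "reject H0"          # Case 2: statistically significant
  } else {
    "fail to reject H0"  # Case 1: not statistically significant
  }
}

decide(0.20)  # made-up p-value above alpha
decide(0.01)  # made-up p-value below alpha
```

Note that a p-value exactly equal to \(\alpha\) falls under Case 1, so the comparison is a strict `<`.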
“The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis were true. We typically use a summary statistic of the data, in this section the sample proportion, to help compute the p-value and evaluate the hypotheses.” (Open Intro Stats, pg. 194)
Two common misinterpretations of the p-value, both of which are wrong:
“A p-value of 0.05 means the null hypothesis has a probability of only 5% of being true.”
“A p-value of 0.05 means there is a 95% chance or greater that the null hypothesis is incorrect.”
p-values do not provide information on the probability that the null hypothesis is true given our observed data.
Again, a p-value is calculated assuming that \(H_0\) is true. It cannot tell us how likely it is that this assumption is correct. When we fail to reject the null hypothesis, we are stating that there is insufficient evidence to assert that it is false. This could be because…
… \(H_0\) actually is true!
… \(H_0\) is false, but we got unlucky and happened to get a sample that didn’t give us enough reason to say that \(H_0\) was false
Even worse, hypothesis testing does NOT give us the tools to determine which of the two scenarios occurred.
Suppose we test a certain null hypothesis, which can be either true or false (we never know for sure!). We make one of two decisions given our data: either reject or fail to reject \(H_0\).
We have the following four scenarios:
Decision | \(H_0\) is true | \(H_0\) is false |
---|---|---|
Fail to reject \(H_0\) | Correct decision | Type II Error |
Reject \(H_0\) | Type I Error | Correct decision |
It is important to weigh the consequences of making each type of error.
In fact, \(\alpha\) is precisely the probability of making a Type I error, that is, of rejecting \(H_0\) when it is actually true. We will talk about this (and the associated probability of making a Type II error) in future lectures.
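A small simulation can make the link between \(\alpha\) and the Type I error rate concrete. The sketch below is not part of the original analysis: it assumes a normal population with mean 200 and standard deviation 25 (both made up), and it uses a one-sample t-test from base R as a stand-in for the test, since only the reject / fail-to-reject decision matters here.

```r
set.seed(1)

alpha  <- 0.05
n_sims <- 2000

# Repeatedly draw samples from a population where H0 (mu = 200) is TRUE
# and record the p-value of a test of H0 each time.
p_values <- replicate(n_sims, {
  x <- rnorm(30, mean = 200, sd = 25)  # assumed population, for illustration
  t.test(x, mu = 200)$p.value
})

# Proportion of simulations in which we (wrongly) reject H0
type1_rate <- mean(p_values < alpha)
type1_rate
```

Because \(H_0\) is true in every simulated sample, every rejection is a Type I error, and the proportion of rejections should land near \(\alpha\).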
We’ll continue to work with the sample of Zoom screen-time data we obtained. To make things easier with the `infer` functions, we’ll create a tibble with `time` as a single variable.
zoom <- tibble(
  time = c(299, 192, 196, 218, 194, 250, 183, 218, 207,
           209, 191, 189, 244, 233, 208, 216, 178, 209,
           201, 173, 186, 209, 188, 231, 195, 200, 190,
           199, 226, 238)
)
zoom
## # A tibble: 30 × 1
## time
## <dbl>
## 1 299
## 2 192
## 3 196
## 4 218
## 5 194
## 6 250
## 7 183
## 8 218
## 9 207
## 10 209
## # … with 20 more rows
To obtain reproducible results, set the seed for the random number generation.
set.seed(1421)
Recall our hypothesis testing framework:
1. Start with two hypotheses about the population: the null hypothesis and the alternative hypothesis.
2. Choose a (representative) sample, collect data, and analyze the data.
3. Figure out how likely it is to see data like what we observed, IF the null hypothesis were in fact true.
4. If our data would have been extremely unlikely if the null claim were true, then we reject it and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim.
We’ve already done items 1 and 2, where
\[H_0: \mu = 200\]
\[H_1: \mu \neq 200\]
For this study, let \(\alpha = 0.05\).
To tackle items 3 and 4, we’ll use a simulation-based approach with functions from `infer`.
Recall that there is variability in the sampling distribution of the sample mean. We need to account for this in our statistical study. Just as we did for confidence intervals, we’ll use a bootstrap procedure here.
1. `specify()` the variable of interest
2. set the null hypothesis with `hypothesize()`
3. `generate()` the bootstrap samples
4. `calculate()` the statistic of interest
null_dist <- zoom %>%
  specify(response = time) %>%
  hypothesize(null = "point", mu = 200) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "mean")
visualise(null_dist) +
labs(x = "Sample means", y = "Count", title = "Simulated null distribution")
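To see roughly what such a pipeline does, here is a base R sketch of a bootstrap null distribution for a mean. It rests on an assumption about the procedure: the sample is first shifted so that its mean equals the null value, and then resampled with replacement. The names `mu_0`, `shifted`, and `null_means` are introduced here for illustration, and the simulated values will not match the `infer` output even with the same seed.

```r
set.seed(1421)

time <- c(299, 192, 196, 218, 194, 250, 183, 218, 207,
          209, 191, 189, 244, 233, 208, 216, 178, 209,
          201, 173, 186, 209, 188, 231, 195, 200, 190,
          199, 226, 238)
mu_0 <- 200  # value of the mean claimed by H0

# Shift the sample so that its mean equals mu_0; the spread is unchanged.
shifted <- time - mean(time) + mu_0

# Resample with replacement and record the mean of each resample.
null_means <- replicate(100, mean(sample(shifted, replace = TRUE)))

mean(null_means)  # should sit close to 200
```

The shift is what makes this a null distribution: it keeps the variability seen in the sample but centers the resampled means on the value claimed by \(H_0\).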
What do you notice?
Next, we calculate the probability of getting data like ours, or more extreme, if \(H_0\) were in fact true.
Our observed sample mean is 209 minutes.
x_bar <- zoom %>%
  summarise(mean_time = mean(time))
x_bar
## # A tibble: 1 × 1
## mean_time
## <dbl>
## 1 209
visualise(null_dist) +
  shade_p_value(obs_stat = x_bar, direction = "two-sided") +
  labs(x = "Sample mean", y = "Count")
In the context of this simulation-based approach, the p-value is the proportion of simulated sample means that fall in the region shaded light red. To compute this, `infer` provides a convenient function: `get_p_value()`.
null_dist %>%
  get_p_value(obs_stat = x_bar, direction = "two-sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.06
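For intuition, a two-sided simulation-based p-value can be computed by hand. The sketch below assumes the convention of doubling the smaller of the two tail proportions (capped at 1), which is consistent with the one-sided and two-sided results later in these notes but should be treated as an assumption about what `get_p_value()` does. The function `two_sided_p()` and the vector `null_stats` are hypothetical and made up for illustration.

```r
# Hypothetical helper: two-sided p-value from a vector of simulated statistics
two_sided_p <- function(null_stats, obs_stat) {
  p_less    <- mean(null_stats <= obs_stat)  # proportion at or below observed
  p_greater <- mean(null_stats >= obs_stat)  # proportion at or above observed
  min(1, 2 * min(p_less, p_greater))         # double the smaller tail
}

# Made-up simulated sample means, for illustration only
null_stats <- c(190, 195, 198, 200, 201, 203, 205, 208, 210, 212)

two_sided_p(null_stats, obs_stat = 209)
```

Here 2 of the 10 made-up values are at or above 209, so the smaller tail is 0.2 and the two-sided p-value is 0.4.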
Given the calculated p-value and the specified \(\alpha\), what conclusion do you make?
On a given day in 2018, twenty one-bedroom apartments were randomly selected on Craigslist Manhattan from apartments listed as “by owner”. The data are in the `manhattan` data frame. We will use this sample to conduct inference on the typical rent of one-bedroom apartments in Manhattan.
Suppose you are interested in whether the mean rent of one-bedroom apartments in Manhattan is actually less than $3000. Choose the correct null and alternative hypotheses.
Let’s use simulation-based methods to conduct the hypothesis test specified in Exercise 1. We’ll start by generating the null distribution.
Fill in the code and uncomment the lines below to generate and then visualize the null distribution.
set.seed(101321)
#null_dist_2 <- manhattan %>%
#specify(response = ______) %>%
#hypothesize(null = ______, mu = ______) %>%
#generate(reps = 100, type = "bootstrap") %>%
#calculate(stat = _____)
null_dist_2 <- manhattan %>%
  specify(response = rent) %>%
  hypothesize(null = "point", mu = 3000) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "mean")
visualize(null_dist_2)
Fill in the code and uncomment the lines below to calculate the p-value using the null distribution from Exercise 2.
mean_rent <- manhattan %>%
  summarise(mean_rent = mean(rent)) %>%
  pull()
#null_dist_2 %>%
# get_p_value(obs_stat = ___ , direction = "____")
null_dist_2 %>%
  get_p_value(obs_stat = mean_rent, direction = "less")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.02
Fill in the direction in the code below and uncomment to visualize the shaded area used to calculate the p-value.
#visualize(null_dist_2) +
# shade_p_value(obs_stat = mean_rent, direction = "______")
visualize(null_dist_2) +
  shade_p_value(obs_stat = mean_rent, direction = "less")
Let’s think about what’s happening when we run `get_p_value()`. Fill in the code below to calculate the p-value “manually” using some of the `dplyr` functions we’ve learned.
#null_dist_2 %>%
# filter(_____) %>%
# summarise(p_value = ______)
null_dist_2 %>%
  filter(stat <= mean_rent) %>%   # keep simulated means at or below the observed mean
  summarise(p_value = n() / 100)  # divide by the number of reps
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.02
Use the p-value to make your conclusion using a significance level of 0.05. Remember, the conclusion has 3 components
Suppose instead you wanted to test the claim that the mean price of rent is not equal to $3000. Which of the following would change? Select all that apply.
Let’s test the claim in Exercise 5. Conduct the hypothesis test, then state your conclusion in the context of the data.
## add code
null_dist_2 %>%
  get_p_value(obs_stat = mean_rent, direction = "two-sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.04
Create a new variable `over2500` that indicates whether or not the rent is greater than $2500.
# add code
manhattan <- manhattan %>%
  mutate(over2500 = ifelse(rent > 2500, "greater", "less"))
Suppose you are interested in testing whether a majority of one-bedroom apartments in Manhattan have rent greater than $2500.
State the null and alternative hypotheses.
Fill in the code to generate the null distribution.
#null_dist_3 <- ____ %>%
# specify(response = ____, success = "_____") %>%
# hypothesize(null = "point", p = ____) %>%
# generate(reps = 100, type = "draw") %>%
# calculate(stat = "prop")
null_dist_3 <- manhattan %>%
  specify(response = over2500, success = "greater") %>%
  hypothesize(null = "point", p = 0.5) %>%
  generate(reps = 100, type = "draw") %>%
  calculate(stat = "prop")
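For a proportion, the null distribution can be sketched in base R by simulating samples directly from the hypothesized proportion. This is an illustrative stand-in, not the `infer` implementation: it assumes each simulated sample has the same size as the data (twenty apartments) and that each apartment is a "success" with probability 0.5 under \(H_0\). The names `n`, `p_0`, and `null_props` are introduced here.

```r
set.seed(101321)

n    <- 20   # sample size: twenty apartments
p_0  <- 0.5  # proportion of successes claimed by H0
reps <- 100  # number of simulated samples

# Each draw is the number of successes in a sample of size n under H0;
# dividing by n turns the count into a sample proportion.
null_props <- rbinom(reps, size = n, prob = p_0) / n

summary(null_props)
```

Unlike the bootstrap used for the mean, nothing is resampled from the observed data here; only the sample size carries over.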
# add code
p_hat <- manhattan %>%
  count(over2500) %>%
  mutate(probability = n / sum(n)) %>%
  filter(over2500 == "greater") %>%
  select(probability)
visualize(null_dist_3) +
  shade_p_value(obs_stat = p_hat, direction = "greater") +
  labs(x = "Sample Proportion", y = "Count")
# add code
null_dist_3 %>%
  get_p_value(obs_stat = p_hat, direction = "greater")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.93
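With only 100 simulated samples the p-value is coarse, so it can be useful to compare it with an exact calculation. The sketch below uses the binomial distribution, a different technique from the simulation above: under \(H_0\), the number of "greater" apartments in a sample of 20 follows a Binomial(20, 0.5) distribution. The count `k` is HYPOTHETICAL and is not taken from the `manhattan` data; substitute the observed count to get a comparable result.

```r
n   <- 20   # sample size stated in the exercise
p_0 <- 0.5  # proportion under H0
k   <- 12   # HYPOTHETICAL number of apartments over $2500, not the real count

# P(X >= k) when X ~ Binomial(n, p_0)
p_exact <- pbinom(k - 1, size = n, prob = p_0, lower.tail = FALSE)
p_exact

# The same upper-tail probability from the built-in exact binomial test
binom.test(k, n, p = p_0, alternative = "greater")$p.value
```

The `k - 1` is needed because `pbinom(..., lower.tail = FALSE)` returns \(P(X > q)\), not \(P(X \ge q)\).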