library(tidyverse)
library(tidymodels)
yawn <- read_csv("yawn.csv")
Use and understand simulation-based methods to …
Discussion of bootstrap, draw, and permute here.
Bootstrap: resamples the data with replacement; used for confidence intervals (for example, for a single mean).
Draw: simulates outcomes under a point null; used only for hypothesis tests about a single proportion.
Permute: “shuffles” the data without replacement (see the explanation here). You might use this for a difference in means or to test for independence (difference in proportions).
If you want to double-check, I recommend reading the infer vignette here. This is not an exhaustive list, just some of the common types we use in this course; a rough skeleton for each is sketched below.
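To see how each of these fits into an infer pipeline, here is a rough skeleton for each generation type. The data set and variable names (my_data, my_cat_var, and so on) are placeholders for illustration, not objects from this lab:
# bootstrap: resample with replacement to build a confidence interval
#boot_dist <- my_data %>%
#  specify(response = my_numeric_var) %>%
#  generate(reps = 1000, type = "bootstrap") %>%
#  calculate(stat = "mean")

# draw: simulate from a point null for a single proportion
#null_dist <- my_data %>%
#  specify(response = my_cat_var, success = "yes") %>%
#  hypothesize(null = "point", p = 0.5) %>%
#  generate(reps = 1000, type = "draw") %>%
#  calculate(stat = "prop")

# permute: shuffle to test independence of two variables
#null_dist <- my_data %>%
#  specify(response = my_cat_var, explanatory = my_group_var, success = "yes") %>%
#  hypothesize(null = "independence") %>%
#  generate(reps = 1000, type = "permute") %>%
#  calculate(stat = "diff in props", order = c("group1", "group2"))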
A large university knows that about 70% of the full-time students are employed at least 5 hours per week. The members of the Statistics Department wonder if the same proportion of their students work at least 5 hours per week. They randomly sample 25 majors and find that 15 of the students work 5 or more hours each week. Use the code below to create a data frame of the results.
stats_work <- tibble(work_hours = c(rep("At least 5", 15),
                                    rep("Less than 5", 10)))
State the null and alternative hypotheses.
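In the code that follows, the hypothesized value is \(p = 0.7\) and the p-value is two-sided, so one way to state the hypotheses, with \(p\) denoting the proportion of statistics majors who work at least 5 hours per week, is:
\[H_0: p = 0.7 \qquad \text{vs.} \qquad H_a: p \neq 0.7\]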
Fill in the code to generate the null distribution.
set.seed(101821)
#null_dist <- ____ %>%
# specify(response = ____, success = "_____") %>%
# hypothesize(null = "point", p = ____) %>%
# generate(reps = 1000, type = "draw") %>%
# calculate(stat = "prop")
null_dist <- stats_work %>%
  specify(response = work_hours, success = "At least 5") %>%
  hypothesize(null = "point", p = 0.7) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "prop")
# add code
# observed proportion of students who work at least 5 hours per week
prop_over5 <- stats_work %>%
  count(work_hours) %>%
  mutate(p = n / sum(n)) %>%
  filter(work_hours == "At least 5") %>%
  pull(p)
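As a check, the observed statistic can also be computed with infer itself. This is a sketch that assumes calculate() returns the observed statistic when no generate() step is used, which is how recent versions of infer behave:
# observed proportion computed directly with infer (should match prop_over5)
obs_prop <- stats_work %>%
  specify(response = work_hours, success = "At least 5") %>%
  calculate(stat = "prop")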
visualize(null_dist) +
  shade_p_value(obs_stat = prop_over5, direction = "two-sided")
Note: the slight asymmetry around 0.7 appears to be an artifact of the seed we are using.
# add code
null_dist %>%
  get_p_value(obs_stat = prop_over5, direction = "two-sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.354
First, let's watch the experiment from Mythbusters here [https://www.youtube.com/watch?v=mrr_UjNLbhE].
Let \(t\) be the treatment group who saw a person yawn, \(c\) be the control group who did not see anyone yawn, and \(p\) be the proportion of people who yawned.
We want to use simulation-based inference to assess whether or not yawning and seeing someone yawn are independent.
State the null and alternative hypotheses in words:
Select the appropriate null and alternative hypotheses written in mathematical notation:
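Consistent with the one-sided test carried out below (direction = "greater"), one reasonable way to write them, with \(p_t\) and \(p_c\) denoting the proportions of yawners in the treatment and control groups, is:
\[H_0: p_t - p_c = 0 \qquad \text{vs.} \qquad H_a: p_t - p_c > 0\]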
Let’s use the data from the Mythbusters episode and simulation-based inference in R to test this claim. Based on their experiment, do yawning and seeing someone yawn appear to be dependent?
Evaluate this question using a simulation-based approach. We will use the same null and alternative hypotheses as before. The data from the Mythbusters episode is available in the yawn data frame.
set.seed(101821)
#null_dist <- yawn %>%
# specify(response = ____, explanatory = _____, success = "yawn") %>%
# hypothesize(null = "______") %>%
# generate(100, type = "permute") %>%
# calculate(stat = "______",
# order = c("trmt", "ctrl"))
null_dist <- yawn %>%
  specify(response = result, explanatory = group, success = "yawn") %>%
  hypothesize(null = "independence") %>%
  generate(100, type = "permute") %>%
  calculate(stat = "diff in props",
            order = c("trmt", "ctrl"))
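The value 0.0441 used as the observed statistic below is the observed difference in the proportions of yawners between the two groups. Rather than hard-coding it, it could be computed from the yawn data; a sketch, again assuming calculate() returns the observed statistic when no generate() step is used:
# observed difference in proportions (treatment minus control); should be about 0.0441
obs_diff <- yawn %>%
  specify(response = result, explanatory = group, success = "yawn") %>%
  calculate(stat = "diff in props", order = c("trmt", "ctrl"))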
# add code
visualize(null_dist) +
  shade_p_value(obs_stat = 0.0441, direction = "greater")
null_dist %>%
  filter(stat >= 0.0441) %>%
  summarise(p_value = n() / nrow(null_dist))
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.52
get_p_value(null_dist, 0.0441, "greater")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.52
# add code
Construct a 90% confidence interval for the difference in proportion of yawners between those who see someone else yawn and those who don’t.
#boot_dist <- yawn %>%
#specify(response = ______, explanatory = ______, success = "yawn") %>%
# generate(100, type = "bootstrap") %>%
# calculate(stat = "______",
# order = c("trmt", "ctrl"))
boot_dist <- yawn %>%
  specify(response = result, explanatory = group, success = "yawn") %>%
  generate(100, type = "bootstrap") %>%
  calculate(stat = "diff in props",
            order = c("trmt", "ctrl"))
# calculate the lower and upper bounds for the 90% ci
#boot_dist %>%
# summarise(lower = quantile(stat, ___),
# upper = quantile(stat, ___))
boot_dist %>%
  summarise(lower = quantile(stat, 0.05),
            upper = quantile(stat, 0.95))
## # A tibble: 1 × 2
## lower upper
## <dbl> <dbl>
## 1 -0.148 0.257
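An equivalent shortcut is infer's get_confidence_interval(), which should return the same percentile bounds:
# 90% percentile bootstrap interval using infer's helper
boot_dist %>%
  get_confidence_interval(level = 0.90, type = "percentile")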
Interpret the interval in the context of the data.
Suppose you use the confidence interval to evaluate the hypotheses in Exercise 2. Is the conclusion drawn from the confidence interval consistent with the conclusion from the hypothesis test?
Do you buy the conclusions from this experiment? Why or why not?
This question was adapted from Introduction to Modern Statistics.