Main ideas
Apply the ideas from last class to use the CLT to carry out hypothesis tests.
Learn how to use the t_test function.
library(tidyverse)
library(tidymodels)
ggplot(data = data.frame(x = c(809 - 140*3, 809 + 140*3)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 809, sd = 140),
color = "black") +
stat_function(fun = dnorm, args = list(mean = 809, sd = 140/sqrt(10)),
color = "red",lty = 2) + theme_bw() +
labs(title = "Black solid line = population dist., Red dashed line = sampling dist.")
For a population with a well-defined mean \(\mu\) and standard deviation \(\sigma\), these three properties hold for the distribution of the sample average \(\bar{X}\), assuming certain conditions hold:
The mean of the sampling distribution is identical to the population mean \(\mu\),
The standard deviation of the distribution of the sample averages is \(\sigma/\sqrt{n}\), or the standard error (SE) of the mean, and
For \(n\) large enough (in the limit, as \(n \to \infty\)), the shape of the sampling distribution of means is approximately normal (Gaussian).
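The three properties above can be checked with a small simulation. This is a minimal sketch, assuming a normal population with mean 809 and standard deviation 140 (the values used in the plot above); the sample size of 10 and the 10,000 replicates are illustrative choices.

```r
# Simulate the sampling distribution of the sample mean.
# Assumptions (illustrative): Normal(809, 140) population, n = 10, 10,000 replicates.
set.seed(1)
mu <- 809
sigma <- 140
n <- 10
reps <- 10000

# Each replicate draws n observations and records their average.
sample_means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

mean(sample_means)  # should be close to mu
sd(sample_means)    # should be close to sigma / sqrt(n)
sigma / sqrt(n)     # theoretical standard error of the mean
```

A histogram of sample_means should look roughly bell-shaped and be centered near 809.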
What are the conditions we need for the CLT to hold?
Independence: The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
Sample size / distribution:
Assuming the conditions for the CLT hold, \(\bar{X}\) approximately has the following distribution:
\[\mbox{Normal}\left(\mu, \sigma/\sqrt{n}\right)\]
Equivalently, we can define the quantity \(Z\), such that \(Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\), where \(Z\) has the following distribution: \[\mbox{Normal}\left(0, 1 \right)\]
Assuming the conditions for the CLT hold, \(\hat{p}\) approximately has the following distribution:
\[\mbox{Normal}\left(p, \sqrt{\frac{p(1-p)}{n}}\right)\]
We can standardize this in a similar way and define a quantity \(Z\) that is normally distributed with a mean of 0 and a standard deviation of 1.
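As a sketch of that standardization, the simulation below draws many sample proportions and converts each to a \(Z\) value. The values p = 0.3 and n = 100 are hypothetical and chosen only for illustration.

```r
# Standardize simulated sample proportions.
# Assumptions (hypothetical): true proportion p = 0.3, sample size n = 100.
set.seed(2)
p <- 0.3
n <- 100
reps <- 10000

# Each draw is the number of successes in n trials, converted to a proportion.
p_hat <- rbinom(reps, size = n, prob = p) / n

# Standard error of p_hat according to the CLT.
se <- sqrt(p * (1 - p) / n)

# Z = (p_hat - p) / SE should be approximately Normal(0, 1).
z <- (p_hat - p) / se
mean(z)  # should be close to 0
sd(z)    # should be close to 1
```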
1. Start with two hypotheses about the population: the null hypothesis and the alternative hypothesis.
2. Choose a (representative) sample, collect data, and analyze the data.
3. Figure out how likely it is to see data like what we observed, IF the null hypothesis were in fact true.
4. If our data would have been extremely unlikely if the null claim were true, then we reject it and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim.
Suppose we test a certain null hypothesis, which can be either true or false (we never know for sure!). We make one of two decisions given our data: either reject or fail to reject \(H_0\).
We have the following four scenarios:
Decision | \(H_0\) is true | \(H_0\) is false |
---|---|---|
Fail to reject \(H_0\) | Correct decision | Type II Error |
Reject \(H_0\) | Type I Error | Correct decision |
It is important to weigh the consequences of making each type of error.
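To make the Type I error concrete, the sketch below repeatedly tests a null hypothesis that is actually true and records how often it is rejected. The cutoff of 0.05, the normal population, and the sample size of 30 are assumptions made for illustration; base R's t.test is used so the example runs on its own.

```r
# Estimate the Type I error rate when H0 is true.
# Assumptions (illustrative): Normal(0, 1) population, n = 30, cutoff alpha = 0.05.
set.seed(3)
alpha <- 0.05
mu_0 <- 0
reps <- 5000

# Each replicate samples from a population where H0 (mu = mu_0) holds
# and stores the two-sided p-value.
p_values <- replicate(
  reps,
  t.test(rnorm(30, mean = mu_0, sd = 1), mu = mu_0)$p.value
)

# Rejecting a true H0 is a Type I error; this proportion should be near alpha.
type_1_rate <- mean(p_values < alpha)
type_1_rate
```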
What changes now that we plan to use a CLT-based approach in doing our testing?
We no longer have to simulate the null distribution. The Central Limit Theorem gives us an approximation for the distribution of our point estimate under the null hypothesis.
Rather than work directly with the sampling distribution of the point estimates, we’ll use standardized versions that we’ll call test statistics.
For tests of \(\mu\):
\[t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}},\]
where \(\mu_0\) is the value of \(\mu\) under the null hypothesis.
For tests of \(p\):
\[z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}},\]
where \(p_0\) is the value of \(p\) under the null hypothesis.
Recall step 3 of our testing framework: Figure out how likely it is to see data like what we observed, IF the null hypothesis were in fact true.
To do this:
Compute the test statistic’s value - all information is obtained from the sample data or value of the parameter under the null hypothesis.
To quantify how likely it is to see this test statistic value given the null hypothesis is true, compute the probability of obtaining a test statistic as extreme or more extreme than what we observed. This probability is calculated from a known distribution.
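The two steps above can be sketched for a test of a proportion. The data here are hypothetical (58 successes in 100 trials, with \(p_0 = 0.5\)) and are not taken from the lecture dataset.

```r
# Test statistic and two-sided p-value for a proportion.
# Assumptions (hypothetical): 58 successes out of n = 100, null value p_0 = 0.5.
successes <- 58
n <- 100
p_0 <- 0.5

p_hat <- successes / n

# Step 1: the test statistic uses the null value p_0 in the standard error.
z <- (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)

# Step 2: probability of a statistic at least this extreme under Normal(0, 1).
p_value <- 2 * pnorm(-abs(z))

z
p_value
```

With these numbers the statistic is 1.6 and the two-sided p-value is roughly 0.11.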
We will be using the pokemon dataset, which contains information about 45 randomly selected Pokemon (from all generations). You may load in the dataset with the following code:
pokemon <- read_csv("pokemon.csv")
Let’s start by looking at the distribution of height_m, the typical height in meters for Pokemon, using a visualization and summary statistics.
Please make a histogram and then find summary statistics.
ggplot(data = pokemon, aes(x = height_m)) +
geom_histogram(binwidth = 0.25, fill = "steelblue", color = "black") +
labs(x = "Height (in meters)",
y = "Count",
title = "Distribution of Pokemon heights")
pokemon_sum <- pokemon %>%
summarise(x_bar = mean(height_m),
sd = sd(height_m),
n = n())
pokemon_sum
## # A tibble: 1 × 3
## x_bar sd n
## <dbl> <dbl> <int>
## 1 1.34 1.82 45
In the previous lecture, we were given the mean, \(\mu\), and standard deviation, \(\sigma\), of the population. That is unrealistic in practice (if we knew \(\mu\) and \(\sigma\) we wouldn’t need to do statistical inference!).
Today we will use our sample data and the Central Limit Theorem to draw conclusions about \(\mu\), the mean height in the population of Pokemon.
What is the point estimate for \(\mu\), i.e., the “best guess” for the mean height of all Pokemon?
What is the point estimate for \(\sigma\), i.e., the “best guess” for the standard deviation of the distribution of Pokemon heights?
Before moving forward, let’s check the conditions required to apply the Central Limit Theorem. Are the following conditions met:
Yes. The Pokemon were randomly selected, so independence is reasonable, and the sample size (n = 45) is greater than 30.
Construct and interpret a 95% confidence interval for the mean height in meters (height_m) of Pokemon species by using the Central Limit Theorem.
The formula for a t-based confidence interval is:
\[\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}\]
df <- pokemon_sum$n - 1
t_star <- qt(0.975, df)
se <- pokemon_sum$sd / sqrt(pokemon_sum$n)
point_est <- pokemon_sum$x_bar
CI <- point_est + c(-1,1) * t_star * se
round(CI, 2)
## [1] 0.80 1.89
The average height of a human is 1.65 meters. Evaluate whether the mean height of Pokemon species differs from 1.65 meters by using the Central Limit Theorem.
In doing so, state your null and alternative hypotheses, the distribution of your test statistic under the null hypothesis, your p-value, decision, and conclusion in context of the research problem. Please use the t_test function here.
pokemon %>%
t_test(response = height_m,
mu = 1.65,
alternative = "two-sided",
conf_int = FALSE)
## # A tibble: 1 × 5
## statistic t_df p_value alternative estimate
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 -1.13 44 0.266 two.sided 1.34
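The reported statistic and p-value can be roughly reproduced by hand from the summary statistics. This sketch uses the rounded values printed earlier (mean 1.34, standard deviation 1.82, n = 45), so it is expected to differ slightly from the t_test output, which works with the unrounded data.

```r
# Reproduce the one-sample t-test from rounded summary statistics.
# Assumption: the rounded values below stand in for the unrounded sample values.
x_bar <- 1.34
s <- 1.82
n <- 45
mu_0 <- 1.65

# Test statistic: t = (x_bar - mu_0) / (s / sqrt(n)).
t_stat <- (x_bar - mu_0) / (s / sqrt(n))
df <- n - 1

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_two_sided <- 2 * pt(-abs(t_stat), df)

t_stat       # close to the reported -1.13
p_two_sided  # close to the reported 0.266
```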
Now evaluate whether the mean height of Pokemon species is lower than 1.65 meters by using the Central Limit Theorem. In doing so, state your null and alternative hypotheses, the distribution of your test statistic under the null hypothesis, your p-value, decision, and conclusion in context of the research problem.
pokemon %>%
t_test(response = height_m,
mu = 1.65,
alternative = "less",
conf_int = FALSE)
## # A tibble: 1 × 5
## statistic t_df p_value alternative estimate
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 -1.13 44 0.133 less 1.34
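The one-sided p-value is the area in the lower tail of the t distribution below the observed statistic. The sketch below uses the rounded statistic (-1.13) and degrees of freedom (44) from the output above, so the result only approximately matches the reported value.

```r
# One-sided ("less") p-value from the reported test statistic.
# Assumption: the rounded statistic -1.13 stands in for the unrounded value.
t_stat <- -1.13
df <- 44

# Lower-tail area: P(T <= t_stat) under a t distribution with 44 df.
p_less <- pt(t_stat, df)

# Because the statistic is negative, the two-sided p-value is twice this area.
p_two_sided <- 2 * p_less

p_less       # close to the reported 0.133
p_two_sided  # close to the reported 0.266
```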
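Suppose the true mean height among Pokemon species is 1.4 meters. In your conclusions from Exercises 4 and 5, did you make the correct decision, a Type I error, or a Type II error? Explain.
This is a Type II error in both cases. If the true mean is 1.4 meters, then it differs from (and is lower than) 1.65 meters, so the null hypothesis is false. We failed to reject it when we should have rejected it.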
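To see how likely a Type II error is in this setting, the sketch below uses base R's power.t.test to approximate the probability of rejecting \(H_0\) in the two-sided test when the true mean is 1.4 meters. It assumes the population standard deviation equals the sample value of 1.82, a cutoff of 0.05, and roughly normal data; these are assumptions made for illustration, not facts about the Pokemon population.

```r
# Approximate probability of a Type II error for the two-sided test.
# Assumptions (illustrative): sd = 1.82, cutoff 0.05, approximately normal data.
result <- power.t.test(
  n = 45,
  delta = 1.65 - 1.4,   # gap between the null value and the assumed true mean
  sd = 1.82,
  sig.level = 0.05,
  type = "one.sample",
  alternative = "two.sided"
)

# Probability of correctly rejecting H0 under these assumptions.
prob_reject <- result$power

# Probability of failing to reject a false H0 (a Type II error).
prob_type_2 <- 1 - prob_reject

prob_reject
prob_type_2
```

Under these assumptions the chance of rejecting is low, so failing to reject with a sample of 45 is not surprising.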