library(tidyverse)
library(tidymodels)
Each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)
The variability of these sample statistics is measured by the standard error
Previously we quantified this value via simulation
Today we’ll discuss some of the theory underlying sampling distributions, particularly as they relate to sample means.
Statistical inference is the act of generalizing from a sample in order to make conclusions regarding a population. As part of this process, we quantify the degree of certainty we have.
We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.
Suppose we’re interested in the resting heart rate of students at Duke, and are able to do the following:
Take a random sample of size \(n\) from this population, and calculate the mean resting heart rate in this sample, \(\bar{X}_1\)
Put the sample back, take a second random sample of size \(n\), and calculate the mean resting heart rate from this new sample, \(\bar{X}_2\)
Put the sample back, take a third random sample of size \(n\), and calculate the mean resting heart rate from this sample, too…and so on.
After repeating this many times, we have a dataset that has the sample averages from the population: \(\bar{X}_1\), \(\bar{X}_2\), \(\cdots\), \(\bar{X}_K\) (assuming we took \(K\) total samples).
Question: Can we say anything about the distribution of these sample means? (Keep in mind, we don’t know what the underlying distribution of mean resting heart rate looks like in Duke students!)
As it turns out, we can…
For a population with a well-defined mean \(\mu\) and standard deviation \(\sigma\), these three properties hold for the distribution of sample average \(\bar{X}\),assuming certain conditions hold:
The mean of the sampling distribution is identical to the population mean \(\mu\),
The standard deviation of the distribution of the sample averages is \(\sigma/\sqrt{n}\), or the standard error (SE) of the mean, and
For \(n\) large enough (in the limit, as \(n \to \infty\)), the shape of the sampling distribution of means is approximately normal (Gaussian).
The normal distribution is unimodal and symmetric and is described by its density function:
If a random variable \(X\) follows the normal distribution, then \[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2} \right\}\] where \(\mu\) is the mean and \(\sigma^2\) is the variance.
We often write \(N(\mu, \sigma^2)\) to describe this distribution.
We will talk about probability densities and using them to define probabilities later, but for now, just know that the normal distribution is the familiar “bell curve”:
The central limit theorem tells us that sample averages are normally distributed, if we have enough data. This is true even if our original variables are not normally distributed.
What are the conditions we need for the CLT to hold?
Independence: The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
Sample size / distribution:
Suppose the bone density for 65-year-old women is normally distributed with mean \(809 mg/cm^3\) and standard deviation of \(140 mg/cm^3\).
Let \(x\) be the bone density of 65-year-old women. We can write this distribution of \(x\) in mathematical notation as
\[x \sim N(809, 140)\]
What bone densities correspond to \(Q_1\) (25th percentile), \(Q_2\) (50th percentile), and \(Q_3\) (the 75th percentile) of this distribution? Use the qnorm()
function to calculate these values.
qnorm(0.25, 809, 140)
## [1] 714.5714
qnorm(0.5, 809, 140)
## [1] 809
qnorm(0.75, 809, 140)
## [1] 903.4286
The densities of three woods are below:
Plywood: 540 mg/cubic centimeter
Pine: 600 mg/cubic centimeter
Mahogany: 710 mg/cubic centimeter
What is the probability that a randomly selected 65-year-old woman has bones less dense than Pine?
Would you be surprised if a randomly selected 65-year-old woman had bone density less than Mahogany? What if she had bone density less than Plywood? Use the respective probabilities to support your response.
pnorm(600, 809, 140)
## [1] 0.06773729
pnorm(710, 809, 140)
## [1] 0.2397389
pnorm(540, 809, 140)
## [1] 0.02733885
Suppose you want to analyze the mean bone density for a group of 10 randomly selected 65-year-old women.
Are the conditions met use the Central Limit Theorem to define the distribution of \(\bar{x}\), the mean density of 10 randomly selected 65-year-old women?
What is the shape, center, and spread of the distribution of \(\bar{x}\), the mean bone density for a group of 10 randomly selected 65-year-old women?
Write the distribution of \(\bar{x}\) using mathematical notation.
\[\bar{x} \sim N(809, {140\over \sqrt(10)})\]
pnorm(600, 809, 140/sqrt(10))
## [1] 1.174428e-06
pnorm(710, 809, 140/sqrt(10))
## [1] 0.01266992
pnorm(540, 809, 140/sqrt(10))
## [1] 6.157391e-10
Please explain how your answers differ in Exercises 2 and 4.
The sampling distribution is going to be much narrower, so the probability of getting something in a tail is much less.