Main ideas

Coming Up

Lecture Notes and Exercises

We will use the packages below:


Bayes’ Rule

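# read_csv() and %>% come from the tidyverse
sta199 <- read_csv("sta199-fa21-year-major.csv")

# P(First-year | Section 02): share of first-years within Section 02
b_given_a <- sta199 %>%
  filter(section == "Section 02") %>%
  count(year) %>%
  mutate(prob = n / sum(n)) %>%
  filter(year == "First-year")

# P(Section 02): share of all students who are in Section 02
a <- sta199 %>%
  count(section) %>%
  mutate(prob = n / sum(n)) %>%
  filter(section == "Section 02")

# P(First-year): share of all students who are first-years
b <- sta199 %>%
  count(year) %>%
  mutate(prob = n / sum(n)) %>%
  filter(year == "First-year")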

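# Bayes' rule: P(Section 02 | First-year) = P(B | A) * P(A) / P(B)
b_given_a$prob * a$prob / b$prob
## [1] 0.2920354
# The same calculation with the probabilities typed in by hand
0.4459459 * 0.2995951 / 0.4574899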
## [1] 0.2920353
sta199 %>%
  filter(year == "First-year") %>%
  count(section) %>%
  mutate(prob = n / sum(n)) %>%
  filter(section == "Section 02")
## # A tibble: 1 × 3
##   section        n  prob
##   <chr>      <int> <dbl>
## 1 Section 02    33 0.292

We can also use a hypothetical population of 10,000 here.

         Sec2    Other    Total
Total                     10,000

We know P(First-year | Section 02); we want P(Section 02 | First-year).
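Written out, with A = "in Section 02" and B = "first-year", Bayes' rule is

\[P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}\]

The hypothetical-10,000 table can be filled in with the same three probabilities used in the hand calculation earlier in this section. The following is a minimal sketch of one way to do that; the variable names and the row labels "First-year" and "Other" are illustrative assumptions, not part of the original notes.

```r
# Probabilities taken from the hand calculation earlier in this section
p_s2          <- 0.2995951  # P(Section 02)
p_fy_given_s2 <- 0.4459459  # P(First-year | Section 02)
p_fy          <- 0.4574899  # P(First-year)

total <- 10000  # hypothetical population size

# Fill in the cells of the hypothetical table
n_s2       <- total * p_s2          # Section 02 column total
n_fy       <- total * p_fy          # First-year row total
n_fy_s2    <- n_s2 * p_fy_given_s2  # first-years in Section 02
n_fy_other <- n_fy - n_fy_s2        # first-years in other sections

hyp_table <- rbind(
  "First-year" = c(Sec2 = n_fy_s2, Other = n_fy_other, Total = n_fy),
  "Other"      = c(n_s2 - n_fy_s2, (total - n_s2) - n_fy_other, total - n_fy),
  "Total"      = c(n_s2, total - n_s2, total)
)
print(round(hyp_table))

# P(Section 02 | First-year): first-years in Section 02 over all first-years
p_s2_given_fy <- n_fy_s2 / n_fy
p_s2_given_fy
```

Dividing the "first-years in Section 02" cell by the first-year row total reproduces the Bayes' rule answer of about 0.292, because the hypothetical population size cancels out of the ratio.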

The statistical process

Statistics is a process that converts data into useful information, whereby practitioners

  1. form a question of interest,

  2. collect and summarize data,

  3. and interpret the results.

The population of interest

The population is the group we’d like to learn something about. For example:

The research question of interest is what we want to answer. It often relates to one or more numerical quantities or summary statistics.

If we had data from every unit in the population, we could just calculate what we wanted and be done!

Sampling from the population

Unfortunately, we (usually) have to settle for a sample from the population.

Ideally, the sample is representative, allowing us to make conclusions that are generalizable to the broader population of interest.

In order to make a formal statistical statement about the broader population of interest when all we have is a sample, we need to use the tools of probability and statistical inference.

Big picture

We’ll discuss a few population characteristics we’ll be interested in.

Explanatory and response variables

When we suspect one variable might causally affect another, we label the first variable the explanatory variable and the second the response variable. You may also hear the terms “independent variable” and “dependent variable” used in place of explanatory variable and response variable, respectively. Whether or not we can actually make a causal connection depends on the type of statistical study (more on this shortly).

\[\mbox{Explanatory Variable} \longrightarrow \mbox{Response Variable}\]

Do larger homes in good locations lead to higher home selling prices? What are the explanatory and response variables?

Population, parameter; sample, statistic

Population: a group of individuals or objects we are interested in studying

Parameter: a numerical quantity derived from the population (almost always unknown)

If we had data from every unit in the population, we could just calculate population parameters and be done! Unfortunately, we usually cannot do this.

Sample: a subset of our population of interest

Statistic: a numerical quantity derived from a sample. (Associate the statistic with the sample.)

Naturally, it makes sense to use the sample mean (and other quantities derived from the sample) to make generalizations about the population mean.
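To make the four terms concrete, here is a small simulation sketch. The population (heights in centimeters), its size, and the sample size are all hypothetical choices made for illustration; in practice the population mean would not be available.

```r
set.seed(199)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 100,000 individuals
population <- rnorm(100000, mean = 170, sd = 10)

# Parameter: computed from the entire population (usually unknown in practice)
pop_mean <- mean(population)

# Sample: a simple random sample of 100 units from the population
samp <- sample(population, size = 100)

# Statistic: computed from the sample, used to estimate the parameter
samp_mean <- mean(samp)

c(parameter = pop_mean, statistic = samp_mean)
```

The statistic will typically land close to the parameter but not exactly on it, and it changes every time a new sample is drawn.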

Statistical inference

Statistical inference is the process of using sample data to make conclusions about the underlying population the sample came from.

In the coming lectures we’ll discuss each of these inference approaches.

Before we get into this, let’s discuss ways samples can be obtained and what types of conclusions we will and will not be able to make as a result of our statistical process.


Sampling strategies

  • In our discussions on probability, we considered randomly selecting individuals from studies, where each individual was equally likely to be selected. This form of random sampling is known as simple random sampling.

  • Stratified sampling divides the population into strata such that each stratum is homogeneous. Then a simple random sample is taken within each stratum.

    • Can you think of a reason why we would employ this technique?

Sometimes this is done to make sure we get a sufficient sample from each demographic group. The results are then weighted to match each group’s approximate percentage in the population.

  • Cluster sampling first partitions the population into clusters, where each cluster is representative of the population. A fixed number of clusters is selected, and all observations within each selected cluster are included in the sample.

  • Multistage sampling is similar to cluster sampling, but rather than keep all observations in each cluster, only a random sample of observations is kept.
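The first two strategies in this list can be sketched in a few lines of dplyr. The population below is hypothetical (1,000 students, with class year as the stratum), and the sample sizes are arbitrary illustrative choices.

```r
library(dplyr)

set.seed(199)  # arbitrary seed for reproducibility

# Hypothetical population of 1,000 students with a class year (the stratum)
population <- tibble(
  id   = 1:1000,
  year = rep(c("First-year", "Sophomore", "Junior", "Senior"),
             times = c(400, 300, 200, 100))
)

# Simple random sample: every individual is equally likely to be selected
srs <- population %>%
  slice_sample(n = 100)

# Stratified sample: a simple random sample of 25 within each stratum
stratified <- population %>%
  group_by(year) %>%
  slice_sample(n = 25) %>%
  ungroup()

count(srs, year)         # group sizes vary from sample to sample
count(stratified, year)  # exactly 25 per year by construction
```

In the simple random sample the smallest group (seniors) may end up with only a handful of observations, whereas the stratified sample guarantees 25 from every year.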


Suppose we are interested in estimating the malaria rate in a densely forested tropical part of rural Indonesia. We learn that there are 30 villages in that part of the Indonesian jungle, each more or less similar to the next. Our goal is to test 150 individuals for malaria. What are the costs and benefits of using the four aforementioned sampling techniques?

  • Simple random sample: expensive, may not get good representation from all 30 villages

  • Stratified sample: not clear how to build strata on an individual basis. If strata are the villages, then some villages will be left out.

  • Cluster sample / multistage: these are the best options here.
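A rough sketch of the cluster and multistage designs for this scenario follows. The 30 villages and the target of 150 individuals come from the example; the village size of 200 residents, and the choices of 3 villages for the cluster sample and 10 villages with 15 residents each for the multistage sample, are assumptions made purely for illustration.

```r
library(dplyr)

set.seed(199)  # arbitrary seed for reproducibility

# Hypothetical population: 30 villages with 200 residents each
residents <- tibble(
  village  = rep(1:30, each = 200),
  resident = rep(1:200, times = 30)
)

# Cluster sample: select 3 villages and keep every resident in them
cluster_villages <- sample(1:30, size = 3)
cluster_sample <- residents %>%
  filter(village %in% cluster_villages)

# Multistage sample: select 10 villages, then 15 residents within each
multistage_villages <- sample(1:30, size = 10)
multistage_sample <- residents %>%
  filter(village %in% multistage_villages) %>%
  group_by(village) %>%
  slice_sample(n = 15) %>%
  ungroup()

nrow(cluster_sample)     # whole villages, so the size is not freely chosen
nrow(multistage_sample)  # 10 villages x 15 residents = 150
```

Under these assumptions the multistage design hits the target of 150 tests while visiting only 10 villages, whereas the cluster design's sample size is dictated by the sizes of the selected villages.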

Sample bias

  • The four sampling strategies help reduce bias in our sample. A biased sample can lead to erroneous conclusions.

  • Bias can still appear if the non-response rate is very high.

    • Is our sample representative of the population or is it representative of the population that “responded” to the survey?

Statistical studies and conclusions

Observational studies and experiments

  • Observational

    • Collect data in a way that does not interfere with how the data arise (“observe”)
    • Only establish an association
    • Data often cheaper and easier to collect
  • Experimental

    • Randomly assign subjects to treatments
    • Establish causal connections
    • Often more expensive
    • Sometimes it is impossible or unethical to design an experiment

Random sampling vs. random assignment

What do you think Pfizer did in their trials for the COVID-19 vaccine development?

It was an experimental design.
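The difference between random sampling and random assignment can be sketched in base R. Everything here is hypothetical: the population of 1,000 numbered people, the study size of 20, and the two equally sized groups are illustrative assumptions, not a description of any real trial.

```r
set.seed(199)  # arbitrary seed for reproducibility

# Hypothetical population of 1,000 people, identified by number
population_ids <- 1:1000

# Random sampling: decides WHO is in the study (supports generalizing)
study_ids <- sample(population_ids, size = 20)

# Random assignment: decides WHICH treatment each participant receives
# (supports causal conclusions)
study <- data.frame(
  id    = study_ids,
  group = sample(rep(c("treatment", "control"), each = 10))
)

table(study$group)
```

Shuffling a balanced vector of labels guarantees 10 participants per group while leaving the assignment of any individual to chance.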

Confounding variables

A confounding variable is an extraneous variable that affects both the explanatory and the response variable, and makes it seem like there is a relationship between them.

Identify the confounding variable in each of the following statements:

  1. As ice cream sales increase, the number of shark attacks also increases.

Both increase in the summer, when people go to the beach.

  2. The higher the number of firefighters at a fire, the greater the amount of damage caused by that fire.

More firefighters are sent to more severe fires.

  3. Taller children are better at both reading and math compared to shorter children.

Taller children are also usually older and in more advanced grades.

One method to justify making causal conclusions from observational studies is to exhaust the search for confounding variables. However, there is no guarantee that all confounding variables can be examined or measured. Therefore, it is best to only discuss associations between variables from observational studies.
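A small simulation can illustrate how a confounder manufactures an association. In this sketch, loosely modeled on the ice cream and shark attack example, all numbers are made up: a standardized "temperature" variable drives both outcomes, and neither outcome affects the other.

```r
set.seed(199)  # arbitrary seed for reproducibility

n <- 5000

# Hypothetical confounder: daily temperature (standardized)
temperature <- rnorm(n)

# Both variables depend on temperature but not on each other
ice_cream     <- temperature + rnorm(n)
shark_attacks <- temperature + rnorm(n)

# Marginal association: clearly positive
cor(ice_cream, shark_attacks)

# Adjusting for the confounder: the ice cream coefficient is near zero
fit <- lm(shark_attacks ~ ice_cream + temperature)
coef(fit)
```

The adjustment works here only because the simulated confounder is known and measured, which is exactly what cannot be guaranteed in a real observational study.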


Polls and statistical terminology

Go to the Monmouth University Polling Institute website and select a poll of interest. Briefly read the poll results and the methodology section at the end. Try to identify the following:

Here’s one example from earlier this year (use a different one):

  • Population of interest: US registered voters.

  • Parameter of interest: Proportion of all US registered voters approving of President Biden’s job performance.

  • Sample: The US registered voters who responded to the poll.

  • Sample size: 802

  • Sample statistic: Proportion of poll respondents approving of President Biden’s job performance.

  • Sample statistic’s value: 51% or 0.51.

Link to poll:

Discuss your survey here

[Answers will vary.]

Confounding variables

A group of researchers decide to study the causes of heart disease by carrying out an observational study. The researchers find that the people in their study who ate lots of red meat also developed heart disease. They believe they have found a link (or ‘correlation’) between eating red meat and developing heart disease, and they (or those reading their research) might be tempted to conclude that eating lots of red meat is a cause of heart disease. However, before making a conclusion like this, the researchers must think about confounding factors (variables).

List three confounding factors that could be at play here.

There are many you could list, but three are:

  1. Amount of exercise

  2. Age

  3. Gender

Given that the above study was observational, what type of conclusion can be made?

We can only establish a correlation (association); we cannot make claims about causality.

