Learning goals

Introduction

library(tidyverse)
library(knitr)
sta199 <- read_csv("sta199-fa21-year-major.csv")

For this Application Exercise, we will look at the year in school and majors for students taking STA 199 in Fall 2021. The data set includes the following variables:

Definitions

Exercise 1

Let’s take a look at the majors. Note that we have categorized majors so that each student can only be in one major category.

sta199 %>% 
  distinct(major_category) %>% 
  kable()
| major_category     |
|:-------------------|
| other              |
| pubpol only        |
| stats only         |
| compsci only       |
| undecided          |
| stat + other major |
| econ only          |
sta199 %>% 
  count(major_category) %>%
  mutate(prop = n / sum(n)) %>%
  kable()
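| major_category     |  n |      prop |
|:-------------------|---:|----------:|
| compsci only       | 40 | 0.1619433 |
| econ only          | 15 | 0.0607287 |
| other              | 98 | 0.3967611 |
| pubpol only        | 38 | 0.1538462 |
| stat + other major | 36 | 0.1457490 |
| stats only         | 10 | 0.0404858 |
| undecided          | 10 | 0.0404858 |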
sta199 %>% 
  count(major_category) %>%
  mutate(prop = n / sum(n)) %>%
  filter(major_category == "pubpol only")
## # A tibble: 1 × 3
##   major_category     n  prop
##   <chr>          <int> <dbl>
## 1 pubpol only       38 0.154
sta199 %>% 
  mutate(stats_any = ifelse(major_category == "stats only" | major_category == "stat + other major", 1, 0)) %>%
  summarize(mean_stats = mean(stats_any))
## # A tibble: 1 × 1
##   mean_stats
##        <dbl>
## 1      0.186
sta199 %>% 
  mutate(not_pub_pol = ifelse(major_category == "pubpol only", 0, 1)) %>%
  summarize(mean_not_pub_pol = mean(not_pub_pol))
## # A tibble: 1 × 1
##   mean_not_pub_pol
##              <dbl>
## 1            0.846
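
The last two results can be cross-checked with the complement rule and with the mean of a logical vector, which gives the proportion of TRUE values. The sketch below is not part of the original exercise: it rebuilds a stand-in data frame (the hypothetical `students`) from the counts in the table above instead of reading the CSV, and the names `major_counts` and `props` are likewise illustrative.

```r
library(dplyr)
library(tibble)

# Counts copied from the count(major_category) table above.
major_counts <- c("compsci only" = 40, "econ only" = 15, "other" = 98,
                  "pubpol only" = 38, "stat + other major" = 36,
                  "stats only" = 10, "undecided" = 10)

# One row per student: a stand-in for sta199 (assumption, not the real file).
students <- tibble(major_category = rep(names(major_counts), times = major_counts))

props <- students %>%
  summarize(
    p_pubpol     = mean(major_category == "pubpol only"),
    p_not_pubpol = 1 - p_pubpol,  # complement rule
    p_stats_any  = mean(major_category %in% c("stats only", "stat + other major"))
  )

props
```

Under these assumptions, `p_not_pubpol` and `p_stats_any` should agree with the 0.846 and 0.186 computed above, since 1 - 38/247 = 209/247 and (10 + 36)/247 = 46/247.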

Exercise 2

Now let’s make a table looking at the relationship between year and major.

sta199 %>%
  count(year, major_category)
## # A tibble: 23 × 3
##    year       major_category         n
##    <chr>      <chr>              <int>
##  1 First-year compsci only           8
##  2 First-year econ only              6
##  3 First-year other                 39
##  4 First-year pubpol only           22
##  5 First-year stat + other major    26
##  6 First-year stats only             7
##  7 First-year undecided              5
##  8 Junior     compsci only           7
##  9 Junior     econ only              3
## 10 Junior     other                 12
## # … with 13 more rows

We’ll reformat the data into a contingency table, a table frequently used to study the association between two categorical variables. In this contingency table, each row will represent a year, each column will represent a major category, and each cell will hold the number of students who have that particular combination of year and major.

To make the contingency table, we will use a new function called pivot_wider(). It comes from the tidyr package, which is loaded as part of the tidyverse. It will take the data frame produced by count(), which is currently in a “long” format, and reshape it into a “wide” format.

We will also use the kable() function in the knitr package to neatly format our new table.

sta199 %>%
  count(year, major_category) %>%
  pivot_wider(id_cols = year, # column that identifies each row
              names_from = major_category, # values that become the column names
              values_from = n, # values used for each cell
              values_fill = 0) %>% # fill cells that have no observations with 0
  kable() # neatly display the results
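| year       | compsci only | econ only | other | pubpol only | stat + other major | stats only | undecided |
|:-----------|-------------:|----------:|------:|------------:|-------------------:|-----------:|----------:|
| First-year |            8 |         6 |    39 |          22 |                 26 |          7 |         5 |
| Junior     |            7 |         3 |    12 |           4 |                  1 |          0 |         0 |
| Senior     |            2 |         0 |     5 |           1 |                  1 |          0 |         0 |
| Sophomore  |           23 |         6 |    42 |          11 |                  8 |          3 |         5 |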
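
To see what values_fill contributes, here is a minimal sketch using three of the counts from the table above. The object names (`long_counts`, `wide_na`, `wide_zero`) are illustrative and not part of the exercise. count() produces no row for a combination with zero students, so that combination is missing from the long data.

```r
library(tibble)
library(tidyr)

# Three rows in "long" format, copied from the counts above.
# There is no Senior / econ only row because that count is zero.
long_counts <- tribble(
  ~year,    ~major_category, ~n,
  "Junior", "compsci only",   7,
  "Junior", "econ only",      3,
  "Senior", "compsci only",   2
)

# Without values_fill, the missing combination becomes NA.
wide_na <- pivot_wider(long_counts,
                       id_cols = year,
                       names_from = major_category,
                       values_from = n)

# With values_fill = 0, the missing combination is reported as a count of 0.
wide_zero <- pivot_wider(long_counts,
                         id_cols = year,
                         names_from = major_category,
                         values_from = n,
                         values_fill = 0)

wide_na
wide_zero
```

For a table of counts, 0 is the natural fill value because an absent combination really does mean that no students fall in that cell.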

Exercise 3

For each of the following exercises:

  1. Calculate the probability using the contingency table above.

  2. Then write code to check your answer using the sta199 data frame and dplyr functions.

sta199 %>% 
  count(year) %>%
  mutate(prop = n / sum(n)) %>%
  filter(year == "Sophomore")
## # A tibble: 1 × 3
##   year          n  prop
##   <chr>     <int> <dbl>
## 1 Sophomore    98 0.397
sta199 %>% 
  count(major_category) %>%
  mutate(prop = n / sum(n)) %>%
  filter(major_category == "compsci only")
## # A tibble: 1 × 3
##   major_category     n  prop
##   <chr>          <int> <dbl>
## 1 compsci only      40 0.162
sta199 %>% 
  mutate(comp_soph = ifelse(year == "Sophomore" | major_category == "compsci only", 1, 0)) %>%
  summarize(mean_stats = mean(comp_soph))
## # A tibble: 1 × 1
##   mean_stats
##        <dbl>
## 1      0.466
sta199 %>% 
  mutate(comp_soph = ifelse(year == "Sophomore" & major_category == "compsci only", 1, 0)) %>%
  summarize(mean_stats = mean(comp_soph))
## # A tibble: 1 × 1
##   mean_stats
##        <dbl>
## 1     0.0931
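
The same four probabilities can be worked out by hand from the contingency table in Exercise 2. The sketch below is one way to do that, not the exercise's official solution: it hard-codes the table as a matrix (the name `counts` and the `p_*` variables are illustrative) and applies the addition rule, P(A or B) = P(A) + P(B) - P(A and B).

```r
# Counts copied from the contingency table in Exercise 2
# (rows = year, columns = major category).
counts <- matrix(
  c( 8, 6, 39, 22, 26, 7, 5,
     7, 3, 12,  4,  1, 0, 0,
     2, 0,  5,  1,  1, 0, 0,
    23, 6, 42, 11,  8, 3, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    year = c("First-year", "Junior", "Senior", "Sophomore"),
    major_category = c("compsci only", "econ only", "other", "pubpol only",
                       "stat + other major", "stats only", "undecided")
  )
)

total    <- sum(counts)                                # all students in the table
p_soph   <- sum(counts["Sophomore", ]) / total         # row total / grand total
p_comp   <- sum(counts[, "compsci only"]) / total      # column total / grand total
p_both   <- counts["Sophomore", "compsci only"] / total  # single cell / grand total
p_either <- p_soph + p_comp - p_both                   # addition rule

round(c(soph = p_soph, comp = p_comp, both = p_both, either = p_either), 3)
```

Subtracting `p_both` matters because the 23 sophomore computer science students are counted in both the Sophomore row total and the compsci only column total.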

Resources