pivot_wider() and kable()
library(tidyverse)
library(knitr)
sta199 <- read_csv("sta199-fa21-year-major.csv")
For this Application Exercise, we will look at the year in school and majors for students taking STA 199 in Fall 2021. The data set includes the following variables:
section: STA 199 section
year: Year in school
major_category: Major / academic interest
Let’s take a look at the majors. Note that we have categorized majors so that each student can only be in one major category.
sta199 %>%
distinct(major_category) %>%
kable()
| major_category |
|---|
| other |
| pubpol only |
| stats only |
| compsci only |
| undecided |
| stat + other major |
| econ only |
sta199 %>%
count(major_category) %>%
mutate(prop = n / sum(n)) %>%
kable()
| major_category | n | prop |
|---|---|---|
| compsci only | 40 | 0.1619433 |
| econ only | 15 | 0.0607287 |
| other | 98 | 0.3967611 |
| pubpol only | 38 | 0.1538462 |
| stat + other major | 36 | 0.1457490 |
| stats only | 10 | 0.0404858 |
| undecided | 10 | 0.0404858 |
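The proportions above print with seven decimal places. As a minimal sketch, kable()'s digits argument can round them for display. The counts below are copied from the table so the example runs without the CSV file; the names major_counts and major_props are introduced here for illustration only.

```r
library(dplyr)
library(knitr)

# Counts copied from the table above (stand-in for count(major_category))
major_counts <- tibble(
  major_category = c("compsci only", "econ only", "other", "pubpol only",
                     "stat + other major", "stats only", "undecided"),
  n = c(40, 15, 98, 38, 36, 10, 10)
)

# Same proportion calculation as in the pipeline above
major_props <- major_counts %>%
  mutate(prop = n / sum(n))

# digits = 3 rounds numeric columns in the displayed table only;
# the values stored in major_props are unchanged
major_props %>%
  kable(digits = 3)
```

Rounding at display time keeps full precision available for any later calculation.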
sta199 %>%
count(major_category) %>%
mutate(prop = n / sum(n)) %>%
filter(major_category == "pubpol only")
## # A tibble: 1 × 3
## major_category n prop
## <chr> <int> <dbl>
## 1 pubpol only 38 0.154
sta199 %>%
mutate(stats_any = ifelse(major_category == "stats only" | major_category == "stat + other major", 1, 0)) %>%
summarize(mean_stats = mean(stats_any))
## # A tibble: 1 × 1
## mean_stats
## <dbl>
## 1 0.186
sta199 %>%
mutate(not_pub_pol = ifelse(major_category == "pubpol only", 0, 1)) %>%
summarize(mean_not_pub_pol = mean(not_pub_pol))
## # A tibble: 1 × 1
## mean_not_pub_pol
## <dbl>
## 1 0.846
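The two chunks above compute a proportion as the mean of a 0/1 indicator. Because R treats TRUE as 1 and FALSE as 0, the mean of a logical condition gives the same proportion without the ifelse() step. The sketch below rebuilds one row per student from the major counts shown earlier, so it runs without the CSV file; the students data frame is a stand-in introduced here, not the sta199 data itself.

```r
library(dplyr)

# One row per student, rebuilt from the major counts in the table above
students <- tibble(
  major_category = rep(
    c("compsci only", "econ only", "other", "pubpol only",
      "stat + other major", "stats only", "undecided"),
    times = c(40, 15, 98, 38, 36, 10, 10)
  )
)

# mean() of a logical vector is the proportion of TRUE values
props <- students %>%
  summarize(
    prop_stats_any = mean(major_category %in% c("stats only", "stat + other major")),
    prop_not_pubpol = mean(major_category != "pubpol only")
  )

props
```

Both approaches give the same proportions; the logical version is simply shorter.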
Now let’s make a table looking at the relationship between year and major.
sta199 %>%
count(year, major_category)
## # A tibble: 23 × 3
## year major_category n
## <chr> <chr> <int>
## 1 First-year compsci only 8
## 2 First-year econ only 6
## 3 First-year other 39
## 4 First-year pubpol only 22
## 5 First-year stat + other major 26
## 6 First-year stats only 7
## 7 First-year undecided 5
## 8 Junior compsci only 7
## 9 Junior econ only 3
## 10 Junior other 12
## # … with 13 more rows
We’ll reformat the data into a contingency table, a table frequently used to study the association between two categorical variables. In this contingency table, each row will represent a year, each column will represent a major category, and each cell will hold the number of students who have that particular combination of year and major.
To make the contingency table, we will use a new function called pivot_wider(), which comes from the tidyr package (loaded with the tidyverse). It will take the data frame produced by count(), which is currently in a “long” format, and reshape it into a “wide” format.
We will also use the kable() function in the knitr package to neatly format our new table.
sta199 %>%
count(year, major_category) %>%
pivot_wider(id_cols = year, # column that identifies each row
            names_from = major_category, # values become the new column names
            values_from = n, # values used for each cell
            values_fill = 0) %>% # fill cells that have no observations with 0
kable() # neatly display the results
| year | compsci only | econ only | other | pubpol only | stat + other major | stats only | undecided |
|---|---|---|---|---|---|---|---|
| First-year | 8 | 6 | 39 | 22 | 26 | 7 | 5 |
| Junior | 7 | 3 | 12 | 4 | 1 | 0 | 0 |
| Senior | 2 | 0 | 5 | 1 | 1 | 0 | 0 |
| Sophomore | 23 | 6 | 42 | 11 | 8 | 3 | 5 |
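The zeros in the table come from values_fill = 0: count() produces no row for a year and major combination with no students, and pivot_wider() would otherwise leave that cell as NA. The sketch below illustrates the difference on a three-row subset of the counts above; long_counts, wide_na, and wide_zero are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Three rows taken from the long counts above.
# There is deliberately no row for Junior / "stats only".
long_counts <- tibble(
  year = c("First-year", "First-year", "Junior"),
  major_category = c("stats only", "econ only", "econ only"),
  n = c(7, 6, 3)
)

# Default behaviour: the missing combination becomes NA
wide_na <- long_counts %>%
  pivot_wider(names_from = major_category, values_from = n)

# With values_fill = 0 the missing combination becomes a count of 0
wide_zero <- long_counts %>%
  pivot_wider(names_from = major_category, values_from = n, values_fill = 0)

wide_na
wide_zero
```

For a table of counts, 0 is the accurate value for an empty cell, whereas NA would suggest the count is unknown.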
For each of the following exercises:
Calculate the probability using the contingency table above.
Then write code to check your answer using the sta199 data frame and dplyr functions.
sta199 %>%
count(year) %>%
mutate(prop = n / sum(n)) %>%
filter(year == "Sophomore")
## # A tibble: 1 × 3
## year n prop
## <chr> <int> <dbl>
## 1 Sophomore 98 0.397
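For the hand calculation, the Sophomore row of the contingency table sums to the number of sophomores, and the four row totals (113, 27, 9, and 98) sum to the class size:

```latex
P(\text{Sophomore})
  = \frac{23 + 6 + 42 + 11 + 8 + 3 + 5}{113 + 27 + 9 + 98}
  = \frac{98}{247}
  \approx 0.397
```

This agrees with the proportion returned by the code above.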
sta199 %>%
count(major_category) %>%
mutate(prop = n / sum(n)) %>%
filter(major_category == "compsci only")
## # A tibble: 1 × 3
## major_category n prop
## <chr> <int> <dbl>
## 1 compsci only 40 0.162
sta199 %>%
mutate(comp_soph = ifelse(year == "Sophomore" | major_category == "compsci only", 1, 0)) %>%
summarize(mean_stats = mean(comp_soph))
## # A tibble: 1 × 1
## mean_stats
## <dbl>
## 1 0.466
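By hand, the "or" probability follows from the addition rule. Writing S for "Sophomore" and C for "compsci only" (labels chosen here for brevity), the sophomore compsci students appear in both the Sophomore row total and the compsci only column total, so they must be subtracted once:

```latex
P(S \cup C) = P(S) + P(C) - P(S \cap C)
            = \frac{98}{247} + \frac{40}{247} - \frac{23}{247}
            = \frac{115}{247}
            \approx 0.466
```

This agrees with the value computed from the data frame above.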
sta199 %>%
mutate(comp_soph = ifelse(year == "Sophomore" & major_category == "compsci only", 1, 0)) %>%
summarize(mean_stats = mean(comp_soph))
## # A tibble: 1 × 1
## mean_stats
## <dbl>
## 1 0.0931
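By hand, the "and" probability is a single cell of the contingency table divided by the class size. Writing S for "Sophomore" and C for "compsci only" (labels chosen here for brevity), the cell in the Sophomore row and compsci only column is 23:

```latex
P(S \cap C) = \frac{23}{247} \approx 0.0931
```

This agrees with the value computed from the data frame above.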
pivot_wider and pivot_longer