pivot_wider() and kable()
library(tidyverse)
library(knitr)
sta199 <- read_csv("sta199-fa21-year-major.csv")
For this Application Exercise, we will look at the year in school and majors for students taking STA 199 in Fall 2021. The data set includes the following variables:
section: STA 199 section
year: Year in school
major_category: Major / academic interest
Let’s take a look at the majors. Note that we have categorized majors so that each student can only be in one major category.
sta199 %>%
  distinct(major_category) %>%
  kable()
major_category |
---|
other |
pubpol only |
stats only |
compsci only |
undecided |
stat + other major |
econ only |
# Count and proportion of students in each major category
sta199 %>%
  count(major_category) %>%
  mutate(prop = n / sum(n)) %>%
  kable()
major_category | n | prop |
---|---|---|
compsci only | 40 | 0.1619433 |
econ only | 15 | 0.0607287 |
other | 98 | 0.3967611 |
pubpol only | 38 | 0.1538462 |
stat + other major | 36 | 0.1457490 |
stats only | 10 | 0.0404858 |
undecided | 10 | 0.0404858 |
# Proportion of students in the "pubpol only" category
sta199 %>%
  count(major_category) %>%
  mutate(prop = n / sum(n)) %>%
  filter(major_category == "pubpol only")
## # A tibble: 1 × 3
## major_category n prop
## <chr> <int> <dbl>
## 1 pubpol only 38 0.154
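# Proportion of students with any statistics major:
# stats_any is 1 for either statistics category and 0 otherwise, so its mean is a proportion
sta199 %>%
  mutate(stats_any = ifelse(major_category == "stats only" | major_category == "stat + other major", 1, 0)) %>%
  summarize(mean_stats = mean(stats_any))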
## # A tibble: 1 × 1
## mean_stats
## <dbl>
## 1 0.186
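As a hand check on the result above, the same proportion can be read off the counts in the major table: 10 students are "stats only", 36 are "stat + other major", and the n column sums to 247 students. This is only the arithmetic behind the mean of the 0/1 indicator, not a different method.

```latex
\[
P(\text{stats only or stat + other major}) = \frac{10 + 36}{247} = \frac{46}{247} \approx 0.186
\]
```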
# Proportion of students who are not in the "pubpol only" category
sta199 %>%
  mutate(not_pub_pol = ifelse(major_category == "pubpol only", 0, 1)) %>%
  summarize(mean_not_pub_pol = mean(not_pub_pol))
## # A tibble: 1 × 1
## mean_not_pub_pol
## <dbl>
## 1 0.846
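The result above can also be checked with the complement rule, using the 38 "pubpol only" students out of 247 from the major table:

```latex
\[
P(\text{not pubpol only}) = 1 - P(\text{pubpol only}) = 1 - \frac{38}{247} = \frac{209}{247} \approx 0.846
\]
```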
Now let’s make a table looking at the relationship between year and major.
sta199 %>%
  count(year, major_category)
## # A tibble: 23 × 3
## year major_category n
## <chr> <chr> <int>
## 1 First-year compsci only 8
## 2 First-year econ only 6
## 3 First-year other 39
## 4 First-year pubpol only 22
## 5 First-year stat + other major 26
## 6 First-year stats only 7
## 7 First-year undecided 5
## 8 Junior compsci only 7
## 9 Junior econ only 3
## 10 Junior other 12
## # … with 13 more rows
We’ll reformat the data into a contingency table, a table frequently used to study the association between two categorical variables. In this contingency table, each row will represent a year, each column will represent a major category, and each cell will be the number of students who have a particular combination of year and major.
To make the contingency table, we will use a new function called pivot_wider() from the tidyr package, which is loaded with the tidyverse. It takes the data frame produced by count(), which is currently in a “long” format, and reshapes it into a “wide” format.
We will also use the kable() function from the knitr package to neatly format our new table.
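Before applying this to the class data, a minimal sketch on a made-up data frame may help show what the reshaping does. The object names long_counts, wide_na, and wide_zero and the three rows of data are hypothetical, chosen so that one year/major combination is missing, as can happen in count() output.

```r
library(tidyverse)

# Hypothetical long-format counts: one row per (year, major) combination observed.
# There is no row for Senior / stats, as would happen if no such student existed.
long_counts <- tribble(
  ~year,    ~major,    ~n,
  "Junior", "compsci", 7,
  "Junior", "stats",   1,
  "Senior", "compsci", 2
)

# Wide format: one row per year, one column per major.
# Without values_fill, the missing combination becomes NA.
wide_na <- long_counts %>%
  pivot_wider(names_from = major, values_from = n)

# With values_fill = 0, the missing combination is recorded as a count of 0.
wide_zero <- long_counts %>%
  pivot_wider(names_from = major, values_from = n, values_fill = 0)
```

In a table of counts, 0 is usually the more sensible fill, since a missing combination means no students rather than an unknown number.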
sta199 %>%
  count(year, major_category) %>%
  pivot_wider(id_cols = year,               # column that identifies each row
              names_from = major_category,  # values that become the new column names
              values_from = n,              # values used to fill each cell
              values_fill = 0) %>%          # use 0 for combinations with no students
  kable()                                   # neatly display the results
year | compsci only | econ only | other | pubpol only | stat + other major | stats only | undecided |
---|---|---|---|---|---|---|---|
First-year | 8 | 6 | 39 | 22 | 26 | 7 | 5 |
Junior | 7 | 3 | 12 | 4 | 1 | 0 | 0 |
Senior | 2 | 0 | 5 | 1 | 1 | 0 | 0 |
Sophomore | 23 | 6 | 42 | 11 | 8 | 3 | 5 |
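Probabilities read from a contingency table need row totals, column totals, and the grand total. The sketch below is one way to get them, assuming the table is retyped by hand from the kable() output above; the names year_major, with_totals, major_totals, and n_students are made up for this illustration.

```r
library(tidyverse)

# Contingency table retyped from the kable() output above
year_major <- tribble(
  ~year,        ~`compsci only`, ~`econ only`, ~other, ~`pubpol only`, ~`stat + other major`, ~`stats only`, ~undecided,
  "First-year",  8, 6, 39, 22, 26, 7, 5,
  "Junior",      7, 3, 12,  4,  1, 0, 0,
  "Senior",      2, 0,  5,  1,  1, 0, 0,
  "Sophomore",  23, 6, 42, 11,  8, 3, 5
)

# Row totals: number of students in each year
with_totals <- year_major %>%
  mutate(total = rowSums(across(-year)))

# Column totals: number of students in each major category
major_totals <- year_major %>%
  summarize(across(-year, sum))

# Grand total: number of students in the table
n_students <- sum(with_totals$total)
```

Dividing a row total, a column total, or a single cell by n_students gives the corresponding marginal or joint probability.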
For each of the following exercises:
Calculate the probability using the contingency table above.
Then write code to check your answer using the sta199 data frame and dplyr functions.
# Proportion of students who are sophomores
sta199 %>%
  count(year) %>%
  mutate(prop = n / sum(n)) %>%
  filter(year == "Sophomore")
## # A tibble: 1 × 3
## year n prop
## <chr> <int> <dbl>
## 1 Sophomore 98 0.397
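For comparison, the calculation from the contingency table sums the Sophomore row and divides by the total number of students, 247 (the sum of all cells):

```latex
\[
P(\text{Sophomore}) = \frac{23 + 6 + 42 + 11 + 8 + 3 + 5}{247} = \frac{98}{247} \approx 0.397
\]
```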
# Proportion of students in the "compsci only" category
sta199 %>%
  count(major_category) %>%
  mutate(prop = n / sum(n)) %>%
  filter(major_category == "compsci only")
## # A tibble: 1 × 3
## major_category n prop
## <chr> <int> <dbl>
## 1 compsci only 40 0.162
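The matching hand calculation sums the "compsci only" column of the contingency table and divides by the 247 students in total:

```latex
\[
P(\text{compsci only}) = \frac{8 + 7 + 2 + 23}{247} = \frac{40}{247} \approx 0.162
\]
```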
# Proportion of students who are sophomores or in "compsci only" (or both)
sta199 %>%
  mutate(comp_soph = ifelse(year == "Sophomore" | major_category == "compsci only", 1, 0)) %>%
  summarize(mean_stats = mean(comp_soph))
## # A tibble: 1 × 1
## mean_stats
## <dbl>
## 1 0.466
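By hand, this "or" probability follows from the addition rule. The 23 students who are both sophomores and "compsci only" appear in both the Sophomore row total and the "compsci only" column total, so they are subtracted once to avoid double counting:

```latex
\begin{align*}
P(\text{Sophomore or compsci only})
  &= P(\text{Sophomore}) + P(\text{compsci only}) - P(\text{Sophomore and compsci only}) \\
  &= \frac{98}{247} + \frac{40}{247} - \frac{23}{247} = \frac{115}{247} \approx 0.466
\end{align*}
```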
# Proportion of students who are both sophomores and in "compsci only"
sta199 %>%
  mutate(comp_soph = ifelse(year == "Sophomore" & major_category == "compsci only", 1, 0)) %>%
  summarize(mean_stats = mean(comp_soph))
## # A tibble: 1 × 1
## mean_stats
## <dbl>
## 1 0.0931
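The joint probability comes from a single cell of the contingency table: the Sophomore row and "compsci only" column hold 23 students, out of 247 in total.

```latex
\[
P(\text{Sophomore and compsci only}) = \frac{23}{247} \approx 0.0931
\]
```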