For this assignment you must have at least three meaningful commits and all of your code chunks must have informative names.
For your first commit, update your author name in the YAML header of the template R Markdown file.
All plots should follow the best visualization practices discussed in lecture, including an informative title, labeled axes, and careful consideration of aesthetic choices.
All code should follow the tidyverse style guidelines, including the 80-character line limit.
For every join function, you should explicitly specify the by argument.
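As a minimal sketch of what an explicit by argument looks like, the toy tibbles below are joined on key columns that have different names. All object names and values here are invented for illustration and are not the homework data.

```r
library(dplyr)

# Toy data, invented for illustration only
pets <- tibble(owner = c("Ana", "Ben"), pet = c("cat", "dog"))
towns <- tibble(name = c("Ana", "Ben"), town = c("Durham", "Cary"))

# by = c("owner" = "name") matches pets$owner to towns$name,
# so the join never has to guess which columns are the keys
pets_towns <- left_join(pets, towns, by = c("owner" = "name"))
pets_towns
```

Naming the key also documents your intent for anyone reading the code, which is one reason the requirement above applies to every join.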
Click the link below to create your repo for Homework #02: https://classroom.github.com/a/i8v_Huuh.
Clone the repository, open a new project in RStudio and configure Git.
We will work with the tidyverse package as usual. You may also want to use viridis.
library(tidyverse)
library(viridis)
The U.S. News rankings are an influential but controversial metric in the college application process.
A brief description of the data sets for this lab and how they are related to each other is provided below.
The natunivs data set contains all schools in the National Universities category ranked 50 or above in the current (2022) rankings. Data on school rankings comes from Andy Reiter, with several blank values filled in by the professor. Observations are uniquely identified by school.
The variables in this data set are:
school: The name of the college or university.
state: The state in which the college or university is located.
rank_2022: The school’s rank in the 2022 issue.
rank_2021: The school’s rank in the 2021 issue.
natuniv_slac: A variable identifying the type of school.
The slacs data set contains all schools in the National Liberal Arts Colleges category ranked 50 or above in the current (2022) rankings. Data on school rankings comes from Andy Reiter. Observations are uniquely identified by school.
The variables in this data set are:
school: The name of the college or university.
state: The state in which the college or university is located.
rank_2022: The school’s rank in the 2022 issue.
rank_2021: The school’s rank in the 2021 issue.
natuniv_slac: A variable identifying the type of school.
The presvote_pop data set contains four variables related to the characteristics of a state:
abbrev: The state’s abbreviation.
trump_votes: The number of votes received by Donald Trump in 2020.
biden_votes: The number of votes received by Joe Biden in 2020.
2020_pop: The state’s population in the 2020 census.
The Trump and Biden votes variables come from the CQ Voting and Elections Collection, which was accessed through the Duke Library. The 2020 population data comes from the US Census.
First, join the slacs data set to the natunivs data set. The goal is to combine these data sets so that the new data set contains every row from both data sets and the same columns as each individual data set. Call this new data set full_data.
Next, use a join to add the columns from the presvote_pop data set to full_data.
The final full_data data frame should have 107 observations and 8 variables.
We will use full_data for the remainder of the assignment. (Please note that there are more than 100 observations total due to ties at 50.)
Which states have the most schools in the full_data data set? Find the number of schools by state. Then order the states from greatest to least, return the 5 states with the most schools on the list, and report these 5 states.
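The count, sort, and keep-the-top-rows pattern described above can be sketched on toy data as follows. The tibble and its column name are invented for illustration; this is one possible approach, not the required one.

```r
library(dplyr)

# Toy data, invented for illustration only
fruit <- tibble(kind = c("apple", "pear", "apple", "plum", "apple", "pear"))

fruit_counts <- fruit %>%
  count(kind) %>%        # one row per kind, with the count in column n
  arrange(desc(n)) %>%   # largest counts first
  slice_head(n = 2)      # keep only the first two rows
fruit_counts
```

Note that slice_head() cuts off at a fixed number of rows, so ties at the cutoff are dropped. Check your counts before reporting a top-5 list.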
Which states do not have a school in the full_data data set? Use the presvote_pop data set and an appropriate join to help answer this question. Return a data set with two variables, state abbreviation and state population, in order from greatest population to least. Show all code and output, and print the state abbreviations and populations. What is the state with the largest population that does not have a school in the full_data data set?
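One way to find rows in one table that have no match in another is a filtering join. The sketch below uses anti_join() from dplyr on toy tibbles. The names regions and shops and all values are invented for illustration.

```r
library(dplyr)

# Toy data, invented for illustration only
regions <- tibble(code = c("A", "B", "C", "D"), size = c(10, 30, 20, 40))
shops <- tibble(region = c("A", "A", "C"))

# anti_join() keeps the rows of regions whose code never appears in shops
no_shops <- anti_join(regions, shops, by = c("code" = "region")) %>%
  arrange(desc(size))    # largest size first
no_shops
```

Unlike a mutating join, a filtering join adds no columns from the second table. It only decides which rows of the first table to keep.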
Please recreate the plot below. Use a dplyr command to create the variable noting which presidential candidate won the state in 2020. After recreating the plot, please discuss what patterns you observe.
full_data data set. To answer this question, first use the code from exercise 2 to create a data set called counts with the counts of the number of schools by state for the 31 states with at least one school in the full_data data set. Then, use a join to add counts to the full_data data set and make a scatter plot with a state’s 2020 population as the x-axis variable and the number of schools as the y-axis variable. Add a line of best fit using the method = lm option. Please include an informative title and axis labels.
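A scatter plot with a least-squares line can be sketched as below. The data frame and its columns are invented for illustration. Only the geom_smooth(method = lm) call corresponds to the option named above.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 60, 61, 70, 78)
)

p <- ggplot(toy, aes(x = hours, y = score)) +
  geom_point() +
  geom_smooth(method = lm) +   # straight line fit by least squares
  labs(
    title = "Score increases with hours studied (toy data)",
    x = "Hours studied",
    y = "Score"
  )
p
```

By default geom_smooth() also draws a confidence band around the line, which can help when judging how closely the points follow it.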
Finally, please describe what you observe. Is there a positive or negative relationship? Do most points fall near the line?
Let’s now focus on North Carolina schools in the full_data data set. For these schools, create a new variable that indicates the change in ranking in 2022 compared to 2021, where a positive value indicates an improved ranking (e.g., if a school went from 21 to 20, this variable should have a value of 1). Finally, return a tibble that shows the names of the NC schools and the new variable you created. Please discuss what you observe.
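Because a smaller rank number is better, the sign convention above means subtracting the newer rank from the older one. A minimal sketch on toy data, with invented names and values:

```r
library(dplyr)

# Toy data, invented for illustration only
teams <- tibble(
  team = c("Reds", "Blues", "Greens"),
  place_last = c(21, 5, 9),
  place_now = c(20, 5, 12)
)

teams_change <- teams %>%
  mutate(change = place_last - place_now) %>%  # positive = moved up
  select(team, change)
teams_change
```

Here Reds moved from 21 to 20 and gets a change of 1, matching the example in the exercise text.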
Do the politics and populations of states where national universities are located differ from those where national liberal arts colleges are located?
Using the full_data data set, please create a new variable for the percentage of the vote Joe Biden received in a state in 2020. (Note: You can ignore third-party votes here since they are not in the data set.)
Then, please calculate the mean Biden vote percentage and mean population of each group of schools and discuss what you find.
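Computing a two-way percentage and then averaging it within groups can be sketched as follows. The tibble polls and its columns are invented for illustration and stand in for any pair of vote counts.

```r
library(dplyr)

# Toy data, invented for illustration only
polls <- tibble(
  group = c("x", "x", "y", "y"),
  yes = c(60, 40, 30, 10),
  no = c(40, 60, 70, 30)
)

group_means <- polls %>%
  mutate(pct_yes = 100 * yes / (yes + no)) %>%  # share of the two-way total
  group_by(group) %>%
  summarize(mean_pct_yes = mean(pct_yes))
group_means
```

The mean here is taken over rows, so in a data set with one row per school, a state with many schools counts once per school.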
Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo. Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark the pages that contain your answer to each exercise. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.