Homework #02: Data Wrangling and Joins

Thursday, February 3, 11:59 PM

Goals

For this assignment you must have at least three meaningful commits and all of your code chunks must have informative names.

For your first commit, update your author name in the YAML header of the template R Markdown file.

All plots should follow the best visualization practices discussed in lecture, including an informative title, labeled axes, and careful consideration of aesthetic choices.

All code should follow the tidyverse style guidelines, including keeping every line within the 80-character limit.

For every join function, you should explicitly specify the by argument.
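As a minimal sketch of what an explicit by argument looks like, the example below joins two toy tibbles whose key columns have different names. The tibbles, column names, and values are invented for illustration and are not the assignment's data. It loads dplyr directly, which library(tidyverse) also attaches.

```r
library(dplyr)

# Toy data; these column names are hypothetical, not the assignment's.
schools <- tibble(
  school = c("School A", "School B", "School C"),
  state = c("NC", "CA", "NC")
)
states <- tibble(
  state_abb = c("NC", "CA"),
  population = c(100, 200)
)

# The key columns are named differently, so `by` maps the left-hand
# name to the right-hand name instead of relying on a guessed match.
joined <- left_join(schools, states, by = c("state" = "state_abb"))
joined
```

When both tables use the same key name, by = "state" is enough. Spelling it out either way avoids the message dplyr prints when it has to guess the join columns.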

Setup

College Rankings and State Characteristics

We will work with the tidyverse package as usual. You may also want to use viridis.

library(tidyverse)
library(viridis)

The U.S. News rankings are an influential but controversial metric in the college application process.

A brief description of the data sets for this lab and how they are related to each other is provided below.

The natunivs data set contains all schools in the National Universities category ranked 50 or above in the current (2022) rankings. Data on school rankings comes from Andy Reiter, with several blank values filled in by the professor. Observations are uniquely identified by school.

The variables in this data set are:

The slacs data set contains all schools in the National Liberal Arts Colleges category ranked 50 or above in the current (2022) rankings. Data on school rankings comes from Andy Reiter. Observations are uniquely identified by school.

The variables in this data set are:

The presvote_pop data set contains four variables related to the characteristics of a state:

The Trump and Biden votes variables come from the CQ Voting and Elections Collection, which was accessed through the Duke Library. The 2020 population data comes from the US Census.

Looking at this data

  1. Let’s start by creating an analysis data set that includes information from all three data sets.

The final full_data data frame should have 107 observations and 8 variables.

We will use full_data for the remainder of the assignment. (Please note that there are more than 100 observations total due to ties at 50.)
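One plausible way to combine two school-level tables with a state-level table is to stack the school tables and then join on the state key. The sketch below shows that pattern on toy tibbles. Every name and value in it is hypothetical, and the real data sets may need a different combination of steps.

```r
library(dplyr)

# Toy stand-ins for two ranking tables and one state table.
# All column names and values are invented for illustration.
univs <- tibble(school = c("U1", "U2"), state = c("NC", "CA"), rank = c(1, 2))
colleges <- tibble(school = c("C1", "C2"), state = c("MA", "NC"), rank = c(1, 2))
state_info <- tibble(state = c("NC", "CA", "MA", "WY"), pop = c(10, 40, 7, 1))

# Stack the school tables, recording which table each row came from,
# then attach the state-level columns to every school.
combined <- bind_rows(univs = univs, colleges = colleges, .id = "type") %>%
  left_join(state_info, by = "state")

# Compare the dimensions with what you expect before moving on.
dim(combined)
```

Checking dim() right after the join is a quick way to catch an unintended key match that duplicates or drops rows.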

  2. Which states have the most schools in the full_data data set? Find the number of schools by state. Then, order these states from greatest to least and return the 5 states with the most schools on the list. Please report these 5 states.

  3. Which states do not have a school in the full_data data set? Use the presvote_pop data set and an appropriate join to help answer this question. Return a data set with two variables, state abbreviations and state population, in order from greatest population to least. Show all code and output, and print the state abbreviations and populations. What is the state with the largest population that does not have a school in the full_data data set?

  4. Please recreate the plot below. Use a dplyr command to create the variable noting which presidential candidate won the state in 2020. After recreating the plot, please discuss what patterns you observe.

  5. Is there a relationship between the population of a state and the number of schools it has in the full_data data set? To answer this question, first use the code from exercise 2 to create a data set called counts with the counts of the number of schools by state for the 31 states with at least one school in the full_data data set.

  6. Let’s now focus on North Carolina schools in the full_data data set. For these schools, create a new variable that indicates the change in ranking in 2022 compared to 2021, where a positive value indicates an increased ranking (e.g., if a school went from 21 to 20, you would want this variable to have a value of 1). Finally, return a tibble that shows the name of the NC schools and the new variable you created. Please discuss what you observe.

  7. Do the politics and populations of states where national universities are located differ from those where national liberal arts colleges are located?
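The exercises above lean on a handful of dplyr verbs: counting rows per group, finding unmatched rows with a filtering join, labeling rows with a conditional, and computing a difference between two columns. The sketch below shows each on toy tibbles. All column names (rank_2021, votes_a, and so on) and values are hypothetical, not those of the assignment's data sets.

```r
library(dplyr)

# Toy data; names and values are invented for illustration.
schools <- tibble(
  school = c("S1", "S2", "S3", "S4"),
  state = c("NC", "NC", "CA", "MA"),
  rank_2021 = c(21, 10, 5, 3),
  rank_2022 = c(20, 12, 5, 1)
)
states <- tibble(
  state = c("NC", "CA", "MA", "WY"),
  votes_a = c(50, 70, 60, 20),
  votes_b = c(55, 30, 25, 60)
)

# Number of schools per state, largest count first.
counts <- schools %>%
  count(state, sort = TRUE)

# States with no school: anti_join keeps left rows with no match.
no_school <- anti_join(states, schools, by = "state")

# Label each state by whichever candidate received more votes.
winners <- states %>%
  mutate(winner = if_else(votes_a > votes_b, "A", "B"))

# Positive change means the school moved up (21 to 20 gives 1).
changes <- schools %>%
  mutate(change = rank_2021 - rank_2022) %>%
  select(school, change)
```

Subtracting the newer rank from the older one gives a positive number when a school moves up, because a smaller rank is a better rank.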

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo. Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where the answer to each exercise is located. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.

Rubric