Lab #03: Data Wrangling

Due January 28 at 11:59 PM

Goals

Getting started

Go to https://classroom.github.com/a/OI7-0717 to create your private repository for Lab #03 on GitHub. Then follow the steps provided in lab and lecture to clone the repo and create a new project in RStudio.

Open the lab03.Rmd template and update the YAML header with your name and today’s date. Then, knit the document and make sure the resulting PDF file has the correct date. Stage, commit, and push your changes.

Write your answers in the lab03.Rmd template file. Your assignment should have at least three meaningful commits and all code chunks should have informative names.

Packages and Data

We will begin by loading the tidyverse package as usual.

library(tidyverse)

Let’s take a trip to the Midwest

The dataset we will examine is called midwest. It ships with ggplot2, so it is available as soon as the tidyverse is loaded, and it contains demographic information about counties in five midwestern states.

To begin, familiarize yourself with the dataset by reading the documentation. Remember, you can pull up the documentation by running ?midwest in the console.

  1. Which state has the largest population in the 2000 census? Please determine this by using an uninterrupted pipeline with three dplyr commands where you sum the population of all counties within each state and then order the states from least to greatest in population.
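As a reminder of what an uninterrupted three-verb pipeline looks like, here is a minimal sketch on a different built-in dataset (diamonds, which also ships with ggplot2). The dataset, the column names, and the object name carat_by_cut are illustrative choices, not part of the lab, so you will need to adapt the pattern to midwest yourself.

```r
library(tidyverse)

# Illustration only: total carat weight per cut quality in `diamonds`.
# `carat_by_cut` is a hypothetical name chosen for this sketch.
carat_by_cut <- diamonds %>%
  group_by(cut) %>%                        # one group per cut quality
  summarise(total_carat = sum(carat)) %>%  # collapse each group to a single row
  arrange(total_carat)                     # order from least to greatest

carat_by_cut
```

Because the pipeline is never broken by an intermediate assignment, each verb receives the result of the one before it.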

Now would be a good time to knit, stage, commit, and push.

  2. What are the three most populous counties in Wisconsin? Using a single, uninterrupted pipeline, please return a 3 x 2 tibble that lists the name and the population of each county, starting with the most populous county in Wisconsin, followed by the second and then the third most populous.

  3. What is the mean population density of counties within a metropolitan area compared with those that are not in a metropolitan area? How many counties fall into each group? Using a single, uninterrupted pipeline, please return this information. (Hint: You will want to begin by using if_else to create a new variable that labels each group in words, based on the corresponding numerical variable in the data set.)
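The hint above can be sketched on a different dataset. The example below uses the built-in mtcars data, where am is a numeric 0/1 indicator (1 means manual transmission); the new variable name transmission and the object name mpg_by_transmission are assumptions made for this illustration, and the lab itself requires the equivalent steps on midwest.

```r
library(tidyverse)

# Illustration only: recode a numeric 0/1 indicator into words, then
# report a group mean and a group count in one uninterrupted pipeline.
mpg_by_transmission <- mtcars %>%
  mutate(transmission = if_else(am == 1, "manual", "automatic")) %>%
  group_by(transmission) %>%
  summarise(
    mean_mpg = mean(mpg),  # average fuel economy within each group
    n_cars   = n()         # number of rows (cars) in each group
  )

mpg_by_transmission
```

Labeling the groups in words first makes the summary table readable without looking up what 0 and 1 mean.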

Now might be a good time for another knit, stage, commit, and push.

  4. Which five counties in the Midwest have the highest percentage of residents with at least a college degree (percollege)? Return a 5 x 3 tibble that lists the county name, the state, and the percentage of residents with a college degree. What do three of these counties have in common that might explain why they are on this list? (Hint: You may want to use Google to answer this question.)

  5. Some county names occur in more than one of these Midwest states. Are there any that occur in all five states? (You can assume that no county name occurs more than once within a state.) Please return a tibble with the county name and the number of occurrences (i.e., five) for every county name that occurs in all five states.

One more exercise, but first, knit, stage, commit, and push!

  6. Which states have the most counties that are at least relatively diverse?

Please create a segmented bar chart with one bar per state, each bar going from 0 to 1, with the fill showing the proportion of counties in that state that are at least 10 percent non-white. Please include informative labels and use best practices of data visualization. What do you notice? Which states have the most diverse counties? Which have the fewest?

Note: Before making the visualization, you will need to use existing variables to create a new variable and then use a pipeline into ggplot code.
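The general shape of "create a variable, then pipe into ggplot" can be sketched as follows on the built-in mpg dataset. The 25 mpg cutoff, the variable name efficiency, the labels, and the object name p are all hypothetical choices for this illustration; the lab requires you to build the analogous variable from the midwest columns.

```r
library(tidyverse)

# Illustration only: derive a two-level variable, then pipe straight into
# ggplot. position = "fill" stretches every bar to run from 0 to 1, so the
# fill shows the within-bar proportion rather than the raw count.
p <- mpg %>%
  mutate(efficiency = if_else(hwy >= 25, "25 mpg or more", "Under 25 mpg")) %>%
  ggplot(aes(x = class, fill = efficiency)) +
  geom_bar(position = "fill") +
  labs(
    title = "Share of car models reaching 25 highway mpg, by class",
    x = "Vehicle class",
    y = "Proportion of models",
    fill = "Highway fuel economy"
  )

p
```

Note that the pipe (%>%) hands the data to ggplot, while the plot layers are joined with +.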

Once you are fully satisfied with your lab, Knit to PDF to create a PDF document.

Follow the instructions in previous labs to submit your PDF to Gradescope.

Be sure to identify which problems are on each page using Gradescope.

Grading

Overall: 50 pts.