Goals

Gain proficiency in data visualization
Apply principles of effective data visualization to a real dataset
Practice data science workflow using git and GitHub

For this assignment you must have at least three commits and all of your code chunks must have meaningful names.

For your first commit, update your author name in the YAML header of the template R Markdown file.

Clone assignment repo + start new project

You can access the assignment at https://classroom.github.com/a/qRRhVDce.
Clone the repository and open a new project in RStudio. See the earlier lab and lecture for additional instructions.

Packages

We will work with the tidyverse package as usual. We will also use the viridis and ggridges packages.

library(tidyverse)
library(viridis)
library(ggridges)

anes <- read_csv("anes2020_subset.csv")

Data

The data for this homework assignment comes from the 2020 American National Election Study.

A subset of variables is provided here. Some of them have been recoded, while others you may need to recode before you can carry out your analysis. The variables are as follows:

CASEID: a case ID for the respondent.
hunt_fish: a dummy variable indicating whether the respondent has gone hunting or fishing in the past year.
scientists: a feeling thermometer question that asks how warmly respondents feel towards scientists. A score of 0 represents the coolest rating, while a score of 100 represents the warmest rating.
education: an ordinal categorical variable representing the respondent’s highest level of education, ranging from less than high school to a professional degree.
ideology: a seven-point self-rating scale for the respondent’s ideology, ranging from most liberal to most conservative.
urbanrural: a variable indicating how rural or urban the respondent’s home community is, with four possible values: rural, small town, suburb, or city.

All plots should follow the best visualization practices discussed in lecture. Plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

In addition, code and narrative should not exceed the 80 character limit. See the Lab #01 instructions for setting a vertical line at 80 characters in your R Markdown file.

How many rows are in the anes dataset? How many columns? Please include code and output to support your response.
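
As a rough illustration of how to inspect the size of a data frame, the sketch below uses a small made-up tibble named toy (a hypothetical stand-in, not the anes data). It assumes only base R and tidyverse functions.

```r
library(tidyverse)

# Hypothetical toy data standing in for a real dataset
toy <- tibble(
  id = 1:5,
  score = c(70, 85, 50, 100, 60),
  group = c("a", "b", "a", "b", "a")
)

nrow(toy)     # number of rows
ncol(toy)     # number of columns
dim(toy)      # rows and columns together
glimpse(toy)  # dimensions plus a preview of each column
```

Any one of these calls, applied to your own data frame, gives output you can cite in your written answer.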

Now would be a good time to do your first knit, commit, and push.

Create a bar chart showing the ideology of the respondents, with the count on the y-axis. Please remember to include labels. What is the most common ideology? Do respondents tend to be moderate or more ideologically extreme?
Now, let’s examine whether ideologies differ based upon where people live. Please make a filled bar plot showing one bar for each ideology, with the proportion of respondents on the y-axis (ranging from 0 to 1) and the fill determined by urbanrural. You are encouraged but not required to use viridis colors. Please remember to include labels.

Where do people of different ideologies tend to live? Does the percentage of non-responses (i.e., respondents whose urbanrural value is NA) vary much by ideology?
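
The difference between a count bar chart and a filled bar chart can be sketched on made-up data. The tibble toy and its columns diet and home are hypothetical and unrelated to the assignment's variables; the sketch assumes that position = "fill" and scale_fill_viridis(discrete = TRUE) are the tools you want, which is one reasonable option rather than the only one.

```r
library(tidyverse)
library(viridis)

# Hypothetical toy data, unrelated to the assignment's variables
toy <- tibble(
  diet = c("omnivore", "omnivore", "omnivore",
           "vegetarian", "vegetarian", "vegan"),
  home = c("city", "rural", "city", "city", "suburb", NA)
)

# Count bars: geom_bar() tallies the rows in each category
count_plot <- ggplot(toy, aes(x = diet)) +
  geom_bar() +
  labs(
    title = "Diet of six toy respondents",
    x = "Diet",
    y = "Number of respondents"
  )

# Filled bars: position = "fill" rescales every bar to total 1
fill_plot <- ggplot(toy, aes(x = diet, fill = home)) +
  geom_bar(position = "fill") +
  scale_fill_viridis(discrete = TRUE) +
  labs(
    title = "Home community by diet (toy data)",
    x = "Diet",
    y = "Proportion of respondents",
    fill = "Home community"
  )

count_plot
fill_plot
```

Note that the missing value in home is kept as its own segment of the filled bar, so non-responses remain visible in the plot.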

Now would be a good time to knit, commit, and push again.

How do people view scientists? Please make a histogram with the feeling thermometer on the x-axis and the number of respondents on the y-axis. Then, please comment on features of the histogram such as skewness and peaks.
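
A minimal histogram sketch on made-up ratings follows. The tibble toy and the choice of binwidth = 10 are assumptions made for illustration; pick a bin width that suits your own data.

```r
library(tidyverse)

# Hypothetical ratings on a 0-100 scale
toy <- tibble(rating = c(0, 15, 50, 50, 60, 70, 85, 85, 85, 100))

hist_plot <- ggplot(toy, aes(x = rating)) +
  # boundary = 0 aligns bin edges with 0, 10, 20, ...
  geom_histogram(binwidth = 10, boundary = 0) +
  labs(
    title = "Distribution of ten toy ratings",
    x = "Rating (0 = coolest, 100 = warmest)",
    y = "Number of respondents"
  )

hist_plot
```
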
Does the ideology of those who have gone hunting or fishing in the past year differ from those who haven’t? Please make side-by-side boxplots of these two groups.

You should start your code with:

anes %>%
  drop_na(hunt_fish) %>%
  mutate(hunted_fished = ifelse(hunt_fish == 0,
                                "Did Not Hunt or Fish",
                                "Hunted or Fished")) %>%

Please note that drop_na removes observations that are NA for the hunt_fish variable.

Then construct side-by-side ridgeline density plots using geom_density_ridges() from the ggridges package. You can read more about ridgeline plots in the ggridges package documentation.

Please describe what you observe in both plots and what you learn from one plot that you do not see in the other or that adds additional context to the other.
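
To see how the two plot types are specified differently, here is a sketch on simulated data. The tibble toy, its columns, and the simulated values are all hypothetical. The main point is the aesthetic mapping: a boxplot puts the grouping variable on the x-axis, while geom_density_ridges() expects the numeric variable on x and the grouping variable on y.

```r
library(tidyverse)
library(ggridges)

# Simulated toy data: two groups with different centers
set.seed(1)
toy <- tibble(
  group = rep(c("Group A", "Group B"), each = 50),
  value = c(rnorm(50, mean = 3), rnorm(50, mean = 5))
)

# Boxplot: group on x, numeric variable on y
box_plot <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot() +
  labs(title = "Toy values by group", x = "Group", y = "Value")

# Ridgeline: numeric variable on x, group on y
ridge_plot <- ggplot(toy, aes(x = value, y = group)) +
  geom_density_ridges() +
  labs(title = "Toy values by group", x = "Value", y = "Group")

box_plot
ridge_plot
```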

Is there a relationship between a respondent’s education level and how they view scientists? Please make a scatterplot of these two variables with education on the x-axis and view of scientists on the y-axis. Then, include a line of best fit with the option method = "lm". Do you think this is an especially useful visualization? Why or why not?
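
The mechanics of adding a line of best fit can be sketched on made-up data. The tibble toy and its columns hours and score are hypothetical; the explicit formula = y ~ x argument is optional and only makes the default fit visible in the code.

```r
library(tidyverse)

# Hypothetical toy data with two numeric variables
toy <- tibble(
  hours = c(1, 2, 3, 4, 5, 6),
  score = c(52, 55, 61, 64, 70, 71)
)

lm_plot <- ggplot(toy, aes(x = hours, y = score)) +
  geom_point() +
  # method = "lm" fits a straight line by least squares
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(
    title = "Toy scores by hours studied",
    x = "Hours studied",
    y = "Score"
  )

lm_plot
```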

Now would be a good time to knit, commit, and push again.

For this problem, you should again make a scatterplot where you look at the relationship between education as the x-variable and respondents’ views of scientists as the y-variable.

There are a lot of data points in this dataset. For this exercise, you are going to begin by taking a sample, using the code below:

set.seed(18)
anes2 <- anes %>%
  sample_frac(.10)

This code takes a random subset of the dataset. Including set.seed() makes sure that it is the same subset each time.
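
The effect of the seed can be checked directly. The sketch below uses a hypothetical 100-row tibble named toy and draws the same 10% sample twice, resetting the seed before each draw.

```r
library(tidyverse)

# Hypothetical toy data with 100 rows
toy <- tibble(id = 1:100)

set.seed(18)
first_draw <- toy %>%
  sample_frac(.10)

set.seed(18)
second_draw <- toy %>%
  sample_frac(.10)

nrow(first_draw)                    # 10 rows, i.e. 10% of 100
identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

Without the second set.seed() call, the two draws would generally differ.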

Make a scatterplot using this subset and facet by whether the person hunted or fished in the past year, with labels in words identifying which group the subplot represents.

Then, please add a geom_smooth() with method = "lm" for each group and add the argument se = FALSE to omit the confidence bands surrounding the line. Describe what you observe.
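
One way to get facet labels in words is to recode the 0/1 indicator into a descriptive character column before plotting. The sketch below does this on simulated data; the tibble toy, the indicator member, and the label text are all hypothetical.

```r
library(tidyverse)

# Simulated toy data with a 0/1 indicator
set.seed(2)
toy <- tibble(
  member = rep(c(0, 1), each = 20),
  x = rep(1:20, times = 2),
  y = x * ifelse(member == 1, 2, 1) + rnorm(40)
) %>%
  # Recode the indicator so each facet strip reads in words
  mutate(member_label = ifelse(member == 1, "Member", "Non-member"))

facet_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  # One fitted line per facet; se = FALSE hides the bands
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ member_label) +
  labs(
    title = "Toy relationship between x and y by membership",
    x = "x",
    y = "y"
  )

facet_plot
```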

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Upload only your PDF document to Gradescope. Before you submit the uploaded document, mark the page where the answer to each exercise appears. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.

Grading

Total: 50 pts.

Exercise 1: 2 pts
Exercise 2: 6 pts
Exercise 3: 6 pts
Exercise 4: 4 pts
Exercise 5: 6 pts
Exercise 6: 10 pts
Exercise 7: 10 pts
Workflow and formatting: 6 pts
- Workflow and formatting includes the number of commits made, naming chunks, updating the name on the assignment to your name, and using tidyverse style throughout.