Main Ideas

Coming Up

Lecture Notes and Exercises

You probably no longer need to do this at this stage, but if you do, configure git by running the following commands in the terminal. Fill in your GitHub username and the email address associated with your GitHub account.

git config --global user.name 'username'
git config --global user.email 'useremail'

Next, load the tidyverse package. Recall that a package is a bundle of shareable code.

library(tidyverse)

There are two types of variables: numeric and categorical.

Types of variables

Numeric variables can be classified as either continuous or discrete. A continuous variable can take any value in an interval, so there are infinitely many possible values between any two values. A discrete variable takes values from a countable set, such as whole-number counts.

  • height (continuous)
  • number of siblings (discrete)

Categorical variables can be classified as either nominal or ordinal. Ordinal variables have a natural ordering; nominal variables do not.

  • hair color (nominal)
  • education (ordinal)
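The four variable types can be represented in R roughly as follows. This is a minimal sketch using made-up values; the object names and data are hypothetical and do not come from mn_homes.

```r
# Hypothetical example values, not taken from mn_homes
height   <- c(162.5, 170.2, 181.0)               # numeric continuous (double)
siblings <- c(0L, 2L, 1L)                        # numeric discrete (integer)
hair     <- factor(c("brown", "black", "red"))   # categorical nominal
education <- factor(
  c("high school", "college", "graduate"),
  levels  = c("high school", "college", "graduate"),
  ordered = TRUE                                 # categorical ordinal
)

class(height)
class(siblings)
is.ordered(hair)
is.ordered(education)

# Comparisons are only meaningful for ordered factors
education[1] < education[3]
```

Storing an ordinal variable as an ordered factor lets R keep the levels in their natural order in tables and plots, rather than alphabetically.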

Numeric Variables

To describe the distribution of a numeric variable, we will use the properties below.

  • shape
    • skewness: right-skewed, left-skewed, symmetric
    • modality: unimodal, bimodal, multimodal, uniform
  • center: mean (mean), median (median)
  • spread: range (range), standard deviation (sd), interquartile range (IQR)
  • outliers: observations outside the pattern of the data
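The center and spread functions listed above can be tried on a small vector. This is a sketch on hypothetical, right-skewed values (not from mn_homes). The 1.5 × IQR rule used at the end is one common convention for flagging outliers and is an assumption here, not something these notes prescribe.

```r
# Hypothetical right-skewed sample (e.g., prices in thousands of dollars)
prices <- c(150, 180, 200, 210, 230, 250, 600)

mean(prices)     # center: pulled upward by the one large value
median(prices)   # center: not affected by how large the largest value is
range(prices)    # spread: minimum and maximum
sd(prices)       # spread: standard deviation
IQR(prices)      # spread: third quartile minus first quartile

# One common convention for flagging outliers: more than 1.5 * IQR
# below the first quartile or above the third quartile
q <- quantile(prices, c(0.25, 0.75))
fence <- 1.5 * IQR(prices)
outliers <- prices[prices < q[1] - fence | prices > q[2] + fence]
outliers
```

Because the sample is right-skewed, the mean is larger than the median, which is why the median and IQR are often preferred for skewed distributions.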

We will continue our investigation of home prices in Minneapolis, Minnesota.

mn_homes <- read_csv("mn_homes.csv")

Add a glimpse to the code chunk below and identify the following variables as numeric continuous, numeric discrete, categorical ordinal, or categorical nominal.

  • area
  • beds
  • community

glimpse(mn_homes$community)
##  chr [1:495] "Calhoun-Isles" "Longfellow" "Longfellow" "Southwest" "Camden" ...

The summary function is also useful for examining numeric variables. Use it to look at the numeric variables from the previous chunk.

summary(mn_homes$beds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.087   4.000   7.000

We can use a histogram to summarize a numeric variable.

ggplot(data = mn_homes, 
       mapping = aes(x = salesprice)) + 
   geom_histogram(bins = 25)

A density plot is another option. It can be thought of as a smoothed version of a histogram: instead of bars, it draws a smooth curve whose total area is 1.

ggplot(data = mn_homes, 
       mapping = aes(x = salesprice)) + 
   geom_density()
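One way to see how the two plots relate is to draw them on the same axes. The sketch below uses simulated data as a hypothetical stand-in for salesprice, and assumes ggplot2 3.3.0 or later for after_stat().

```r
library(ggplot2)

# Simulated, hypothetical data standing in for salesprice
set.seed(1)
demo <- data.frame(price = rlnorm(500, meanlog = 12, sdlog = 0.4))

p <- ggplot(data = demo, mapping = aes(x = price)) +
  # after_stat(density) rescales the bars so their total area is 1,
  # putting the histogram on the same scale as the density curve
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 25) +
  geom_density()
p
```

Changing bins in the histogram changes its appearance noticeably, while the density curve does not depend on bin edges (it has its own smoothing parameter instead).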

Side-by-side boxplots are helpful to visualize the distribution of a numeric variable across the levels of a categorical variable.

ggplot(data = mn_homes, 
       mapping = aes(x = community, y = salesprice)) + 
  geom_boxplot() + 
  coord_flip() + 
  labs(title = "Sales Price by Community", x = "Community", y = "Sales Price")

Question: What is coord_flip() doing in the code chunk above? Try removing it to see.

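coord_flip() swaps the x- and y-axes, so community is drawn on the vertical axis and sales price on the horizontal axis. Labels set with labs() stay attached to their aesthetics, so after adding or removing coord_flip(), check that each axis is still labelled with the right variable.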

Categorical Variables

Bar plots allow us to visualize categorical variables.

ggplot(data = mn_homes) + 
  geom_bar(mapping = aes(x = community)) + 
  coord_flip() +
  labs(title = "Homes by Community", x = "Community", y = "Number of Homes")

Segmented bar plots can be used to visualize two categorical variables.

library(viridis)
## Loading required package: viridisLite
ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) + 
  geom_bar() +
  coord_flip() + 
  scale_fill_viridis(discrete=TRUE, option = "D", name="Fireplace?") +
  labs(title = "Fireplaces by Community", x = "Community", y = "Number of Homes")

ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) + 
  geom_bar(position = "fill") + coord_flip() + 
  scale_fill_viridis(discrete=TRUE, option = "D", name="Fireplace?") +
  labs(title = "Percentage of Homes with a Fireplace by Community", 
       x = "Community", y = "Percentage of Homes")

Question: Which of the above two visualizations do you prefer? Why? Is this answer always the same?

Each of these has advantages. The first shows raw counts, so you can see how many homes are in each community. The second shows proportions within each community, which makes communities of different sizes easier to compare. Choose whichever conveys the information most useful to the reader.
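The numbers behind the two kinds of bar plot can be computed directly. This is a sketch with a tiny hypothetical data frame (the community labels "A" and "B" and all values are made up), using the base R functions table() and prop.table().

```r
# Hypothetical data standing in for community and fireplace in mn_homes
homes <- data.frame(
  community = c("A", "A", "A", "A", "B", "B"),
  fireplace = c("yes", "no", "no", "no", "yes", "no")
)

# Raw counts, which a stacked bar plot displays
counts <- table(homes$community, homes$fireplace)
counts

# Proportions within each community (each row sums to 1),
# which position = "fill" displays
props <- prop.table(counts, margin = 1)
props
```

Community A has more homes overall, but a smaller share of them have a fireplace; the counts show the first fact and the proportions show the second.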

There is something wrong with each of the plots below. Run the code for each plot, read the error, then identify and fix the problem.

ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = lotsize, y = salesprice,
                           shape = 21, size = .85))
ggplot(data = mn_homes, mapping = (x = otsize, y = area)) + 
  geom_point(, shape = 21, size = .85)
ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = lotsize, y = area), color=community, size = 0.85)
ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = 1otsize, y = area))

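Errors in plots:

  1. The closing ) of aes() is misplaced. It should come right after y = salesprice, so that shape and size are set outside aes().
  2. The mapping is missing aes(). There are also extra typos: otsize should be lotsize, and geom_point() has a stray leading comma.
  3. color = community needs to be inside aes().
  4. The variable name starts with the number “1” instead of the letter “l”: 1otsize should be lotsize.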
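For reference, one possible corrected version of each plot is sketched below. To keep the example self-contained it uses a small hypothetical data frame called homes with the same column names; with the real data, replace homes with mn_homes.

```r
library(ggplot2)

# Hypothetical stand-in for mn_homes with the columns used above
homes <- data.frame(
  lotsize    = c(4000, 5200, 6100, 4800),
  salesprice = c(210000, 260000, 330000, 245000),
  area       = c(1100, 1500, 2100, 1300),
  community  = c("Camden", "Longfellow", "Southwest", "Camden")
)

# 1) Constants such as shape and size go outside aes()
p1 <- ggplot(data = homes) +
  geom_point(mapping = aes(x = lotsize, y = salesprice),
             shape = 21, size = .85)

# 2) Wrap the mapping in aes(), spell lotsize correctly, drop the stray comma
p2 <- ggplot(data = homes, mapping = aes(x = lotsize, y = area)) +
  geom_point(shape = 21, size = .85)

# 3) Mapping a variable to color must happen inside aes()
p3 <- ggplot(data = homes) +
  geom_point(mapping = aes(x = lotsize, y = area, color = community),
             size = 0.85)

# 4) Use the letter "l", not the digit "1", in lotsize
p4 <- ggplot(data = homes) +
  geom_point(mapping = aes(x = lotsize, y = area))
```

The general rule: anything that depends on a column of the data goes inside aes(), and anything that is a fixed value for every point goes outside it.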

General principles for effective data visualization

  • keep it simple
  • use color effectively
  • tell a story

Why is data visualization important? We will illustrate using the datasaurus_dozen data from the datasauRus package.

datasaurus_dozen <- read_csv("datasaurus_dozen.csv")
glimpse(datasaurus_dozen)
## Rows: 1,846
## Columns: 3
## $ dataset <chr> "dino", "dino", "dino", "dino", "dino", "dino", "dino", "dino"…
## $ x       <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410,…
## $ y       <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718,…

The code below calculates the correlation, mean of y, mean of x, standard deviation of y, and standard deviation of x for each of the 13 datasets.

Question: What do you notice?

The correlations differ slightly across datasets (all are weak and negative, between about -0.06 and -0.07), while the means and standard deviations agree to the precision shown.

datasaurus_dozen %>% 
   group_by(dataset) %>%
   summarize(r = cor(x, y), 
             mean_y = mean(y),
             mean_x = mean(x),
             sd_x = sd(x),
             sd_y = sd(y))
## # A tibble: 13 × 6
##    dataset          r mean_y mean_x  sd_x  sd_y
##    <chr>        <dbl>  <dbl>  <dbl> <dbl> <dbl>
##  1 away       -0.0641   47.8   54.3  16.8  26.9
##  2 bullseye   -0.0686   47.8   54.3  16.8  26.9
##  3 circle     -0.0683   47.8   54.3  16.8  26.9
##  4 dino       -0.0645   47.8   54.3  16.8  26.9
##  5 dots       -0.0603   47.8   54.3  16.8  26.9
##  6 h_lines    -0.0617   47.8   54.3  16.8  26.9
##  7 high_lines -0.0685   47.8   54.3  16.8  26.9
##  8 slant_down -0.0690   47.8   54.3  16.8  26.9
##  9 slant_up   -0.0686   47.8   54.3  16.8  26.9
## 10 star       -0.0630   47.8   54.3  16.8  26.9
## 11 v_lines    -0.0694   47.8   54.3  16.8  26.9
## 12 wide_lines -0.0666   47.8   54.3  16.8  26.9
## 13 x_shape    -0.0656   47.8   54.3  16.8  26.9

Let’s visualize the relationships.

ggplot(data = datasaurus_dozen, 
       mapping = aes(x = x, y = y)) + 
   geom_point(size = .5) + 
   facet_wrap( ~ dataset)

Question: Why is visualization important?

Visualization allows us to explore relationships in data that summary statistics do not uncover. Even when datasets have the same means, standard deviations, and correlation, the individual data points can be arranged very differently.
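The same point can be checked without any external files using anscombe, a classic four-dataset example that ships with base R. Note that this is a different dataset from datasaurus_dozen, swapped in here only because it needs no download; the sketch below is an illustration, not part of the original exercise.

```r
# anscombe has columns x1..x4 and y1..y4, one (x, y) pair per dataset
stats <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  r      = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)
round(stats, 2)   # the four rows are nearly identical

# Plot the four sets in a 2 x 2 grid; the shapes differ clearly
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old_par)
```

As with the datasaurus data, the summary table alone gives no hint that the four scatterplots look so different.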

Practice

  1. Modify the code outline to create a faceted histogram examining the distribution of year built within each community.

When you are finished, remove eval = FALSE and knit the file to see the changes.

ggplot(data = mn_homes, mapping = aes(x = yearbuilt)) +
  geom_histogram(binwidth = 10) +
  facet_wrap(~ community) +
  labs(x = "Year Built", 
      title = "Which Communities Have the Oldest Homes?",
      subtitle = "Faceted by Community")