Main Ideas
- There are different types of variables.
- Visualizations and summaries of variables must be consistent with the variable type.
You will probably not need to do this any more at this stage, but if you do, please configure git by running the following code in the terminal. Fill in your GitHub username and the email address associated with your GitHub account.
git config --global user.name 'username'
git config --global user.email 'useremail'
Next, load the tidyverse package. Recall that a package is just a bundle of shareable code.
library(tidyverse)
There are two types of variables: numeric and categorical.
Numeric variables can be classified as either continuous or discrete. Continuous numeric variables can take any of the infinitely many values between any two values. Discrete numeric variables take a countable number of values, such as whole-number counts.
Categorical variables can be classified as either nominal or ordinal. Ordinal variables have a natural ordering; nominal variables do not.
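As a rough illustration of how these four types tend to show up in R, the sketch below builds one small vector of each kind. The values and the `condition` variable are invented for illustration and are not taken from the course data. Note that R's storage type is only a hint: `read_csv()` typically reads whole-number columns such as bedroom counts as doubles, so deciding whether a variable is discrete or continuous still takes judgment.

```r
# Hypothetical values, invented for illustration only.
price     <- c(215000.50, 189999.99, 342500.00)      # numeric, continuous
beds      <- c(3L, 2L, 4L)                           # numeric, discrete (integer counts)
community <- c("Longfellow", "Camden", "Southwest")  # categorical, nominal
condition <- factor(c("fair", "good", "excellent"),
                    levels = c("fair", "good", "excellent"),
                    ordered = TRUE)                  # categorical, ordinal

# Check how R stores each vector.
is.double(price)
is.integer(beds)
is.character(community)
is.ordered(condition)

# Order comparisons are meaningful only for the ordered factor.
condition[1] < condition[3]
```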
To describe the distribution of a numeric variable, we will use the properties below. The R function that computes each one is given in parentheses.
- Center: mean (mean), median (median)
- Spread: range (range), standard deviation (sd), interquartile range (IQR)
We will continue our investigation of home prices in Minneapolis, Minnesota.
mn_homes <- read_csv("mn_homes.csv")
Add a glimpse to the code chunk below and identify the following variables as numeric continuous, numeric discrete, categorical ordinal, or categorical nominal.
glimpse(mn_homes$community)
## chr [1:495] "Calhoun-Isles" "Longfellow" "Longfellow" "Southwest" "Camden" ...
The summary command is also useful for looking at numeric variables. Use it to look at the numeric variables from the previous chunk.
summary(mn_homes$beds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.087 4.000 7.000
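summary() reports the center and the quartiles, but not the standard deviation or the interquartile range directly. As a minimal sketch, each property can also be computed with its own function. The vector below is a small made-up set of bedroom counts (the name `beds_demo` is hypothetical), not the real mn_homes data.

```r
# Hypothetical bedroom counts, invented for illustration only.
beds_demo <- c(1, 2, 3, 3, 3, 4, 4, 7)

mean(beds_demo)    # arithmetic mean
median(beds_demo)  # middle value of the sorted data
range(beds_demo)   # smallest and largest value
sd(beds_demo)      # sample standard deviation
IQR(beds_demo)     # third quartile minus first quartile
```

The mean and standard deviation are pulled toward the single large value, while the median and IQR are much less affected by it.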
We can use a histogram to summarize a numeric variable.
ggplot(data = mn_homes,
mapping = aes(x = salesprice)) +
geom_histogram(bins = 25)
A density plot is another option. Roughly speaking, it replaces the bars of a histogram with a smooth curve.
ggplot(data = mn_homes,
mapping = aes(x = salesprice)) +
geom_density()
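To see how the two views relate, one option is to draw both on the same axes. The sketch below assumes ggplot2 is installed and uses simulated prices (the `price_demo` data frame is invented, not the course data). `after_stat(density)` rescales the histogram bars so they share the density curve's vertical scale.

```r
library(ggplot2)

# Simulated, right-skewed prices, invented for illustration only.
set.seed(1)
price_demo <- data.frame(salesprice = rlnorm(200, meanlog = 12.3, sdlog = 0.4))

p_overlay <- ggplot(data = price_demo,
                    mapping = aes(x = salesprice)) +
  # Put the histogram on the density scale instead of raw counts.
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 25) +
  # Draw the smoothed density estimate on top of the bars.
  geom_density()

p_overlay
```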
Side-by-side boxplots are helpful to visualize the distribution of a numeric variable across the levels of a categorical variable.
ggplot(data = mn_homes,
mapping = aes(x = community, y = salesprice)) +
geom_boxplot() + coord_flip() +
labs(title = "Sales Price by Community", x = "Community", y = "Sales Price")
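Question: What is coord_flip() doing in the code chunk above? Try removing it to see.
coord_flip() flips the x- and y-axes, so the communities run along the vertical axis and the sales prices along the horizontal axis. Labels set with labs() follow the x and y aesthetics rather than the on-screen position, so after flipping, check that each axis is still labelled correctly.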
Bar plots allow us to visualize categorical variables.
ggplot(data = mn_homes) +
geom_bar(mapping = aes(x = community)) +
coord_flip() +
labs(title = "Homes by Community", x = "Community", y = "Number of Homes")
Segmented bar plots can be used to visualize two categorical variables.
library(viridis)
## Loading required package: viridisLite
ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) +
geom_bar() +
coord_flip() +
scale_fill_viridis(discrete=TRUE, option = "D", name="Fireplace?") +
labs(title = "Fireplaces by Community", x = "Community", y = "Number of Homes")
ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) +
geom_bar(position = "fill") + coord_flip() +
scale_fill_viridis(discrete=TRUE, option = "D", name="Fireplace?") +
labs(title = "Percentage of Homes with a Fireplace by Community", x =
"Community", y = "Percentage of Homes")
Question: Which of the above two visualizations do you prefer? Why? Is this answer always the same?
Each of these has advantages. The first shows raw counts, while the second shows the proportion of homes with and without a fireplace within each community. Depending on which information would be most useful to the reader, you might choose one over the other.
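The same contrast can be seen numerically. As a minimal sketch with a small invented data frame (`fireplace_demo` and its values are hypothetical), `table()` gives the counts behind the first plot and `prop.table()` gives the within-community proportions behind the second.

```r
# Hypothetical data, invented for illustration only.
fireplace_demo <- data.frame(
  community = c("Camden", "Camden", "Camden", "Camden",
                "Longfellow", "Longfellow"),
  fireplace = c("Yes", "No", "No", "No", "Yes", "No")
)

# Raw counts: what the stacked bar plot displays.
counts <- table(fireplace_demo$community, fireplace_demo$fireplace)
counts

# Row-wise proportions: what position = "fill" displays.
props <- prop.table(counts, margin = 1)
props
```

Here Camden has more homes overall, which only the counts reveal, while the proportions make the two communities directly comparable.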
There is something wrong with each of the plots below. Run the code for each plot, read the error, then identify and fix the problem.
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = salesprice,
shape = 21, size = .85))
ggplot(data = mn_homes, mapping = (x = otsize, y = area)) +
geom_point(, shape = 21, size = .85)
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = area), color=community, size = 0.85)
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = 1otsize, y = area))
Errors in plots: 1) The closing ) of aes() is at the end instead of after y = salesprice, so shape and size are wrongly treated as mappings. 2) aes() is left out (and there are some extra typos: otsize and a stray comma). 3) color = community needs to be inside aes(). 4) The variable name starts with the number “1” instead of the letter “l”.
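For reference, one possible set of corrected versions is sketched below. It assumes ggplot2 is installed and uses a tiny invented stand-in for mn_homes (the `homes_demo` data frame and its values are hypothetical), so the plots themselves are not meaningful; only the syntax is.

```r
library(ggplot2)

# Hypothetical stand-in for mn_homes, invented for illustration only.
homes_demo <- data.frame(
  lotsize    = c(5000, 6200, 4100, 7300),
  area       = c(1400, 1850, 1100, 2300),
  salesprice = c(210000, 265000, 180000, 340000),
  community  = c("Camden", "Longfellow", "Camden", "Southwest")
)

# 1) Close aes() after y = salesprice; shape and size are fixed settings.
p1 <- ggplot(data = homes_demo) +
  geom_point(mapping = aes(x = lotsize, y = salesprice),
             shape = 21, size = 0.85)

# 2) Wrap the mappings in aes(), spell lotsize correctly, drop the stray comma.
p2 <- ggplot(data = homes_demo, mapping = aes(x = lotsize, y = area)) +
  geom_point(shape = 21, size = 0.85)

# 3) Map color to a variable inside aes().
p3 <- ggplot(data = homes_demo) +
  geom_point(mapping = aes(x = lotsize, y = area, color = community),
             size = 0.85)

# 4) Use the letter "l", not the digit "1", in lotsize.
p4 <- ggplot(data = homes_demo) +
  geom_point(mapping = aes(x = lotsize, y = area))
```

The general rule is that anything that varies with a column of the data goes inside aes(), while fixed settings such as a single shape or size go outside it.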
General principles for effective data visualization
Why is data visualization important? We will illustrate using the datasaurus_dozen data from the datasauRus package.
datasaurus_dozen <- read_csv("datasaurus_dozen.csv")
glimpse(datasaurus_dozen)
## Rows: 1,846
## Columns: 3
## $ dataset <chr> "dino", "dino", "dino", "dino", "dino", "dino", "dino", "dino"…
## $ x <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410,…
## $ y <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718,…
The code below calculates the correlation, mean of y, mean of x, standard deviation of y, and standard deviation of x for each of the 13 datasets.
Question: What do you notice?
The correlations differ only slightly (all are weak, between about -0.06 and -0.07), and the other summary statistics agree to the displayed precision across all 13 datasets.
datasaurus_dozen %>%
group_by(dataset) %>%
summarize(r = cor(x, y),
mean_y = mean(y),
mean_x = mean(x),
sd_x = sd(x),
sd_y = sd(y))
## # A tibble: 13 × 6
##    dataset          r mean_y mean_x  sd_x  sd_y
##    <chr>        <dbl>  <dbl>  <dbl> <dbl> <dbl>
##  1 away       -0.0641   47.8   54.3  16.8  26.9
##  2 bullseye   -0.0686   47.8   54.3  16.8  26.9
##  3 circle     -0.0683   47.8   54.3  16.8  26.9
##  4 dino       -0.0645   47.8   54.3  16.8  26.9
##  5 dots       -0.0603   47.8   54.3  16.8  26.9
##  6 h_lines    -0.0617   47.8   54.3  16.8  26.9
##  7 high_lines -0.0685   47.8   54.3  16.8  26.9
##  8 slant_down -0.0690   47.8   54.3  16.8  26.9
##  9 slant_up   -0.0686   47.8   54.3  16.8  26.9
## 10 star       -0.0630   47.8   54.3  16.8  26.9
## 11 v_lines    -0.0694   47.8   54.3  16.8  26.9
## 12 wide_lines -0.0666   47.8   54.3  16.8  26.9
## 13 x_shape    -0.0656   47.8   54.3  16.8  26.9
Let’s visualize the relationships.
ggplot(data = datasaurus_dozen,
mapping = aes(x = x, y = y)) +
geom_point(size = .5) +
facet_wrap( ~ dataset)
Question: Why is visualization important?
Visualization allows us to explore relationships in the data that summary statistics do not uncover. Even though datasets may have the same mean or standard deviation, the individual data points might be arranged very differently.
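The same point can be checked without any extra packages. The sketch below swaps in Anscombe's quartet (the `anscombe` data frame that ships with base R) as a stand-in for the Datasaurus Dozen. Its four x-y pairs have nearly identical means, standard deviations, and correlations, yet plotting them shows four very different patterns. The helper names `idx` and `stats` are chosen here for illustration.

```r
# anscombe ships with base R and has columns x1..x4 and y1..y4.
idx <- 1:4

stats <- data.frame(
  pair   = idx,
  mean_x = sapply(idx, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(idx, function(i) mean(anscombe[[paste0("y", i)]])),
  sd_x   = sapply(idx, function(i) sd(anscombe[[paste0("x", i)]])),
  sd_y   = sapply(idx, function(i) sd(anscombe[[paste0("y", i)]])),
  r      = sapply(idx, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)

# The four rows agree to about two decimal places.
round(stats, 2)
```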
When you are finished, remove eval = FALSE and knit the file to see the changes.
ggplot(data = mn_homes, mapping = aes(x = yearbuilt)) +
geom_histogram(binwidth = 10) +
facet_wrap(~ community) +
labs(x = "Year Built",
title = "Which Communities Have the Oldest Homes?",
subtitle = "Faceted by Community")