Main Ideas
- Data visualization is an extremely effective way to express information and extract meaning from data.
- We can build up an effective visualization systematically layer by layer using a grammar of graphics (ggplot2).
“The simple graph has brought more information to the data analyst’s mind than any other device” - John Tukey
Before we start the exercise, we need to configure git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your GitHub username.
Configure git by running the following code in the terminal. Fill in your GitHub username and the email address associated with your GitHub account.
git config --global user.name 'username'
git config --global user.email 'useremail'
Next, load the tidyverse package. Recall that a package is just a bundle of shareable code.
library(tidyverse)
Exploratory data analysis (EDA) is an approach to analyzing datasets in order to summarize the main characteristics, often with visual representations of the data (today). We can also calculate summary statistics and perform data wrangling, manipulation, and transformation (next week).
We will use ggplot2 to construct visualizations. The gg in ggplot2 stands for “grammar of graphics”, a system or framework that allows us to describe the components of a graphic, building up an effective visualization layer by layer.
We will introduce visualization using data on single-family homes sold in Minneapolis, Minnesota between 2005 and 2015.
Question: What happens when you click the green arrow in the code chunk below? What changes in the “Environment” pane?
This will load the data into RStudio.
mn_homes <- read_csv("mn_homes.csv")
glimpse(mn_homes)
## Rows: 495
## Columns: 13
## $ saleyear <dbl> 2012, 2014, 2005, 2010, 2010, 2013, 2011, 2007, 2013, 20…
## $ salemonth <dbl> 6, 7, 7, 6, 2, 9, 1, 9, 10, 6, 7, 8, 5, 2, 7, 6, 10, 6, …
## $ salesprice <dbl> 690467.0, 235571.7, 272507.7, 277767.5, 148324.1, 242871…
## $ area <dbl> 3937, 1440, 1835, 2016, 2004, 2822, 2882, 1979, 3140, 35…
## $ beds <dbl> 5, 2, 2, 3, 3, 3, 4, 3, 4, 3, 3, 3, 2, 3, 3, 6, 2, 3, 2,…
## $ baths <dbl> 4, 1, 1, 2, 1, 3, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1,…
## $ stories <dbl> 2.5, 1.7, 1.7, 2.5, 1.0, 2.0, 1.7, 1.5, 1.5, 2.5, 1.0, 2…
## $ yearbuilt <dbl> 1907, 1919, 1913, 1910, 1956, 1934, 1951, 1929, 1940, 19…
## $ neighborhood <chr> "Lowry Hill", "Cooper", "Hiawatha", "King Field", "Shing…
## $ community <chr> "Calhoun-Isles", "Longfellow", "Longfellow", "Southwest"…
## $ lotsize <dbl> 6192, 5160, 5040, 4875, 5060, 6307, 6500, 5600, 6350, 75…
## $ numfireplaces <dbl> 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,…
## $ fireplace <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TR…
Question: What does each row represent? Each column?
Rows represent observations while columns represent variables.
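To see the rows-as-observations, columns-as-variables reading on something small, here is a minimal sketch using a made-up data frame of three homes. The name toy_homes and all of its values are hypothetical and are not taken from mn_homes.csv.

```r
# Hypothetical toy data: three made-up homes, not rows from mn_homes.csv.
toy_homes <- data.frame(
  area       = c(1440, 1835, 2016),
  salesprice = c(235000, 272000, 278000),
  fireplace  = c(FALSE, FALSE, TRUE)
)

n_obs  <- nrow(toy_homes)  # one row per home (observation)
n_vars <- ncol(toy_homes)  # one column per recorded variable

# str() gives a compact overview, similar in spirit to glimpse()
str(toy_homes)
```

Each of the three rows is one home, and each of the three columns is one thing we recorded about every home.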
ggplot creates the initial base coordinate system that we will add layers to. We first specify the dataset we will use with data = mn_homes. The mapping argument is paired with an aesthetic (aes), which tells us how the variables in our dataset should be mapped to the visual properties of the graph.
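One way to convince ourselves that ggplot only sets up the base is to count the layers in the plot object before and after adding a geom. This is a sketch on made-up data. The name toy_homes and its values are hypothetical, and peeking at the layers element of a plot object is just a convenient check, not something you need for everyday plotting.

```r
library(ggplot2)

# Hypothetical toy data standing in for mn_homes (values are made up).
toy_homes <- data.frame(
  area       = c(1200, 1500, 1800, 2400),
  salesprice = c(150000, 185000, 210000, 320000)
)

# Base only: the axes are set up from the mapping, but nothing is drawn yet.
base_plot <- ggplot(data = toy_homes,
                    mapping = aes(x = area, y = salesprice))
n_layers_base <- length(base_plot$layers)

# Adding a geom puts one layer on top of the same coordinate system.
point_plot <- base_plot + geom_point()
n_layers_points <- length(point_plot$layers)
```

Typing base_plot or point_plot in the console draws the corresponding figure.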
Question: What does the code chunk below do?
This creates our base layer: an empty coordinate system with area on the x-axis and salesprice on the y-axis. No data are drawn yet because we have not added a geom.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice))
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point()
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
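Run ?geom_smooth in the console. What does this function do?
This adds a smoothed trend line to the plot, with a shaded confidence band around it. By default, for a dataset of this size, it fits a loess curve (a local, moving regression), as the message printed above indicates.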
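The smoothing method can also be chosen explicitly. The sketch below uses made-up data (the names toy, x, and y are hypothetical) to compare a loess curve with a straight-line fit, and uses layer_data() to look at the values the smoother computed.

```r
library(ggplot2)

# Hypothetical toy data: a curved trend plus noise (not the homes data).
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- sqrt(toy$x) + rnorm(50, sd = 0.3)

scatter <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point()

# Spelling out method and formula avoids the informational message.
loess_plot <- scatter + geom_smooth(method = "loess", formula = y ~ x)

# A straight-line fit instead, with the confidence band turned off.
lm_plot <- scatter + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# layer_data() returns the values computed for a layer; layer 2 is the smoother.
lm_fit <- layer_data(lm_plot, 2)
```

The loess curve can bend to follow the data, while method = "lm" always draws a single straight line.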
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
geom_smooth() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The procedure used to construct plots can be summarized using the code below.
ggplot(data = [dataset],
mapping = aes(x = [x-variable], y = [y-variable])) +
geom_xxx() +
geom_xxx() +
other options
Question: What do you think eval = FALSE is doing in the code chunk above?
It tells R not to run the chunk when we knit. The chunk is a template rather than valid R code, so running it would produce an error and the document would fail to knit.
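For comparison, here is one way the template might look once the placeholders are filled in. This is only an illustration on made-up data: toy_homes and its values are hypothetical, and the particular geoms and labels are one choice among many.

```r
library(ggplot2)

# Hypothetical toy data standing in for [dataset].
toy_homes <- data.frame(
  area       = c(1200, 1500, 1800, 2400, 3000),
  salesprice = c(150000, 185000, 210000, 320000, 410000)
)

filled_template <- ggplot(data = toy_homes,
                          mapping = aes(x = area, y = salesprice)) +
  geom_point() +                                 # first geom_xxx()
  geom_smooth(method = "lm", formula = y ~ x) +  # second geom_xxx()
  labs(title = "Toy example",                    # other options
       x = "Area (square feet)", y = "Sales Price (dollars)")
```

Unlike the template, this chunk is valid R, so it can be run and knit without eval = FALSE.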
An aesthetic is a visual property of one of the objects in your plot.
We can map a variable in our dataset to a color, a size, a transparency, and so on. The aesthetics that can be used with each geom_ can be found in the documentation.
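Besides reading the help page, we can ask ggplot2 directly which aesthetics a geom works with. The sketch below inspects GeomPoint, the object behind geom_point(). Treat it as a quick check under the assumption that these fields stay available in your ggplot2 version. The help page remains the reference.

```r
library(ggplot2)

# GeomPoint is the object that geom_point() uses to draw points.
required <- GeomPoint$required_aes        # aesthetics that must be supplied
optional <- names(GeomPoint$default_aes)  # aesthetics that have default values

print(required)
print(optional)
```

Note that ggplot2 stores the color aesthetic under the British spelling colour, although both spellings are accepted when you write code.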
Question: What will the visualization look like below? Write your answer down before running the code.
This makes a scatterplot where area is the x variable and sales price is the y variable. We color the data points based on whether there is a fireplace, using viridis colors.
Here we are going to use the viridis package, which has more color-blind accessible colors. scale_color_viridis specifies which colors you want to use. You can learn more about the options in the viridis package documentation.
Other sources that can be helpful in devising accessible color schemes include Color Brewer, the Wes Anderson package, and the cividis package.
This visualization shows a scatterplot of area (x variable) and sales price (y variable). Using scale_color_viridis, we make points for houses with a fireplace yellow and those without purple. We also add axis labels and an overall title.
library(viridis)
## Loading required package: viridisLite
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice,
color = fireplace)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
scale_color_viridis(discrete = TRUE, option = "D", name="Fireplace?")
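To see where the purple and yellow in the plot above come from, we can ask for two colors from the same palette directly. This sketch uses viridisLite, the package that was loaded alongside viridis. The object name two_colors is our own.

```r
library(viridisLite)

# Two colors from the default viridis palette (option "D"),
# one for each value of a TRUE/FALSE variable.
two_colors <- viridis(2, option = "D")
print(two_colors)

# Label each color with the value it is paired with:
# FALSE comes first, then TRUE.
names(two_colors) <- c("FALSE", "TRUE")
```

The first color is the dark purple end of the palette and the second is the yellow end, which is why homes without a fireplace appear purple and homes with one appear yellow.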
Question: What about this one?
Now we are using shapes instead of colors for whether there is a fireplace.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice,
shape = fireplace)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)",
shape="Fireplace?")
Question: This one?
Now we are coloring the points based on whether there is a fireplace and sizing them based on lot size.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice,
color = fireplace,
size = lotsize)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)",
size = "Lot Size") +
scale_color_viridis(discrete=TRUE, option = "D",name="Fireplace?")
Question: Are the above visualizations effective? Why or why not? How might you improve them?
For a question like this, you would want to support your answer with the best evidence you can find in the plots.
Question: What is the difference between the two plots below?
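The placement of color is key here. Since "blue" is not a variable in the dataset, it belongs outside aes. Inside aes, ggplot treats "blue" as a value to be mapped, so the points get a default color and a legend labeled "blue". Outside aes, the points are actually drawn in blue.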
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = area, y = salesprice, color = "blue"))
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = area, y = salesprice), color = "blue")
Use aes to map variables to plot features, and use arguments in geom_xxx for customization not mapped to a variable.
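The difference between mapping and setting can be checked without looking at the picture. In this sketch, toy_homes is a made-up data frame (hypothetical values), and layer_data() shows the color that ggplot2 actually ends up using for the points in each version.

```r
library(ggplot2)

# Hypothetical toy data standing in for mn_homes.
toy_homes <- data.frame(
  area       = c(1200, 1500, 1800),
  salesprice = c(150000, 185000, 210000)
)

# "blue" inside aes(): treated as a value to map through the color scale.
mapped_plot <- ggplot(data = toy_homes) +
  geom_point(mapping = aes(x = area, y = salesprice, color = "blue"))

# "blue" outside aes(): used directly as the color of the points.
set_plot <- ggplot(data = toy_homes) +
  geom_point(mapping = aes(x = area, y = salesprice), color = "blue")

mapped_colors <- unique(layer_data(mapped_plot)$colour)
set_colors    <- unique(layer_data(set_plot)$colour)
```

Printing set_colors gives "blue", while mapped_colors holds a single default color chosen by the scale, not blue.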
Mappings in the ggplot function are global, meaning they apply to every layer we add. Mappings in a particular geom_xxx function are local, meaning they apply only to that layer.
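A small sketch can make the global versus local distinction concrete. The data frame toy_homes below is made up (hypothetical values). We count how many separate trend lines geom_smooth fits when color is mapped globally and when it is mapped only inside geom_point.

```r
library(ggplot2)

# Hypothetical toy data standing in for mn_homes.
toy_homes <- data.frame(
  area       = c(1200, 1500, 1800, 2400, 2600, 3000),
  salesprice = c(150000, 185000, 210000, 320000, 335000, 410000),
  fireplace  = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Global mapping: both layers see color = fireplace,
# so the smoother fits one line per fireplace group.
global_plot <- ggplot(data = toy_homes,
                      mapping = aes(x = area, y = salesprice,
                                    color = fireplace)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Local mapping: only the points are colored,
# so the smoother fits a single line through all homes.
local_plot <- ggplot(data = toy_homes,
                     mapping = aes(x = area, y = salesprice)) +
  geom_point(mapping = aes(color = fireplace)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smoother; each group corresponds to one fitted line.
n_lines_global <- length(unique(layer_data(global_plot, 2)$group))
n_lines_local  <- length(unique(layer_data(local_plot, 2)$group))
```

Which version is appropriate depends on whether we want to compare trends between groups or describe the overall trend.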
Question: Create a scatterplot of variables of your choosing from the mn_homes data.
Answers will vary here. I have a faceted scatterplot as my example for Exercise 3 below.
Question: Modify your scatterplot above by coloring the points for each community.
We can use smaller plots to display different subsets of the data using faceting. This is helpful to examine conditional relationships.
Let’s try a few simple examples of faceting. Note that these plots should be improved by careful consideration of labels, aesthetics, etc.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_grid(. ~ beds)
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_grid(beds ~ .)
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_grid(beds ~ baths)
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_wrap(~ community)
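- facet_grid() arranges the panels in a two-dimensional grid defined by a row variable and a column variable (rows ~ columns, with . meaning no split in that direction).
- facet_wrap() takes a one-dimensional sequence of panels and wraps it into rows and columns.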
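To compare how the two faceting functions lay out their panels, we can build each plot and inspect its panel layout. This is a sketch on made-up data: toy_homes and its values are hypothetical, and reading the layout out of ggplot_build() is an inspection trick that assumes the built plot exposes it this way in your ggplot2 version.

```r
library(ggplot2)

# Hypothetical toy data: every combination of 3 bed counts and 2 bath counts.
toy_homes <- data.frame(
  area       = c(1200, 1500, 1800, 2400, 2600, 3000),
  salesprice = c(150000, 185000, 210000, 320000, 335000, 410000),
  beds       = c(2, 2, 3, 3, 4, 4),
  baths      = c(1, 2, 1, 2, 1, 2)
)

scatter <- ggplot(data = toy_homes,
                  mapping = aes(x = area, y = salesprice)) +
  geom_point()

grid_plot <- scatter + facet_grid(beds ~ baths)  # rows = beds, columns = baths
wrap_plot <- scatter + facet_wrap(~ beds)        # one panel per value of beds

# One row of the layout table per panel.
grid_layout <- ggplot_build(grid_plot)$layout$layout
wrap_layout <- ggplot_build(wrap_plot)$layout$layout
```

The grid has one panel for every beds and baths combination, while the wrapped plot has one panel per number of bedrooms.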
- Use alpha to make the points more transparent.
- Use the viridis palette. (Note: you can’t do all of these things at once in terms of color; these are just suggestions.)
When you are finished, remove eval = FALSE and knit the file to see the changes.
Here is some starter code:
ggplot(data = mn_homes,
mapping = aes(x = lotsize, y = salesprice)) +
geom_point(color = "green", alpha = 0.5) +
labs(title = "Price and Size of Lots", x = "Lot Size", y = "Price in Dollars")
- Make a histogram of lotsize.
- Set fill = "blue" inside the geom_histogram() function.
- Set color = "red" inside the geom_histogram() function.
When you are finished, remove eval = FALSE and knit the file to see the changes.
ggplot(data = mn_homes,
mapping = aes(x = lotsize)) +
geom_histogram(fill = "blue", color = "red") +
labs(title = "Histogram of Lot Size" , x = "Size of Lot", y = "Number of Homes")
Question: What is the difference between the color and fill arguments?
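One way to explore this question is to look at what ggplot2 records for each bar. The sketch below uses a made-up vector of lot sizes (toy_homes and its values are hypothetical) and a small number of bins chosen only to keep the example short.

```r
library(ggplot2)

# Hypothetical toy data standing in for the lotsize column.
toy_homes <- data.frame(
  lotsize = c(4800, 5000, 5100, 5600, 6200, 6300, 6500, 7500)
)

hist_plot <- ggplot(data = toy_homes,
                    mapping = aes(x = lotsize)) +
  geom_histogram(fill = "blue", color = "red", bins = 4)

# Each row of the layer data describes one bar.
bars <- layer_data(hist_plot)
bar_fill    <- unique(bars$fill)    # interior of each bar
bar_outline <- unique(bars$colour)  # border drawn around each bar
```

For bars, fill controls the inside of the shape and color controls its outline. For points and lines, color is usually the property that matters.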
ggplot(data = mn_homes,
mapping = aes(x = saleyear, y = salesprice)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ community) +
labs(title = "Are Homes Becoming More Expensive?", subtitle = "Faceted by community",
x = "Year", y = "Price in Dollars")
## `geom_smooth()` using formula 'y ~ x'