This material is heavily inspired by (stolen from?)
Thomas Lin Pedersen’s ggplot2-workshop (Software Engineer at RStudio, co-author of ggplot2, author of gganimate)
Hadley Wickham (2016) ggplot2: Elegant graphics for data analysis, 2nd edition, ISSN 2197-5744
Further resources:
Main objective: Hands on ggplot
What is ggplot2?
# install.packages("ggplot2")
library(ggplot2)
ggplot(faithful, aes(x=waiting, y=eruptions)) +
geom_point() +
labs(title="ggplot2 default")
Theory of data visualisation: how to draw information to convey a (non-misleading) message
Base R plotting
plot(x=faithful$waiting, y=faithful$eruptions, main="base r default")
library(plotly)
plot_ly(data = faithful, x = ~waiting, y = ~eruptions) %>%
layout(title="plotly default")
many other packages/attempts
ggplot extensions
ggplot2 is heavily inspired by the grammar of graphics, a general approach to plotting described in Wilkinson L (2005) The grammar of graphics. Statistics and computing, 2nd edn. Springer, New York.
The grammar of graphics distinguishes several elements
All of these elements are represented in some form in the R package ggplot2.
A first basic plot:
ggplot(data = faithful,
mapping = aes(x = eruptions,y = waiting)) +
geom_point()
Which elements have been used?
Answer: All elements
Note the ‘+’ to combine different parts of the graphic (the layers)
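To see why the answer is "all elements", the defaults that ggplot2 fills in can be written out by hand. The following is a minimal sketch, not the usual way to write the plot. It assumes the defaults of a current ggplot2 version (identity stat and position for points, continuous position scales, Cartesian coordinates, a single panel, the grey theme); the name p_explicit is chosen only for illustration.

```r
library(ggplot2)

# The same scatter plot with every grammar element spelled out.
p_explicit <- ggplot(data = faithful,
                     mapping = aes(x = eruptions, y = waiting)) +
  layer(geom = "point",          # geometry
        stat = "identity",       # statistic: plot the data as is
        position = "identity") + # position adjustment: none
  scale_x_continuous() +         # scales
  scale_y_continuous() +
  coord_cartesian() +            # coordinate system
  facet_null() +                 # facets: a single panel
  theme_grey()                   # theme

p_explicit
```

In everyday use, geom_point() is the shortcut for the layer() call, and the remaining lines can be dropped because they are the defaults.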
# same plot - alternative specification
ggplot(data = faithful) +
geom_point(mapping = aes(x = waiting,y = eruptions),
colour= "blue")
# local (here) vs. global (above) aesthetics
Use ?geom_point, the ggplot2 cheatsheet, or vignette("ggplot2-specs").
Look at g and g_built (the result of ggplot_build(g)) with the View() function. Can you find size? It’s in the ‘data’ part of one of the objects.
g <- ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions,y = waiting, size=eruptions/waiting))
No output - result of ggplot() captured in object g. What’s in g? Can you find ‘size’?
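One way to go looking for ‘size’ is to build the plot explicitly. This is a minimal sketch: ggplot_build() and layer_data() are exported ggplot2 functions, while the names g_built and built_points are only chosen here for illustration.

```r
library(ggplot2)

g <- ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting,
                           size = eruptions / waiting))

# g only stores the recipe: size is still an unevaluated expression
g$layers[[1]]$mapping

# Building the plot evaluates the aesthetics and applies the scales
g_built <- ggplot_build(g)

# layer_data() returns the 'data' part of the built plot for one layer
built_points <- layer_data(g, i = 1)
head(built_points[, c("x", "y", "size")])
```

The size column in the built data holds point sizes after scaling, not the raw eruptions/waiting ratio.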
df <- data.frame(x = c(1,2,5,4,5), y = c(9, 1, 9,3,4))
base <- ggplot(df, aes(x, y))
base + geom_point()
base + geom_line()
base + geom_path()
base + geom_polygon()
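base + geom_rect(aes(xmin = x, xmax = x + 1, ymin = y, ymax = y + 1))  # geom_rect() needs corner aesthetics; the unit offsets are an arbitrary choice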
base + geom_ribbon(aes(ymin=x,ymax=y))
base + geom_label(aes(label=paste0("(",x,",",y,")")))
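library(dplyr)
faithful %>% slice(10)  # slice() resets the row name
## eruptions waiting
## 1 4.35 85
faithful[10, ]  # base R indexing keeps the original row name
## eruptions waiting
## 10 4.35 85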
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions,y = waiting),
data = faithful %>% filter(eruptions > 3 & eruptions < 3.1), colour="red", size=10) +
geom_point(mapping = aes(x = eruptions,y = waiting))
Data is not necessarily plotted ‘as is’. Some graphs need calculations before drawing.
ggplot(data = faithful) +
geom_histogram(mapping = aes(x = eruptions, y = after_stat(density)), fill="steelblue", color="black")
# The default grey bars look pretty ugly, hence fill="steelblue" and color="black"
In order to draw the histogram we provide one vector of data (eruptions). There must be a statistic that calculates the height of the bars.
Task: Which specific statistic has been used? check out ?geom_histogram ! Use the stat_* function instead of the geom_* function.
Insight: In most cases there is a one-to-one mapping of (default) stat and (default) geom in ggplot, but people tend to specify the geom_* rather than the stat_*
Task: Which other stats does that statistic provide? Plot ‘density’ instead of ‘count’.
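For comparison, here is a minimal sketch of the geom_* and stat_* routes side by side. It relies on stat_bin() being the default statistic of geom_histogram(), as documented in ?geom_histogram. The bins = 30 argument is set explicitly only to silence the default-bins message, and the object names are illustrative.

```r
library(ggplot2)

# geom_histogram() uses stat_bin() by default ...
p_geom <- ggplot(faithful) +
  geom_histogram(aes(x = eruptions), bins = 30)

# ... and stat_bin() draws with geom "bar" by default
p_stat <- ggplot(faithful) +
  stat_bin(aes(x = eruptions), geom = "bar", bins = 30)

d_geom <- layer_data(p_geom)
d_stat <- layer_data(p_stat)

# Both routes compute the same bin counts
all.equal(d_geom$count, d_stat$count)

# Computed variables that survive into the built data, usable via after_stat()
intersect(c("count", "density", "ncount", "ndensity", "width"), names(d_stat))
```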
Say there are two regimes in eruptions, high and low. How can we mark them in different colors?
# in many cases there is a one-to-one mapping of (default) stat and (default) geom in ggplot
ggplot(data = faithful) +
geom_histogram(mapping = aes(x = eruptions, y = after_stat(density), fill= eruptions > 3.1),color="black")
It’s always nice to overlay a histogram with a density estimate. We add another layer (which brings its own stat_*).
ggplot(data = faithful) +
geom_histogram(mapping = aes(x = eruptions, y = after_stat(density), fill= eruptions > 3.1),color="black") +
geom_density(mapping = aes(x = eruptions))
Wait - something’s wrong here. The histogram area should integrate to 1, and so should the area under the empirical density estimate. But the density curve is much lower… Why?
ggplot(data = faithful) +
geom_histogram(mapping = aes(x = eruptions, y = after_stat(density),
fill= eruptions > 3.1),
color="black") +
geom_density(mapping = aes(x = eruptions))
Task: Try out, then explain:
ggplot(data = faithful) +
geom_histogram(mapping = aes(x = eruptions, y = after_stat(density),
fill = eruptions > 4.5),
color="black") +
geom_density(mapping = aes(x = eruptions))
ggplot(data = faithful, aes(x = eruptions)) +
geom_histogram(mapping = aes(y = after_stat(density), fill = eruptions > 3.1),
color="black") +
geom_density()
ggplot(data = faithful, aes(x = eruptions)) +
geom_histogram(mapping = aes(y = after_stat(density),
fill = after_stat(x) > 3.1),
color="black") +
geom_density()
Insight: Mapping a discrete variable to colour, shape or fill always creates a grouping (like group_by() in the dplyr package), and the statistics are then computed per group. One way out is to assign the colour after the statistics have been computed, using the after_stat() function. The manual tells you which computed variables are available, but you can also look at the built graph (i.e. the result of ggplot_build()).
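The effect of the grouping can be checked numerically on the built plot. This is a minimal sketch; the helper bar_area() is a hypothetical name introduced here, and bins = 30 is set explicitly only to silence the default-bins message.

```r
library(ggplot2)

# Fill mapped on the raw data: two groups, stat_bin() runs once per group
p_grouped <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(aes(y = after_stat(density), fill = eruptions > 3.1),
                 bins = 30)

# Fill mapped after the statistic: one group, stat_bin() runs once
p_after <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(aes(y = after_stat(density), fill = after_stat(x) > 3.1),
                 bins = 30)

# Total bar area = sum over bins of density * bin width
bar_area <- function(p) {
  d <- layer_data(p)
  sum(d$density * (d$xmax - d$xmin))
}

bar_area(p_grouped)  # close to 2: each of the two groups integrates to 1
bar_area(p_after)    # close to 1
```

The grouped histogram therefore has twice the area of the density curve, which is why the curve looks too low next to it.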
Which elements have been specified?
Facets map the data into sub-samples. Try out and explain.
ggplot(iris) +
geom_point(aes(Sepal.Length,Sepal.Width,color=Species)) +
facet_wrap( ~ Species)
ggplot(iris) +
geom_point(aes(Sepal.Length,Sepal.Width,color=Species)) +
facet_wrap( ~ Species, scales="free")
ggplot(iris) +
geom_point(aes(Sepal.Length,Sepal.Width)) +
geom_smooth(data=iris, aes(Sepal.Length,Sepal.Width), method="lm") +
facet_wrap( ~ Species)
ggplot(iris) +
geom_point(aes(Sepal.Length,Sepal.Width,color=Species)) +
geom_smooth(aes(Sepal.Length,Sepal.Width,color=Species), method="lm", se=FALSE) +
geom_smooth(aes(Sepal.Length,Sepal.Width,color=Species,group=1), method="lm", se=FALSE, color="black")
ggplot(iris) +
geom_point(aes(Sepal.Length,Sepal.Width)) +
geom_smooth(aes(Sepal.Length,Sepal.Width,group=1), method="lm") +
facet_wrap( ~ Species)
ggplot(iris,aes(Sepal.Length,Sepal.Width)) +
geom_point() +
geom_smooth(data=iris, aes(Sepal.Length,Sepal.Width),method="lm", fullrange=FALSE) +
facet_grid( ~ Species)
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_point(aes(color=Species)) +
geom_smooth(data=select(iris,-Species), method="lm",se=FALSE) +
geom_smooth(data=iris, aes(color=Species),method="lm",se=FALSE) +
facet_grid( ~ Species)
Insight: Facets create real data subsets, which are harder to overcome than colour groupings. (And you may lose a lot of time, or need good knowledge of ggplot2, if you want to do everything within ggplot2.)
Which elements have been specified?
Scales map data into image values. Every aesthetic variable has a scale! A continuous variable in the data, say ‘eruptions’, becomes ‘x’ in the plot through an identity mapping. A categorical variable, say ‘Species’, becomes ‘color_var’ through a mapping of categories into colors.
Scale function names (mostly) consist of three parts: the prefix scale, the name of the aesthetic, and the type of scale, e.g. scale_x_continuous().
ggplot(iris) +
geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
Any scale can be altered (in several ways):
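For instance, the name, breaks, labels and limits of a scale are common things to change. The sketch below shows these arguments on the iris plot; the particular titles, break positions and limits are illustrative assumptions, not values from the workshop.

```r
library(ggplot2)

p_scaled <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  scale_x_continuous(name = "Sepal length (cm)",       # axis title
                     breaks = c(5, 6, 7),               # tick positions
                     labels = c("five", "six", "seven")) + # tick labels
  scale_y_continuous(name = "Sepal width (cm)",
                     limits = c(2, 4.5)) +              # visible data range
  scale_colour_discrete(name = "Iris species")          # legend title

p_scaled
```

Note that data outside the limits of a scale are dropped before plotting, which is different from zooming with coord_cartesian().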
We focus on colors, probably the most complicated scales in practice. Coloring graphs is a world of its own.
Discrete colors
# assign colors by hand
ggplot(iris) +
geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
scale_colour_discrete(type = c("red","blue","green"))
# note the order of the factor levels
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
ggplot(iris) +
geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
scale_colour_discrete(type = c(virginica="red",setosa="blue",versicolor="green"))
If there are more categories, it may be good to decide on a palette (a set of colours with which to paint). RColorBrewer is a well-known package that provides multiple palettes. Note that there are many others out there and, in principle, you can design your own.
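A palette is, in the end, just a vector of colours. As a minimal sketch, the RColorBrewer package itself can be used to look at one; the choice of "Set2" and of three colours is an assumption made for illustration.

```r
library(RColorBrewer)

# Three colours from the qualitative "Set2" palette, as hex codes
set2 <- brewer.pal(n = 3, name = "Set2")
set2

# Overview of all palettes shipped with RColorBrewer
display.brewer.all()
```

Such a vector can be handed to any ggplot2 scale that accepts explicit colour values.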
# assign colors with palettes
# example on RColorBrewer, far more palettes are available and you can design your own
At least 5 ways of achieving the same thing. That happens often with ggplot2 (shortcuts and alternatives are abundant).
# most verbose
ggplot(iris) +
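geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
scale_colour_manual(values = RColorBrewer::brewer.pal(n = 3, name = "Set2"))  # assumed call: palette colours spelled out by hand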
# less verbose
ggplot(iris) +
geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
scale_colour_discrete(type=scale_colour_brewer, palette="Set2")
# least verbose
ggplot(iris) +
geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
scale_colour_brewer(palette = "Set2")
# note colour vs. fill
ggplot(faithful, aes(x = eruptions)) +
geom_histogram(aes(y=after_stat(density), fill=eruptions < 3.1),colour="black") +
scale_fill_brewer(palette = "Set2")
# alternative via scale_colour_manual - again tweaking everything else...
ggplot(iris) +
geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
scale_colour_manual(values = c("red","blue","green"),
breaks = c("virginica","setosa","versicolor"),
name = "my species")
Continuous colors
Categorical variables need only as many colors as there are categories. Coloring continuous data requires a function that returns a color for each value.
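Such a function can be built by hand with the scales package, which ggplot2 uses for its scales. The following is a minimal sketch of the idea, not a description of ggplot2 internals; the names yellow_to_red, ratio and ratio_colours are chosen here for illustration.

```r
library(scales)

# A palette function: maps numbers in [0, 1] to colours between two endpoints
yellow_to_red <- seq_gradient_pal(low = "yellow", high = "red")

ratio <- faithful$eruptions / faithful$waiting

# Rescale the data to [0, 1], then look up a colour for every value
ratio_colours <- yellow_to_red(rescale(ratio))
head(ratio_colours)
```

Every observation gets its own hex colour, so nearby ratio values end up with similar shades.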
# continuous
ggplot(faithful) +
geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting))
# default...
ggplot(faithful) +
geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
scale_colour_continuous()
# change colors with gradient
ggplot(faithful) +
geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
scale_colour_continuous(type="gradient", low="yellow",high="red")
# directly use gradient fct
ggplot(faithful) +
geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
scale_colour_gradient(low="yellow", high="red")
# use colour brewer again... with palette Spectral
ggplot(faithful) +
geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
scale_colour_continuous(type=scale_colour_distiller, palette="Spectral")
# The distiller scales extend brewer scales by smoothly interpolating 7
# colours from any palette to a continuous scale.
# The fermenter scales provide binned versions of the brewer scales.
# check out.
ggplot(faithful) +
geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
scale_colour_continuous(type=scale_colour_fermenter, palette="Spectral")
Other scales, such as size and shape, work the same way as colours but are simpler.
Task: In faithful, vary sizes by eruptions/waiting and shapes by regime (eruptions > 3.1).
ggplot(faithful) +
geom_point(aes(x = waiting, y = eruptions)) +
## Warning: Removed 27 rows containing missing values (geom_point).
Which elements have been used?
Scales map from data space into aesthetic space, whereas guides map visual properties back to data.
ggplot(faithful) +
geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
scale_colour_continuous(type=scale_colour_distiller, palette="Spectral") +
guides(x=guide_axis(title="waiting (min)"),
y=guide_axis(title="eruptions (min)"),  # eruption time in faithful is in minutes
colour=guide_colorbar(title = "ratio"))
ggplot(faithful) +
geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
scale_x_continuous(name="wait",breaks=c(50,70,90) ,labels=c("fifty","seventy","ninety")) +
scale_colour_continuous(type=scale_colour_distiller, palette="Spectral") +
guides(x=guide_axis(title="waiting (min)"),
y=guide_axis(title="eruptions (min)"),  # eruption time in faithful is in minutes
colour=guide_colorbar(title = "ratio"))
TODO - expand.
There are multiple themes available (check the cheatsheet); these days theme_minimal() seems to be popular. Once a theme is added, you may alter the positioning of the different graphical elements (e.g. the legend) and what they look like (fonts, sizes, etc.). Note that this is done through basic functions such as element_text(), which bundle all settings for a certain element.
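A minimal sketch of this pattern: first a complete theme, then individual tweaks via theme() and the element_*() functions. The title text and the particular settings are illustrative assumptions.

```r
library(ggplot2)

p_theme <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  labs(title = "Iris sepals") +
  theme_minimal() +                      # complete theme first
  theme(legend.position = "bottom",      # then move the legend
        plot.title = element_text(face = "bold", size = 16),
        axis.title = element_text(colour = "grey30"))

p_theme
```

The order matters: a complete theme such as theme_minimal() added after the theme() call would overwrite the tweaks.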
TODO - expand.
# install.packages("palmerpenguins")
library(palmerpenguins)
data(package = 'palmerpenguins')
g <- ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g)) +
geom_point()  # assumption: a plain scatter layer as the starting point
Task: You should now be in a position to turn the figure above into the one below by using the cheatsheet and the manual (?theme)
Colors are: c("darkorange","darkorchid","cyan4")