1 Introduction

1.1 Disclaimer

This material is heavily inspired (stolen?) from

1.2 Prerequisites

  • basic knowledge of R (as from the ‘RIntro’ course)

1.3 Course objective

Main objective: Hands on ggplot

Schedule

  • Morning (9:30-12:00) be able to read the ggplot2 cheatsheet
  • Afternoon (14:00-16:00) draw your own plot

What is ggplot2?

  • ggplot2 creates an R object that can be transformed with ‘+’ and then exported on a device.
  • general approach to plotting that follows the ‘grammar of graphics’ of Wilkinson (2005): Wilkinson L (2005) The grammar of graphics. Statistics and computing, 2nd edn. Springer, New York
  • Use case: exploratory data analysis and publication-quality figures
# install.packages("ggplot2")
library(ggplot2)
ggplot(faithful, aes(x=waiting, y=eruptions)) + 
  geom_point() + 
  labs(title="ggplot2 default")
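
The idea of "an R object that is transformed with ‘+’ and then exported on a device" can be sketched as follows. This is a minimal illustration, not part of the original course material; the output file is a temporary file chosen only for the example.

```r
library(ggplot2)

# Start with data and mapping only: this is already a (layer-less) plot object.
p <- ggplot(faithful, aes(x = waiting, y = eruptions))

# Each '+' returns a new, extended plot object.
p <- p + geom_point()
p <- p + labs(title = "built step by step")

# Printing p draws on the current device; ggsave() exports to a file device.
out_file <- tempfile(fileext = ".pdf")  # hypothetical output location
ggsave(out_file, plot = p, width = 5, height = 4)
```

Nothing is drawn until the object is printed or saved, which is why the intermediate assignments produce no output.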

1.4 Not in this course (next one?)

Theory of data visualisation: how to draw information to convey a (non-misleading) message

Base R plotting

  • e.g. plot(x,y,…), is like a piece of paper, things are added on a device (plot window, pdf, html, etc.) until the device is closed.
  • Specific plotting functions, e.g. hist(), barplot(), etc…
  • Use case: still quickest look at the data
plot(x=faithful$waiting, y=faithful$eruptions, main="base r default")

plotly

  • plotly creates interactive web-based graphs (html-widgets) via the open source JavaScript graphing library plotly.js.
  • provides interface to ggplot2 to create interactive ggplot versions
  • expands into dashboards based on dash (written on top of plotly.js)
  • Use case: exploratory data analysis, entering discussion on interpretation
library(plotly)
plot_ly(data = faithful, x = ~waiting, y = ~eruptions) %>%
  layout(title="plotly default")

many other packages/attempts

ggplot extensions

2 Background

ggplot2 is heavily inspired by the grammar of graphics, a general approach to describing and building plots: Wilkinson L (2005) The grammar of graphics. Statistics and computing, 2nd edn. Springer, New York.

The grammar of graphics distinguishes several elements

  • Data
    • information to display
  • Mapping
    • aesthetic mapping: variables in data to graphical properties in the geometry
    • facet mapping: variables in data to panels (facets) in the plot
  • Statistics
    • transform data to values to be displayed (e.g. smooth data, calculate histogram)
  • Scales
    • translate data values into figure properties (e.g. categories into colors)
  • Geometries
    • data displayed as points, lines, bars, etc…
  • Facets
    • look at subsets of data in different panels
  • Coordinates
    • the coordinate system to display position values
  • Theme
    • look-and-feel of the graph

All of these elements are represented in the R package ggplot2.

3 Discover ggplot2

3.1 A first basic plot

A first basic plot:

ggplot(data = faithful,
       mapping = aes(x = eruptions,y = waiting)) +
  geom_point()

Which elements have been used?

  • Data, data.frame faithful
  • Mapping (aes() defines vars to coord.axes)
  • Statistics
  • Scales
  • Geometries, geoms tell how to display data (geom_point() - scatterplot)
  • Facets
  • Coordinates
  • Theme

Answer: All elements

  • some (the minimum) specified by user,
  • all others by default settings.
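
To make the "default settings" visible, the sketch below writes the same scatterplot twice: once in the short form and once with the defaults spelled out. The long form is only an illustration of what ggplot2 fills in; the choice of theme_grey(), coord_cartesian() and facet_null() reflects the package defaults as far as we know.

```r
library(ggplot2)

# Short form: only data, mapping and geom are given.
p_short <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point()

# Long form: the same plot with the usual defaults written out.
p_long <- ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
  layer(geom = "point", stat = "identity", position = "identity") +  # geometry + statistic
  scale_x_continuous() +   # scales
  scale_y_continuous() +
  facet_null() +           # facets: a single panel
  coord_cartesian() +      # coordinates
  theme_grey()             # theme
```

Printing p_short and p_long should give visually identical figures.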

Note the ‘+’ to combine different parts of the graphic (the layers)

# same plot (now with blue points) - alternative specification
ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting),
             colour = "blue")

# local (here) vs. global (above) aesthetics
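
The difference between global and local aesthetics shows up as soon as there is a second layer. A minimal sketch (geom_smooth() is used here only as a convenient second layer):

```r
library(ggplot2)

# Global aesthetics: set in ggplot(), inherited by every layer.
p_global <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Local aesthetics: set inside one layer, so each layer needs its own mapping.
p_local <- ggplot(faithful) +
  geom_point(aes(x = eruptions, y = waiting)) +
  geom_smooth(aes(x = eruptions, y = waiting), method = "lm", formula = y ~ x)
```

Both objects should draw the same figure; the global form is shorter when all layers share the same mapping.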

3.1.1 Exercise - change a geom

  • set ‘shape’ of points to all diamonds
  • set ‘color’ of points to all blue
  • set ‘alpha’ of points (transparency) to all 0.3
  • vary ‘size’ of dots by eruptions/waiting
  • vary ‘color’ of points by ‘eruptions > 3.1’

Use ?geom_point, the ggplot2 cheatsheet, and vignette("ggplot2-specs").

Look at g (defined in Section 3.1.2) and g_built <- ggplot_build(g) with the View() function. Can you find size? It’s in the ‘data’ part of one of the objects.

Insights

  • settings vs. aesthetics
  • x, y, size, etc. are not passed as strings but as expressions evaluated in the data. You can calculate with them!
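
A small sketch of the difference between a setting and an aesthetic, using colour as the example (the object names are made up for illustration):

```r
library(ggplot2)

# Setting: a constant outside aes(); every point is drawn in blue.
p_setting <- ggplot(faithful) +
  geom_point(aes(x = eruptions, y = waiting), colour = "blue")

# Aesthetic: inside aes() the string "blue" is treated as data, i.e. a
# one-level category that the colour scale maps to its own default colour.
p_aesthetic <- ggplot(faithful) +
  geom_point(aes(x = eruptions, y = waiting, colour = "blue"))

# Aesthetics are expressions evaluated in the data, so they can be computed.
p_computed <- ggplot(faithful) +
  geom_point(aes(x = eruptions, y = waiting, colour = eruptions > 3.1))
```

Only p_setting is guaranteed to show blue points; p_aesthetic additionally gets a legend with the single entry "blue".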

3.1.2 When is what calculated/executed?

g <- ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions,y = waiting, size=eruptions/waiting))

No output - result of ggplot() captured in object g. What’s in g? Can you find ‘size’?

Insight

  • the function ggplot() returns the data and the settings/parameters/function names, not the values to be plotted.
  • the values are calculated only when the print() function is called.
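
The computed values can also be inspected without drawing anything. The sketch below uses ggplot_build() and the helper layer_data(), which returns the data of one layer of the built plot; the name built_points is only illustrative.

```r
library(ggplot2)

g <- ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting, size = eruptions / waiting))

# Building the plot evaluates the aesthetics and applies stats and scales.
g_built <- ggplot_build(g)

# layer_data() returns the computed values for one layer of the built plot.
built_points <- layer_data(g, i = 1)
head(built_points[, c("x", "y", "size")])

# The size column holds scaled point sizes, not the raw ratio eruptions/waiting.
range(built_points$size)
range(faithful$eruptions / faithful$waiting)
```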

3.1.3 Exercise - try out different geom_*

df <- data.frame(x = c(1,2,5,4,5), y = c(9, 1, 9,3,4))
base <- ggplot(df, aes(x, y))
base + geom_point()
base + geom_line()
base + geom_path()
base + geom_polygon()
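base + geom_rect(aes(xmin = x, xmax = x + 0.5, ymin = y, ymax = y + 0.5)) # geom_rect() needs all four corners; the 0.5 offsets are arbitrary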
base + geom_ribbon(aes(ymin=x,ymax=y))
base + geom_label(aes(label=paste0("(",x,",",y,")")))
library(dplyr) # provides slice() and filter()
faithful %>% slice(10)
##   eruptions waiting
## 1      4.35      85
faithful[10,]
##    eruptions waiting
## 10      4.35      85
# the layer added first (large red points) is drawn underneath the layer added last
ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting),
             data = faithful %>% dplyr::filter(eruptions > 3 & eruptions < 3.1),
             colour = "red", size = 10) +
  geom_point(mapping = aes(x = eruptions, y = waiting))

Insight

  • same data results in different figures depending on the geom.
  • one can add geom-layers on top of each other, the order plays a role.

3.2 stat_*

Data is not necessarily plotted ‘as is’. Some graphs need calculations before drawing.

ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = eruptions, y = after_stat(density)), fill="steelblue", color="black")

# Without fill="steelblue" and color="black" this looks pretty ugly - try it.

In order to draw the histogram we provide one vector of data (eruptions). There must be a statistic that calculates the height of the bars.

Task: Which specific statistic has been used? check out ?geom_histogram ! Use the stat_* function instead of the geom_* function.

Insight: In most cases there is a one-to-one mapping of (default) stat and (default) geom in ggplot, but people tend to specify the geom_* rather than the stat_*
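
The pairing of default stat and default geom can be seen on a different pair than the one in the task. A minimal sketch with geom_bar() and stat_count() on the iris data:

```r
library(ggplot2)

# geom_bar() uses stat_count() by default ...
p_geom <- ggplot(iris, aes(x = Species)) + geom_bar()

# ... and stat_count() uses geom "bar" by default.
p_stat <- ggplot(iris, aes(x = Species)) + stat_count()

# The computed variable 'count' is the same in both layers.
layer_data(p_geom)$count
layer_data(p_stat)$count
```

Both calls describe the same layer, once starting from the geometry and once from the statistic.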

Task: Which other stats does that statistic provide? Plot ‘density’ instead of ‘count’.

Say there are two regimes in eruptions, high and low. How can we mark them in different colors?

# in many cases there is a one-to-one mapping of (default) stat and (default) geom in ggplot
ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = eruptions, y = after_stat(density), fill= eruptions > 3.1),color="black")

It is often useful to overlay the histogram with a density estimate. We add another layer (which requires its own stat_*).

ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = eruptions, y = after_stat(density), fill= eruptions > 3.1),color="black") +
  geom_density(mapping = aes(x = eruptions))

Wait - something’s wrong here. The histogram area should integrate to 1, and so should the area under the density estimate. But the density curve is much lower than the bars… Why?

ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = eruptions, y = after_stat(density), 
                               fill= eruptions > 3.1),
                 color="black") +
  geom_density(mapping = aes(x = eruptions))

Task Try out, then explain:

ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = eruptions, y = after_stat(density), 
                               fill = eruptions > 4.5),
                 color="black") +
  geom_density(mapping = aes(x = eruptions))


ggplot(data = faithful, aes(x = eruptions)) +
  geom_histogram(mapping = aes(y = after_stat(density), fill = eruptions > 3.1),
                 color="black") +
  geom_density(alpha=0.4)


ggplot(data = faithful, aes(x = eruptions)) +
  geom_histogram(mapping = aes(y = after_stat(density), 
                               fill = after_stat(x) > 3.1),
                 color="black") +
  geom_density(alpha=0.4)

Insight: Mapping color, shape, or fill to a discrete variable in the data creates a grouping (like group_by in the dplyr package), and statistics are then computed per group. One way out is to apply the color after the statistics have been computed, with the after_stat() function. The manual tells you which computed variables are available, but you can also look at the built graph (i.e. the result of ggplot_build()).
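
The grouping effect can be checked numerically. The sketch below computes the total bar area per group from the built layer data; bar_area() is a hypothetical helper written for this illustration, and bins = 30 is set explicitly only to silence the default message.

```r
library(ggplot2)

# Fill mapped to the raw data: the stat is computed separately per group.
p_grouped <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(aes(y = after_stat(density), fill = eruptions > 3.1), bins = 30)

# Fill mapped after the stat: one group, coloured by the bin centre.
p_single <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(aes(y = after_stat(density), fill = after_stat(x) > 3.1), bins = 30)

# Area of the bars per group = sum of density * bin width.
bar_area <- function(p) {
  d <- layer_data(p, 1)
  tapply(d$density * (d$xmax - d$xmin), d$group, sum)
}

bar_area(p_grouped)  # one area per group
bar_area(p_single)   # a single group
```

In the grouped plot each group integrates to 1 on its own, so the stacked bars cover a total area of 2, while the ungrouped density curve covers an area of 1.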

Which elements have been specified?

  • Data, data.frame faithful
  • Mapping (aes() defines vars to coord.axes)
  • Statistics closely related to (their) geoms, yield different outputs that can be used
  • Scales
  • Geometries, geoms tell how to display data (geom_point() - scatterplot)
  • Facets
  • Coordinates
  • Theme

3.3 Facets

Facets are a mapping from data to sub-samples. Try out and explain.

data(iris)
iris

ggplot(iris) +
  geom_point(aes(Sepal.Length,Sepal.Width,color=Species)) +
  facet_wrap( ~ Species)

ggplot(iris) +
  geom_point(aes(Sepal.Length,Sepal.Width,color=Species)) +
  facet_wrap( ~ Species, scales="free")

ggplot(iris) +
  geom_point(aes(Sepal.Length,Sepal.Width)) +
  geom_smooth(data=iris, aes(Sepal.Length,Sepal.Width), method="lm") +
  facet_wrap( ~ Species)

ggplot(iris) +
  geom_point(aes(Sepal.Length,Sepal.Width,color=Species)) +
  geom_smooth(aes(Sepal.Length,Sepal.Width,color=Species), method="lm", se=FALSE) +
  geom_smooth(aes(Sepal.Length,Sepal.Width,color=Species,group=1), method="lm", se=FALSE, color="black")

ggplot(iris) +
  geom_point(aes(Sepal.Length,Sepal.Width)) +
  geom_smooth(aes(Sepal.Length,Sepal.Width,group=1), method="lm") +
  facet_wrap( ~ Species)

ggplot(iris,aes(Sepal.Length,Sepal.Width)) +
  geom_point() +
  geom_smooth(data=iris, aes(Sepal.Length,Sepal.Width),method="lm", fullrange=FALSE) +
  facet_grid( ~ Species)

ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
  geom_point(aes(color=Species)) +
  geom_smooth(data=dplyr::select(iris,-Species), method="lm",se=FALSE) +
  geom_smooth(data=iris, aes(color=Species),method="lm",se=FALSE) +
  facet_grid( ~ Species)

Insight: Facets create real data subsets, which are harder to work around than color groupings. (And you may lose a lot of time, or need good knowledge of ggplot2, if you want to do everything within ggplot2.)

Which elements have been specified?

  • Data, data.frame iris
  • Mapping (aes() defines vars to coord.axes)
  • Statistics closely related to (their) geoms, yield different outputs that can be used
  • Scales
  • Geometries, geoms tell how to display data (geom_point() - scatterplot)
  • Facets map data into sub-samples.
  • Coordinates
  • Theme

3.4 Scales

Scales map data into image values. Every aesthetic has a scale! A continuous variable in the data, say ‘eruptions’, becomes ‘x’ in the plot through an identity mapping. A categorical variable, say ‘Species’, becomes the point colour through a mapping of categories into colors.

Scale function names (mostly) consist of three parts:

  • ‘scale_’ (there are many scale_* functions),
  • the varying element in the figure: ‘x’, ‘y’, ‘colour’, ‘shape’, ‘size’, and
  • the data type: ‘continuous’, ‘discrete’.
ggplot(iris) + 
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  scale_x_continuous() + 
  scale_y_continuous() + 
  scale_colour_discrete()

Any scale can be altered (in several ways):

3.4.1 Color scales

We focus on colors, probably the most complicated scaling in practice. Coloring graphs is a world of its own: check out

Discrete colors

# assign colors by hand
ggplot(iris) + 
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  scale_colour_discrete(type = c("red","blue","green"))

# note
levels(factor(iris$Species))
## [1] "setosa"     "versicolor" "virginica"
ggplot(iris) + 
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  scale_colour_discrete(type = c(virginica="red",setosa="blue",versicolor="green"))

If there are more categories, it may be good to choose a palette (a set of colours with which to paint). RColorBrewer is a well-known package that provides multiple palettes. Note that there are many others out there. And, in principle, you can design your own.

# assign colors with palettes
# example on RColorBrewer, far more palettes are available and you can design your own
RColorBrewer::display.brewer.all()

At least 5 ways of achieving the same thing. That happens often with ggplot2 (shortcuts and alternatives are abundant).

# most verbose
ggplot(iris) + 
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  scale_colour_discrete(type=RColorBrewer::brewer.pal(n=3,name="Set2"))

# less verbose
ggplot(iris) + 
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  scale_colour_discrete(type=scale_colour_brewer, palette="Set2") 

# least verbose
ggplot(iris) + 
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  scale_colour_brewer(palette="Set2")

# note colour vs. fill
ggplot(faithful, aes(x = eruptions)) + 
  geom_histogram(aes(y=after_stat(density), fill=eruptions < 3.1),colour="black") +
  scale_fill_discrete(type=c("steelblue","red"))

# alternative via scale_colour_manual - again tweaking everything else...
ggplot(iris) + 
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  scale_colour_manual(values = c("red","blue","green"),
                      breaks = c("virginica","setosa","versicolor"),
                      name = "my species")

Continuous colors

Categorical variables need only as many colors as there are categories. Coloring continuous data requires a function that returns a color for each value.
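
Such a function can be built by hand with the scales package, which is installed together with ggplot2. This is only a sketch of the idea; whether ggplot2 takes exactly these steps internally is an assumption, and the names yellow_to_red and ratio_colours are made up.

```r
library(scales)

# A gradient palette is a function from [0, 1] to colours.
yellow_to_red <- seq_gradient_pal(low = "yellow", high = "red")

# Rescale the data to [0, 1], then ask the palette for one colour per value.
ratio <- faithful$eruptions / faithful$waiting
ratio_colours <- yellow_to_red(rescale(ratio))
head(ratio_colours)
```

The result is one hexadecimal colour code per observation, which is what a continuous colour scale has to deliver to the geom.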

# continuous
ggplot(faithful) + 
  geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting))

# default...
ggplot(faithful) + 
  geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
  scale_colour_continuous()

# change colors with gradient
ggplot(faithful) + 
  geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
  scale_colour_continuous(type="gradient", low="yellow",high="red")

# directly use gradient fct
ggplot(faithful) + 
  geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
  scale_colour_gradient(low="yellow",high="red")

# use colour brewer again... with palette Spectral
ggplot(faithful) + 
  geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
  scale_colour_continuous(type=scale_colour_distiller, palette="Spectral")

# The distiller scales extend brewer scales by smoothly interpolating 7 
# colours from any palette to a continuous scale. 
# The fermenter scales provide binned versions of the brewer scales.
# check out.
ggplot(faithful) + 
  geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
  scale_colour_continuous(type=scale_colour_fermenter, palette="Spectral")

3.4.2 Size, shape scales

These work the same way as colour scales but are simpler.

Task In faithful, vary sizes by eruptions/waiting and shapes by regime (eruptions > 3.1).

  • Scale names, name of the mapping (e.g. shown in legend)
  • Scale breaks, breakpoints of the mapping
  • Scale limits, range of mapping
ggplot(faithful) + 
  geom_point(aes(x = waiting, y = eruptions)) +
  scale_x_continuous(limits=c(50,90),trans="log10")
## Warning: Removed 27 rows containing missing values (geom_point).
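
The warning appears because scale limits discard data outside the range. A short sketch of where the removed rows come from, and of coord_cartesian() as a zooming alternative that keeps all points (the log10 transformation is left out here to keep the example simple):

```r
library(ggplot2)

# Points outside the scale limits are replaced by NA before drawing.
outside <- faithful$waiting < 50 | faithful$waiting > 90
sum(outside)  # if this reasoning holds, it equals the count in the warning

# coord_cartesian() zooms instead: all points are kept, only the view changes.
p_zoom <- ggplot(faithful) +
  geom_point(aes(x = waiting, y = eruptions)) +
  coord_cartesian(xlim = c(50, 90))
```

The distinction matters most for layers with statistics, since a stat only sees the data that survive the scale limits.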

Which elements have been used?

  • Data, data.frame faithful
  • Mapping (aes() defines vars to coord.axes)
  • Statistics closely related to (their) geoms, yield different outputs that can be used
  • Scales explicit mapping of data values to figure values
  • Geometries, geoms tell how to display data (geom_point() - scatterplot)
  • Facets, map into data sub-samples
  • Coordinates
  • Theme

3.5 Guides

Scales map from data space into aesthetic space, whereas guides map visual properties back to data.

ggplot(faithful) + 
  geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
  scale_colour_continuous(type=scale_colour_distiller, palette="Spectral") +
  guides(x=guide_axis(title="waiting (min)"),
         y=guide_axis(title="eruptions (min)"),
         colour=guide_colorbar(title = "ratio"))

ggplot(faithful) + 
  geom_point(aes(x = waiting, y = eruptions, colour=eruptions/waiting)) +
  scale_x_continuous(name="wait",breaks=c(50,70,90) ,labels=c("fifty","seventy","ninety")) +
  scale_colour_continuous(type=scale_colour_distiller, palette="Spectral") +
  guides(x=guide_axis(title="waiting (min)"),
         y=guide_axis(title="eruptions (min)"),
         colour=guide_colorbar(title = "ratio"))

TODO - expand.

3.6 Themes

There are multiple themes available (check the cheatsheet); these days theme_minimal() seems to be popular. Once a theme is added, you may alter the positioning of the different graphical elements (e.g. the legend) and how they look (fonts, sizes, etc.). Note that this is done through basic functions such as element_text(), which bundle all settings for a certain element.
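
A minimal sketch of this two-step pattern: first a complete theme, then individual adjustments through theme() and the element_*() functions. The specific choices (legend at the bottom, bold title) are arbitrary examples.

```r
library(ggplot2)

p_themed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  labs(title = "iris with theme_minimal()") +
  theme_minimal() +                                        # complete theme first
  theme(
    legend.position = "bottom",                            # move the legend
    plot.title = element_text(face = "bold", size = 14),   # restyle the title
    panel.grid.minor = element_blank()                     # drop minor grid lines
  )
```

The order matters: a complete theme added after theme() would overwrite the individual adjustments.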

ggplot theme elements TODO - expand.

4 Mastering ggplot2

# install.packages("palmerpenguins")
library(palmerpenguins)
data(package = 'palmerpenguins')

penguins

g <- ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g)) +
  geom_point()

Task: You should now be in a position to turn the figure above into the one below by using the cheatsheet and the manual (?theme).

Colors are: c("darkorange","darkorchid","cyan4")

5 Your figure HERE