ggplot Basics: Scales And Coordinates

This is part three of a series on ggplot2.

In yesterday’s post, I built a number of charts in ggplot2.  When I plotted the log of a variable, I mentioned that the way I did it wasn’t the best.  Today, we’re going to look at the right way to do it.

Scale

The first thing we’re going to look at is scaling our data.  Here’s a plot showing the relationship between GDP per capita and mean life expectancy:

Plotting mean life expectancy versus the log of per-capita GDP.

One problem with this visual is that we are making our users think a little too much.  People have trouble thinking in logarithmic terms.  If I tell you that the natural log of a value is 8.29, you probably won’t know that the value is about 3983.83 without busting out a calculator.  But that’s what I’m making people do with this chart.  So let’s fix that with a scale.

if(!require(tidyverse)) {
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    library(tidyverse)
}
 
if(!require(gapminder)) {
    install.packages("gapminder", repos = "http://cran.us.r-project.org")
    library(gapminder)
}

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10()

Using the built-in scale function instead of having us calculate the logs.

By adding one line of code, we changed the scale on the X axis from linear to logarithmic in base 10.  That gives us numbers on the X axis that we can immediately understand:  1e+04, or $10,000.  But, uh, maybe I want to see $10,000 instead of 1e+04?  Fortunately, the scale has a labels parameter that controls how the axis labels are formatted (the code below abbreviates it to label, which R’s partial argument matching accepts).  The scales package (installed alongside ggplot2 as one of its dependencies) gives us a set of pre-packaged label functions, including USD and other currency formats.  This is what the call looks like:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10(label = scales::dollar)

A useful X axis; how droll.

And now we have a version where my users don’t have to think hard about what those values mean.
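
To see what the labeller is doing, it can help to call it directly.  This is a small sketch, assuming the scales package is installed (it comes along with ggplot2); the input values are my own picks, chosen to match the powers of ten on the axis rather than taken from the data.

```r
# scales::dollar is an ordinary function: numbers in, formatted strings out.
# The values below are illustrative powers of ten, not values from gapminder.
axis_values <- c(1000, 10000, 100000)
dollar_labels <- scales::dollar(axis_values)
print(dollar_labels)
```

This gives back "$1,000", "$10,000", and "$100,000".  ggplot2 calls the same function on whatever break values it picks for the axis.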

Going Deeper With Scale

To understand the ggplot scale better, let’s take a look at what functions are available to us.

The quick summary is that most scale function names have two parts, in the form scale_<what>_<how>.  The first part describes what we want to scale, and the second part describes how we want to scale it.

First, the whats:

  1. alpha — Using alpha transparency levels to differentiate categories
  2. color — Using a color scale as a way to differentiate categories
  3. fill — Using a color fill as a way of describing a variable
  4. linetype — Using the line type (e.g., solid line, dotted line, dashed line) to differentiate categories
  5. shape — Using a shape (e.g., circle, triangle, square) to differentiate categories
  6. size — Using the size of a shape to differentiate categories
  7. x — Change the scale of the X axis
  8. y — Change the scale of the Y axis

Next, the hows, which I’ll break up into two categories.  The first category is the “differentiation” hows, which handle alpha, color, fill, linetype, shape, and size:

  1. Continuous
  2. Discrete
  3. Brewer
  4. Distiller
  5. Gradient / Gradient2 / Gradientn
  6. Grey
  7. Hue
  8. Identity
  9. Manual

And here are the X-Y hows:

  1. Continuous
  2. Discrete
  3. Log10
  4. Reverse
  5. Sqrt
  6. Date / Time / Datetime

There are a few scale functions which don’t fit this pattern (scale_radius) and a few which have “default” how values (scale_alpha, scale_size, scale_shape).  Also, not all whats intersect with hows:  for example, there is no scale_size_hue, and scale_shape_continuous exists only to raise an error, because those combinations don’t make sense.
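
If you want to check the what/how naming pattern against your own installation, here is a rough sketch that lists every exported ggplot2 function starting with scale_ and counts them by their “what.”  The variable names are mine, and the counts will vary by ggplot2 version.  The list also includes British spellings (colour) and a few helpers that don’t follow the pattern.

```r
library(ggplot2)

# Every exported ggplot2 function whose name starts with "scale_".
scale_functions <- sort(grep("^scale_", getNamespaceExports("ggplot2"), value = TRUE))

# Split "scale_<what>_<how>" on underscores and keep the <what> piece.
whats <- vapply(strsplit(scale_functions, "_"), function(parts) parts[2], character(1))

# Count how many scale functions exist for each aesthetic.
print(table(whats))
```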

Now let’s dig into these a bit more and see what we can find.

X and Y Scales Galore

We’ve already seen scale_x_log10(), which converts the X axis to a base-10 logarithmic scale.  It turns out that this is just a transformation of scale_x_continuous().  So we can rewrite it as:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_continuous(trans = "log10", label = scales::dollar)

There are approximately 15 transformations and you can build your own if you’d like.  For the most part, however, you’re probably going to use the base scale or one of the most common transformations, which have their own functions:  scale_x_log10, scale_x_sqrt, and scale_x_reverse.
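
As a sketch of what building your own transformation might look like, here is a hypothetical cube-root transformation made with scales::trans_new().  The cube root is just my illustration, not something the gapminder data calls for.  As far as I know, newer ggplot2 releases prefer the argument name transform over trans, so adjust to your version.

```r
library(ggplot2)
library(gapminder)

# A hypothetical cube-root transformation, built with scales::trans_new().
# transform maps data to axis positions; inverse maps positions back to data.
cube_root_trans <- scales::trans_new(
    name = "cube_root",
    transform = function(x) x^(1 / 3),
    inverse = function(x) x^3
)

# Pass the transformation object to the continuous scale.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    scale_x_continuous(trans = cube_root_trans, labels = scales::dollar)
```

Printing p draws the chart.  The two functions need to be inverses of one another, because ggplot2 uses the inverse to turn axis positions back into labels.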

We can also handle dates and times in ggplot2.  Looking at the example for scaling dates, let’s start with the following code:

time_frame <- Sys.Date() - 0:31
df <- data.frame(
    date = time_frame,
    price = runif(32)
)
ggplot(df, aes(date, price)) +
    geom_line()

By default, ggplot2 handles dates sensibly.

This graph works, and you can see that it labels each week.  If we slap on scale_x_date(), this doesn’t change—that tells us that ggplot2 will use the default date scale for a date type; the scale_x_date function is there if we want to make modifications.

For example, suppose we have an event with a periodicity of eight days—like you send a crew out to a site for eight days of work and then send out the next crew and ship the last one back.  We can use the date_breaks parameter to show date breaks every eight days instead of the default:

ggplot(df, aes(date, price)) +
    geom_line() +
    scale_x_date(date_breaks = "8 days")

Make our own periodicity.

On this graph, you can see white lines that split our dates but don’t have labels; these are called minor breaks.  By default, there is one minor break halfway between each pair of major breaks.  Let’s say that instead, we want to show a minor break every 2 days.  We can use the date_minor_breaks parameter to set those.

ggplot(df, aes(date, price)) +
    geom_line() +
    scale_x_date(date_breaks = "8 days", date_minor_breaks = "2 day")

Periods within periods.

Finally, let’s say that we want to look at the middle two segments of our 32-day period.  We can use the limits parameter to define how much of the space we’d like to see:

ggplot(df, aes(date, price)) +
    geom_line() +
    scale_x_date(date_breaks = "8 days", date_minor_breaks = "2 day", limits = c(Sys.Date() - 24, Sys.Date() - 8))

Focusing in on a particular range.
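
One more knob on scale_x_date that pairs well with these is date_labels, which takes a strftime-style format string.  Here is a sketch that combines it with the breaks and limits from above.  The "%b %d" format and the set.seed call are my own additions for illustration.

```r
library(ggplot2)

set.seed(42)  # make the random prices reproducible
df <- data.frame(
    date = Sys.Date() - 0:31,
    price = runif(32)
)

# The 16-day window from the example above.
window <- c(Sys.Date() - 24, Sys.Date() - 8)

p <- ggplot(df, aes(date, price)) +
    geom_line() +
    scale_x_date(
        date_breaks = "8 days",
        date_minor_breaks = "2 days",
        date_labels = "%b %d",  # abbreviated month name and day of month
        limits = window
    )
```

Printing p draws the chart.  Expect a warning about removed rows when you do, because setting limits on a scale drops the data outside of that range.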

Brewing Colors

Another area of intrigue is coloration.  ggplot2 will give you some colors by default, but you may not want to use them.  You can specify your own colors if you’d like, or you can ask Color Brewer for help.  For example, suppose I want to segment my gapminder data by continent, displaying each continent as a different color.  I can use the scale_color_brewer function to generate an appropriate set of colors for me, and it adds just one more line of code.

This function has two important parameters:  the type of data and the palette you wish to use.  By default, ggplot2 assumes that you’re sending sequential data.  You can also tell it that you are graphing divergent data (commonly seen in two-party electoral maps as the percentage margin of victory for each candidate) or that you have qualitative data (typically unordered categorical data).  In this case, I’m going to show all three even though the data doesn’t really fit two of them.

First up, here’s a sequential palette of various greens, starting from the lightest green and going darker based on continent name.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(mapping = aes(color = continent)) +
    scale_color_brewer(type = "seq", palette = "Greens") +
    geom_smooth(method = "lm", se = FALSE)

A sequential color map.

Next up, here’s what it looks like if you use a divergent color scheme, where names closer to A have shades of orange and names closer to Z have shades of purple.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(mapping = aes(color = continent)) +
    scale_color_brewer(type = "div", palette = "PuOr") +
    geom_smooth(method = "lm", se = FALSE)

Divergent colors, ranging from orange to purple.

Finally, we have a qualitative color scheme, which actually matches our data.  The five continents are unordered categories rather than points along a continuum, so we’d want five distinct colors to show our results.  Note that I’ve re-introduced alpha values here because these are solid colors and I want to be able to see some amount of interplay:

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    geom_smooth(method = "lm", se = FALSE)

Qualitative coloration fits our data set the best.
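
If you are curious which colors each palette hands back, you can ask RColorBrewer directly.  This is a sketch, assuming RColorBrewer is installed (it is a dependency of scales, so it normally comes along with ggplot2).  The variable names are mine.

```r
library(RColorBrewer)

continent_count <- 5  # gapminder has five continents

greens <- brewer.pal(continent_count, "Greens")  # sequential
pu_or  <- brewer.pal(continent_count, "PuOr")    # diverging
dark2  <- brewer.pal(continent_count, "Dark2")   # qualitative

# One row of hex color codes per palette.
print(rbind(greens, pu_or, dark2))
```

Each row holds five hex codes.  The sequential and diverging rows step gradually from one end to the other, while the qualitative row is built from unrelated hues.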

Note that all of the code above uses the scale_color_brewer function because we’re colorizing points.  If you want to colorize a bar graph or some other 2D structure, you’ll want to use scale_fill_brewer to colorize the filled-in portion and scale_color_brewer if for some reason you’d like the outline to be a different color.

For example, here is a bar chart of life expectancy by continent in 1952.  I’m setting the color to continent and have set the overall fill to white so you can see the coloration.

lifeExp_by_continent_1952 <- gapminder %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp)) %>%
  select(continent, avg_lifeExp)

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, desc(avg_lifeExp)), y = avg_lifeExp)) +
  geom_col(mapping = aes(color = continent), fill = "White") +
  scale_color_brewer(type = "seq", palette = "Greens")

The color parameter on a 2D object like a bar only colors the outline of the object.

That doesn’t look like what we intended.  This is more like it:

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, desc(avg_lifeExp)), y = avg_lifeExp)) +
  geom_col(mapping = aes(fill = reorder(continent, desc(avg_lifeExp)))) +
  scale_fill_brewer(type = "seq", palette = "Greens", direction = -1)

Building a color scheme to fit our chart.

Note that I re-used the continent order to define colors in terms of mean life expectancy rather than alphabetically.  I also set the direction parameter on scale_fill_brewer to -1, which means to reverse colors.  By default, color brewed results go from light to dark, but here I want them to go from dark to light, so I reversed the direction.
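
To convince yourself of what direction = -1 does, here is a small sketch using scales::brewer_pal, the palette generator from the scales package.  My assumption is that it behaves the same way as the palette behind scale_fill_brewer.

```r
# brewer_pal() returns a function that produces n colors from a palette.
light_to_dark <- scales::brewer_pal(type = "seq", palette = "Greens")(5)
dark_to_light <- scales::brewer_pal(type = "seq", palette = "Greens", direction = -1)(5)

print(light_to_dark)
print(dark_to_light)
```

The second vector is the first one in reverse order.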

Shapes And Sizes

You can plot data according to shape and size as well.  As far as shape goes, there are only a few options.  By default, we can have our scatter plot points show continents as different shapes using the following code:

ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(shape = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(label = scales::dollar)

Using shapes to differentiate continents.

We can also choose whether to use solid or hollow shapes with the solid flag on scale_shape:

ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(shape = continent)) +
  scale_shape(solid = FALSE) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(label = scales::dollar)

Using hollowed-out shapes to differentiate continents.

You can also set the size of a point using the size aesthetic, and can use scale_size to control the size.  Here’s an example where point size varies by continent:

ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(size = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(label = scales::dollar)

Using size to differentiate continents. Not really a good idea at all, but you can do it.

Shapes tend to show up in black-and-white graphs of categorical data—like our continents—and sizes tend to show up with continuous variables.  In fact, when you try to run the code above, you get a warning:  “Using size for a discrete variable is not advised.”  It’s a bad practice, but it is something that you can do if you really want to.
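
For contrast, here is a sketch of size used the way it tends to work best, mapped to a continuous variable.  I’m using population from the same gapminder data.  The range values are arbitrary picks of mine, and scales::comma is there only to make the legend readable.

```r
library(ggplot2)
library(gapminder)

gapminder_1997 <- subset(gapminder, year == 1997)

p <- ggplot(data = gapminder_1997, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(size = pop)) +
    # range sets the smallest and largest point sizes; these values are arbitrary picks.
    scale_size(range = c(1, 10), labels = scales::comma) +
    scale_x_log10(labels = scales::dollar)
```

Printing p draws the chart, this time without the warning about discrete variables.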

Coordinates

The other thing I want to cover today is coordinate systems.  The ggplot2 documentation shows seven coordinate functions.  There are good reasons to use each, but I’m only going to demonstrate one.  By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space.  This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the xlim and ylim parameters.
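
Here is a sketch of changing the viewing area with coord_cartesian, next to the scale-based alternative.  The 40 to 80 range is my own pick for illustration.  The difference matters when you have a trend line: the coordinate system zooms in on the plot, whereas limits on a scale throw away the rows outside the range before anything gets calculated.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10(labels = scales::dollar)

# Zoom: every row still feeds the trend line; only the visible window changes.
zoomed <- base_plot + coord_cartesian(ylim = c(40, 80))

# Filter: rows outside 40..80 are dropped before the trend line is fitted.
filtered <- base_plot + scale_y_continuous(limits = c(40, 80))

# How many rows the second version would discard.
rows_outside <- sum(gapminder$lifeExp < 40 | gapminder$lifeExp > 80)
print(rows_outside)
```

Printing zoomed and filtered one after the other shows two different trend lines, because the second was fitted on less data.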

The special coordinate system I want to point out is coord_flip, which flips the X and Y axes.  This allows us, for example, to turn a column chart into a bar chart.  Taking our life expectancy by continent data, I can create a bar chart, whereas before we’ve been looking at column charts.  I can use coord_flip to switch the X and Y axes:

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, avg_lifeExp), y = avg_lifeExp)) +
  geom_col() +
  coord_flip()

Creating a bar chart instead of a column chart.

Now we have a bar chart.  With coord_flip(), we can easily create bar charts or Cleveland dot plots.

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, avg_lifeExp), y = avg_lifeExp)) +
  geom_point(size = 4) +
  coord_flip()

A sample Cleveland dot plot.
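
For the dot plot in particular, there is an alternative to coord_flip.  This is a sketch rather than a recommendation: because geom_point is happy with a discrete variable on either axis, we can map the continent to Y directly.  I have rebuilt the summary data frame here so the block runs on its own.

```r
library(ggplot2)
library(gapminder)
library(dplyr)

lifeExp_by_continent_1952 <- gapminder %>%
    filter(year == 1952) %>%
    group_by(continent) %>%
    summarize(avg_lifeExp = mean(lifeExp))

# Same dot plot, but with the continent mapped to y directly instead of flipping.
p <- ggplot(data = lifeExp_by_continent_1952,
            mapping = aes(x = avg_lifeExp, y = reorder(continent, avg_lifeExp))) +
    geom_point(size = 4)
```

Printing p should give the same picture as the coord_flip version, apart from the axis titles.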

Conclusion

In today’s post, we looked at some of the more common scale and coordinate functions in ggplot2.  There are quite a few that I did not cover, but I think this gives us a pretty fair idea of what we can create from this library.   In the next post, I will look at labels and annotations.
