This is part two of a series on ggplot2.

In today’s post, we will look at some of the basics of ggplot. As mentioned in the previous post, ggplot builds a graph out of a number of layers. We will focus on two of these layers: the basic mapping layer and the geometric object (geom) layer.

Loading Libraries

The first step is to load the libraries that we’ll use today. I’m using the tidyverse library, which includes a number of useful packages including ggplot2. I’m also loading the gapminder library, which has interesting cross-country data recorded every five years from 1952 through 2007.

if(!require(tidyverse)) {
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    library(tidyverse)
}

if(!require(gapminder)) {
    install.packages("gapminder", repos = "http://cran.us.r-project.org")
    library(gapminder)
}

Once I have these libraries loaded, I can start messing around with the data.

Let’s first take a quick look at what’s in the gapminder data frame, using the head and glimpse functions:
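A minimal sketch of those two calls, assuming the tidyverse and gapminder libraries loaded above (dplyr is attached directly here only so the block runs on its own):

```r
library(dplyr)      # provides glimpse(); also attached by library(tidyverse)
library(gapminder)  # provides the gapminder data frame

# First six rows, one row per country-year combination
head(gapminder)

# One line per column, showing each column's type and first few values
glimpse(gapminder)
```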

A quick glimpse at the gapminder data set.

We can see that there are three descriptive variables—country, continent and year—and three measures of interest—life expectancy, population, and GDP per capita.  The gapminder data set collects data for countries around the world from the period 1952 through 2007.

Mapping Data

The first question I want to ask is, what was the average life expectancy by continent in 1952?  I’m imagining a column chart to display this data (for a brief review of column charts, check out my post on various types of visuals).

The first thing we want to do is to get our data into the right format. We’ll see some of ggplot2’s ability to reshape data soon, but I want to start by feeding it the final data set, as that makes it easier for us to follow.

lifeExp_by_continent_1952 <- gapminder %>%
    filter(year == 1952) %>%
    group_by(continent) %>%
    summarize(avg_lifeExp = mean(lifeExp)) %>%
    select(continent, avg_lifeExp)

Now that we have our source data prepared, we can get to work on plotting. The first thing we need is a mapping. Here’s the mapping that we’re going to use:

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = continent, y = avg_lifeExp))

We have two parameters that we’re passing into the ggplot function: data and mapping. We can specify the data frame that we’re using as the data parameter; doing so means that we do not need to keep specifying it down the line. The mapping, as you’ll recall, ties columns in the data to visual properties of the graph. Specifically, we’re going to define an aesthetic which lays out what the X and Y axes contain: the X axis will list the individual continents, and the Y axis will cover average life expectancy.
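Because data and mapping are the first two arguments of ggplot, and x and y are the first two arguments of aes, the argument names can also be dropped. A minimal sketch, rebuilding the lifeExp_by_continent_1952 data frame so the block runs on its own; the names p_named and p_positional are just illustrative:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Same summary table as above
lifeExp_by_continent_1952 <- gapminder %>%
    filter(year == 1952) %>%
    group_by(continent) %>%
    summarize(avg_lifeExp = mean(lifeExp))

# Fully named arguments, as in the post
p_named <- ggplot(data = lifeExp_by_continent_1952,
                  mapping = aes(x = continent, y = avg_lifeExp))

# Positional shorthand: data first, then the aesthetic mapping with x and y in order
p_positional <- ggplot(lifeExp_by_continent_1952, aes(continent, avg_lifeExp))
```

The named form is longer but makes it obvious which column lands on which axis, which is why this series sticks with it.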

If we run the ggplot call by itself, with no geoms attached, we already have a visual:

What we get when we specify a mapping with X and Y values.

Yeah, it’s not a very useful visual, but it’s a start.

Adding Geometric Objects

The next step is to add a geometric object.  I mentioned that I want a column for each continent, and that column’s value represents the average life expectancy.  This leads to our first geom:  the column.

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = continent, y = avg_lifeExp)) +
    geom_col()

We can display a simple column chart in two lines of code.

This gives us a column chart. The order of the columns is alphabetical, which could be the way that we want to display the data, but probably isn’t. We probably want to reorder the X axis by average life expectancy descending, and that’s what the reorder function lets us do.

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, desc(avg_lifeExp)), y = avg_lifeExp)) +
    geom_col()
Reordering the column chart by mean life expectancy helps us make better sense of the data.

This might not be absolutely necessary for this particular visual, but it’s a good principle to follow and definitely helps us when there are more categories or several categories which are very close in mean life expectancy.
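To see what reorder is doing outside of the plot, here is a small sketch that compares the factor levels before and after. It rebuilds the summary table so that it runs on its own; the reordered variable is just an illustrative name.

```r
library(dplyr)
library(gapminder)

lifeExp_by_continent_1952 <- gapminder %>%
    filter(year == 1952) %>%
    group_by(continent) %>%
    summarize(avg_lifeExp = mean(lifeExp))

# reorder() returns a factor whose levels are sorted by the second argument;
# desc() flips that sort so the largest average comes first
reordered <- reorder(lifeExp_by_continent_1952$continent,
                     desc(lifeExp_by_continent_1952$avg_lifeExp))

levels(lifeExp_by_continent_1952$continent)  # original, alphabetical order
levels(reordered)                            # highest average life expectancy first
```

ggplot2 draws a discrete axis in factor-level order, which is why changing the levels changes the column order.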

More Geometric Objects

We’re going to look at a few more geoms.  If you want to see even more, check out the ggplot2 cheat sheet or sape’s geom reference.

Scatter Plots And Smoothers

Let’s say that I want to test a conjecture that higher GDP per capita (measured here in USD) correlates with higher life expectancy.  I can plot out GDP per capita versus life expectancy pretty easily.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point()
Comparing GDP to average life expectancy.

Well, that’s not extremely clear… But I made a mistake that should be clear to people who have done data analytics on economic data: monetary values typically should be plotted on a logarithmic scale. There’s a good way to do this but we won’t cover it today. I’m going to cover the bad method today and save the good method for the next post in the series. The bad method is to modify the X-axis variable and call the log function:

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point()
Comparing the log of GDP to mean life expectancy.

Now we see a much clearer relationship between the log of GDP per capita and average life expectancy.  It’s not a perfect relationship, but there’s definitely a positive line that we could draw.  So let’s draw that positive line!

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
Adding a regression line to our scatter plot.

We have used a new geom here, geom_smooth. The geom_smooth function creates a smoothed conditional mean: it fits a function to the input data and draws the resulting line. Notice that there are two parameters that I set: method and se. The method parameter tells the function which fitting method to use. Common choices include a Generalized Additive Model (gam), Locally Weighted Scatterplot Smoothing (loess), and three varieties of linear models (lm, glm, and rlm, the last of which comes from the MASS package). The se parameter controls whether we want to see the confidence band around the line, which ggplot2 draws at the 95% level by default. Here’s our graph with that band turned on:

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE)
Showing the standard error bar on our smoothed conditional mean geom.
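As a rough check on what method = "lm" is drawing, the same regression can be fit directly with base R. This is a sketch rather than what geom_smooth runs internally, and the GDP levels in new_points are arbitrary illustration values.

```r
library(gapminder)

# The same relationship the smoother is given: lifeExp against log(gdpPercap)
fit <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
coef(fit)  # intercept and slope of the fitted line

# Confidence interval around the fitted mean at a few GDP per capita levels
new_points <- data.frame(gdpPercap = c(1000, 10000, 40000))
ci <- predict(fit, newdata = new_points, interval = "confidence")
ci  # columns: fit (the line), lwr and upr (the edges of the band)
```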

This chart also represents the first time that we’ve created a complex visual: one with dots as well as a line. It was really easy to create because we can lay out the two layers independently of one another: I can have geom_point() without geom_smooth() or vice versa, so if I need to work on one layer, I can comment out the other and hide it until I’m ready. This also allowed us to step through the visual iteratively.
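One way to work iteratively like this, sketched here with illustrative variable names, is to store the shared mapping once and add layers to it as needed:

```r
library(ggplot2)
library(gapminder)

# Shared data and mapping, with no geoms yet
base_plot <- ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp))

# Each layer can be added on its own...
points_only <- base_plot + geom_point()
line_only <- base_plot + geom_smooth(method = "lm", se = FALSE)

# ...or combined, drawn in the order they are added
both <- base_plot + geom_point() + geom_smooth(method = "lm", se = FALSE)
```

Typing any of these variable names at the console renders that version of the plot.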

Let’s turn off standard errors again and look at the scatter plot.  One trick we can use to see the line more clearly is to change the alpha channel for our scatter plot dots.  We can use the alpha parameter on geom_point to do just this.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE)
Turning down the alpha channel on our scatter plot makes it easier to see overlap.

Now we can see the line more clearly without losing the scatter plot.  This has a second beneficial effect for us:  there was some overplotting of dots, where several country-year combos had roughly the same GDP and life expectancy.  By toning down the alpha channel a bit, we can see the overlap much more clearly.

Zooming in a bit, let’s filter down to one country, Germany. We could create a new data frame with just the Germany data, but we don’t need to do that. We can simply use the filter function in dplyr inside the ggplot call and get data for Germany:

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point()
Filtering down to just one country.
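An equivalent way to write this, if you prefer to read the steps left to right, is to pipe the filtered data into ggplot. This is a sketch of an alternative style, and germany_plot is just an illustrative name.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# The pipe passes the filtered rows in as ggplot's data argument;
# layers are still added with + rather than %>%
germany_plot <- gapminder %>%
    filter(country == "Germany") %>%
    ggplot(mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point()

germany_plot
```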

Note that I switched back to normal format for GDP per capita so we can see dollar amounts.
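If you do want to read dollar amounts off one of the earlier log-scale charts, keep in mind that R’s log function is the natural logarithm, so exp reverses it. A small sketch, where the axis positions 6, 8, and 10 are just example values:

```r
library(gapminder)

# Taking the log and then exponentiating recovers the original values
log_gdp <- log(gapminder$gdpPercap)
all.equal(exp(log_gdp), gapminder$gdpPercap)

# Dollar values behind a few positions on the log(gdpPercap) axis
exp(c(6, 8, 10))
```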

Other Charts

Creating a line chart is pretty easy as well.  Let’s graph GDP per capita in Germany from 1952 to 2007.

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) +
    geom_line()
A line chart of Germany’s GDP per capita over time.

We can easily switch it over to an area chart with geom_area:

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) +
    geom_area(alpha = 0.4)
An area chart of Germany’s GDP per capita over time.

Note that I’ve set the alpha on geom_area, mostly so that the amount of black doesn’t overwhelm the eyes.

We can create a step chart as well. A step chart shows the magnitude of each period-to-period change a little more clearly than a line chart does:

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) +
    geom_step()
A step chart of Germany’s GDP per capita over time.

We can also compare what the step output looks like versus the line output.  In this case, I’m coloring the step output as red and leaving the line as black.

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) +
    geom_step(color="Red") +
    geom_line()
Combining line and step charts. Note how the area compares between the two, making the step changes look sharper than the line changes.

Taking another tack, let’s see what the spread for GDP per capita is across continents in the year 1997.  We will once again use the log of GDP, and will create a box and whiskers plot using geom_boxplot.

ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = continent, y = log(gdpPercap))) +
    geom_boxplot()
A box plot of GDP in 1997.
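To put numbers on what each box shows, here is a sketch that computes the quartiles of log GDP per capita by continent with dplyr. The box edges in the plot should correspond to q1 and q3 and the middle bar to the median; gdp_spread_1997 is just an illustrative name.

```r
library(dplyr)
library(gapminder)

gdp_spread_1997 <- gapminder %>%
    filter(year == 1997) %>%
    group_by(continent) %>%
    summarize(
        q1 = quantile(log(gdpPercap), 0.25),     # lower edge of the box
        median = median(log(gdpPercap)),         # bar inside the box
        q3 = quantile(log(gdpPercap), 0.75)      # upper edge of the box
    )

gdp_spread_1997
```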

Conclusion

In today’s post, we looked at several of the geometric objects available within ggplot2. We’re able to create simple but functional graphs with just two or three lines of code. Starting with the next post, we’ll begin to improve some of these charts by looking at scales and coordinates.
