Tidy Data And Normalization

In Hadley Wickham’s paper on tidy data, he makes a few points that I really appreciated.

Data sets are made up of variables and observations.  In the database world, we’d call variables attributes and observations entities.  In the spreadsheet world, we’d call variables/attributes columns and observations/entities rows.

Each variable contains all values which measure the same underlying attribute across units.  For example, a variable might be the height of a person.  In that case, every instance of that variable should be the height of a person.  You should not intersperse measures, like having the variable contain height for a person, wingspan for a bird, and hind leg length for a dog.

Each observation contains all values measured on the same unit.  For example, we might have a person, and different variables which represent the person:  height, weight, wingspan, primary handedness, maximum number of ice cream cones consumed in a single sitting, etc.  We should not have data for two separate people in the same observation; each person, in this case, gets his own observation.

The reason we want to arrange our data this way is that it makes life easier for us.  First, it is easier to describe relationships between variables.  For example, your age is a function of your date of birth and the current date.  If we have date of birth and current date as two variables, we can easily calculate age.  Here we can see it in R and SQL:

people$age <- as.double(difftime(people$current_date, people$date_of_birth, units = "days")) / 365.25

SELECT
	DateOfBirth,
	CurrentDate,
	DATEDIFF(DAY, DateOfBirth, CurrentDate) / 365.25 AS Age
FROM dbo.Person;

Second, it is easier to make comparisons between observations.  For example, we can easily determine how many people are using a particular telephone number:

telephones %>%
  group_by(telephone_number) %>%
  summarize(number_of_users = n()) %>%
  select(telephone_number, number_of_users)

SELECT
	TelephoneNumber,
	COUNT(1) AS NumberOfUsers
FROM dbo.Telephone
GROUP BY
	TelephoneNumber;

The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data.  Specifically, Wickham argues that tidy data should achieve third normal form.

Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together, but the way we denormalize still retains a fairly normal structure:  we still treat observations and variables like we would in a normalized data structure, so we don’t try to pack multiple observations in the same row or multiple variables in the same column, reuse a column for multiple purposes, etc.
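
As a rough illustration of that last point, the sketch below starts from a data frame that hides one variable (year) in its column names and reshapes it so that each row is a single observation.  The data frame, its column names, and its values are all made up for this example; pivot_longer() comes from the tidyr package.

```r
library(tidyr)

# Hypothetical untidy data: the year variable is hidden in the column names.
untidy <- data.frame(
    person = c("Ann", "Bob"),
    height_2016 = c(170, 180),
    height_2017 = c(171, 180)
)

# Reshape so that each row is one person-year observation and
# each column (person, year, height) holds exactly one variable.
tidy <- pivot_longer(
    untidy,
    cols = c(height_2016, height_2017),
    names_to = "year",
    names_prefix = "height_",
    values_to = "height"
)
tidy$year <- as.integer(tidy$year)
```

The wide version is fine for reading by eye, but the long version is the one that works naturally with group_by() and with ggplot2 aesthetics.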

Next time around, I’m going to make an argument that 3NF isn’t where we need to be, that there’s a better place for those analytics cool kids to hang out.

ggplot2: Radar Love

This is part eight of a series on ggplot2.

As I bring this series to a close, I want to show off one last geom:  the radar chart.  I’m a fan of radar charts and you can build them natively with ggplot, but there is also an extension called ggradar.

This brings me to a bit of a sidebar.  During the course of this series, I’ve looked at several ggplot2 extensions:  ggthemes, ggrepel, and cowplot.  These extensions and more are available at the ggplot2 extensions gallery.  There are some quite good extensions here and if you’re struggling to conceptualize a graph in ggplot2, this might give you other alternatives.

Anyhow, let’s get our radar chart on.

Building A Radar Chart

The first trick to a good radar chart is having normalized data, where everything is scaled on a common range.  Typically, we standardize data on a range from 0 to 1, where 1 is the largest value in the set and the point value for each entity is its value divided by the largest value in the set.  So for example, in the gapminder data set, Norway’s per-capita GDP in 2007 is $49,357.19.  The United States is 4th at $42,951.65.  Standardized values would set Norway = 1.0 and the US at 0.87.
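
To make the arithmetic concrete, here is a small sketch using only the two figures quoted above.  The vector name is made up for this example, and Norway is the maximum here simply because it is the larger of the two values in the vector.

```r
# Per-capita GDP figures for 2007 quoted above.
gdp_percap <- c(Norway = 49357.19, `United States` = 42951.65)

# Divide each value by the largest value in the set.
standardized <- gdp_percap / max(gdp_percap)
round(standardized, 2)
```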

The second trick to a good radar chart is to have several variables of interest and a relatively small number of observations to track.  With gapminder, we can group by continent so we only have five observations, and can track a few variables for the year 2007:  number of countries, min and max life expectancy, average population, and min and max GDP per capita.

Here’s the code:

devtools::install_github("ricardo-bion/ggradar", dependencies=TRUE)
library(tidyverse)
library(gapminder)

standardize <- function(x){
    x/max(x)
}

radar_data <- gapminder %>%
    filter(year == 2007) %>%
    group_by(continent) %>%
    summarize(
        n = n(),
        minLife = min(lifeExp),
        maxLife = max(lifeExp),
        meanPop = mean(pop),
        minGdpPercap = min(gdpPercap),
        maxGdpPercap = max(gdpPercap)
    ) %>%
    mutate_each_(
        funs(standardize(.) %>% as.vector),
        vars=c("n", "minLife", "maxLife", "meanPop", "minGdpPercap", "maxGdpPercap")
    )

ggradar::ggradar(
    plot.data = radar_data,
    font.radar = "Gill Sans MT",
    grid.label.size = 6,
    axis.label.size = 5,
    group.point.size = 3,
    group.line.width = 1,
    legend.text.size = 12
)
Figure: Things I can’t quit:  radar charts.

First, we need to install ggradar and load our relevant libraries. Then, I create a quick standardization function which divides each value in a vector by the largest value in that vector. It doesn’t handle niceties like a maximum of zero, which would mean dividing by 0, but none of the variables in our data frames max out at zero.
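
If you did need to guard against that case, one possible approach is sketched below.  The function name standardize_safe and the choice to return NA values are my own assumptions for this example, not part of the original code, and it still assumes non-negative inputs.

```r
# Hypothetical variant of standardize(): returns NA values instead of
# dividing by zero when the largest value in the vector is 0.
standardize_safe <- function(x) {
    max_value <- max(x, na.rm = TRUE)
    if (max_value == 0) {
        return(rep(NA_real_, length(x)))
    }
    x / max_value
}

standardize_safe(c(2, 4, 8))
standardize_safe(c(0, 0, 0))
```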

The radar_data data frame starts out simple: build up some stats by continent. Then I call the mutate_each_ function to call standardize for each variable in the vars set. mutate_each_ is deprecated and I should use something different like mutate_at, but this does work in the current version of dplyr at least.
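
For reference, here is a sketch of the same pipeline written with mutate() and across() in place of the deprecated call.  It assumes dplyr 1.0.0 or later, where across() is available, and should produce the same standardized columns.

```r
library(dplyr)
library(gapminder)

standardize <- function(x) {
    x / max(x)
}

radar_data <- gapminder %>%
    filter(year == 2007) %>%
    group_by(continent) %>%
    summarize(
        n = n(),
        minLife = min(lifeExp),
        maxLife = max(lifeExp),
        meanPop = mean(pop),
        minGdpPercap = min(gdpPercap),
        maxGdpPercap = max(gdpPercap)
    ) %>%
    # Apply standardize() to every column except the continent label.
    mutate(across(-continent, standardize))
```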

Finally, I call the ggradar() function. This function has a large number of parameters, but the only one you absolutely need is plot.data. I decided to change the sizes because by default, it doesn’t display well at all on Windows.

What we end up with is a fun radar chart, letting us see how each continent stacks up.  We can see that Africa has the most countries and Oceania the fewest; Oceania only has two countries in the data set, Australia and New Zealand.

I should note that the “minimum” variables are for the lowest value within that continent, so the lowest GDP per capita in Oceania is significantly higher than the lowest GDP per capita in any other continent.  The easiest way to think about it is to consider these the best of the worst per continent.  The other variables work as you’d expect.

Conclusion

In today’s post, we took a look at a ggplot2 extension, ggradar.  These extensions give us an easy way to get functionality which we could otherwise build only with great difficulty, or maybe not at all.  You don’t need to use extensions to create good visuals, but knowing where they are and when to use them can make a big difference.

ggplot2: cowplot

This is part seven of a series on ggplot2.

Up to this point, I’ve covered what I consider to be the basics of ggplot2.  Today, I want to cover a library which is still easy to use, but helps you create more advanced visuals:  cowplot.  I was excited by the name cowplot, but once I learned that it had nothing to do with cattle (instead, the author’s name is Claus O. Wilke), that did diminish the charm a little bit.  Nevertheless, there are a couple of great things you can do with this library and we’ll see one of them today.

If you’re interested in cowplot, I recommend reading the vignette first, as it provides several useful examples.  For our case, we are going to use cowplot to stack two related charts.

Charting Genocide

To this point, we have been using the gapminder data set to compare GDP and life expectancy across continents, but without looking at any countries in particular.  In today’s post, I want to show a comparison between one country and the world.

First up, let’s load our libraries:

library(tidyverse)
library(gapminder)
library(ggthemes)
library(ggrepel)  # provides geom_text_repel(), used in the plots below
library(extrafont)
if(.Platform$OS.type == "windows") {
    loadfonts(device="win")
} else {
    loadfonts()
}

Next up, I want to build a plot showing GDP and life expectancy changes over time across the globe.  The gapminder data set has a number of individual year-GDP-expectancy points, so we’re going to summarize them first in a data frame.  After I do that, I will plot them using ggplot.

global_avg <- gapminder %>%
    group_by(year) %>%
    summarize(m_lifeExp = mean(lifeExp), m_gdpPercap = mean(gdpPercap)) %>%
    select(year, m_lifeExp, m_gdpPercap)

plot_global <- ggplot(data = global_avg, mapping = aes(x = m_gdpPercap, y = m_lifeExp)) +
    geom_point() +
    geom_path(color = "#999999") +
    scale_x_continuous(label = scales::dollar) +
    geom_text_repel(
        mapping = aes(label = year),
        nudge_y = 0.7,
        nudge_x = -120,
        segment.alpha = 0,
        family = "Gill Sans MT",
        size = 4
    ) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        caption = "Source:  Gapminder data set, 2010"
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )
plot_global
Figure: Global changes in GDP and life expectancy over time

Notice that I used geom_path().  This is a geom I did not cover earlier in the series.  It’s not a common geom, though it does show up in charts like this where we want to display data for three variables.  The geom_line() geom follows the basic rules for a line:  that the variable on the y axis is a function of the variable on the x axis, which means that for each element of the domain, there is one and only one corresponding element of the range (and I have a middle school algebra teacher who would be very happy right now that I still remember the definition she drilled into our heads all those years ago).

But when you have two variables which change over time, there’s no guarantee that this will be the case, and that’s where geom_path() comes in.  The geom_path() geom does not plot y based on sequential x values, but instead plots values according to a third variable.  The trick is, though, that we don’t define this third variable—it’s implicit in the data set order.  In our case, our data frame comes in ordered by year, but we could decide to order by, for example, life expectancy by setting data = arrange(global_avg, m_lifeExp).  Note that in a scenario like these global numbers, geom_line() and geom_path() produce the same output because we’ve seen consistent improvements in both GDP per capita and life expectancy over the 55-year data set.  So let’s look at a place where that’s not true.
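
To see the difference in isolation, here is a minimal sketch with a made-up four-row data frame in which x rises, falls, and rises again.  The data frame and its values are hypothetical; layer_data() is a ggplot2 helper which returns the data a layer will actually draw.

```r
library(ggplot2)

# Hypothetical data where x rises, falls, then rises again over time.
wander <- data.frame(
    year = c(2000, 2001, 2002, 2003),
    x = c(1, 3, 2, 4),
    y = c(1, 2, 3, 4)
)

# geom_line() connects the points in order of x.
plot_line <- ggplot(wander, aes(x = x, y = y)) + geom_line()

# geom_path() connects the points in row order (here, by year).
plot_path <- ggplot(wander, aes(x = x, y = y)) + geom_path()

# Compare the order in which each geom will visit the x values.
layer_data(plot_line)$x
layer_data(plot_path)$x
```

The path version doubles back on itself between the second and third rows, which is the behavior we want when the real ordering variable is time.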

ggplot(data = filter(gapminder, country == "Cambodia"), mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    geom_path(color = "#999999") +
    scale_x_continuous(label = scales::dollar) +
    geom_text_repel(
        mapping = aes(label = year),
        segment.alpha = 0,
        family = "Gill Sans MT",
        size = 4
    ) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        caption = "Source:  Gapminder data set, 2010"
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )
Figure: Cambodian changes in GDP and life expectancy over time

Cambodia starts out similar to the rest of the world, seeing some growth in GDP per capita and life expectancy through 1967, but then suffers a precipitous drop in both during the 1970s.  The reason was the Khmer Rouge, one of the nastiest communist governments.  This graph alone is evidence of disaster, but I really want to drive the point home:  I want a direct comparison between what happened in Cambodia versus the rest of the world at the same time, and that’s where cowplot comes in.

Plotting A Grid

We’ve seen facet_wrap() and facet_grid() already in ggplot2, but cowplot’s plot_grid() has something very helpful for us:  the rel_heights parameter.  This lets us state what share of the total vertical space each chart should take.  Let’s take the global plot, attach the Cambodian plot, and clean up titles and axes.  Then we’ll call cowplot’s plot_grid() function.  Here’s the full code:

plot_cambodia <- ggplot(data = filter(gapminder, country == "Cambodia"), mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    geom_path(color = "#999999") +
    scale_x_continuous(label = scales::dollar) +
    geom_text_repel(
        mapping = aes(label = year),
        segment.alpha = 0,
        family = "Gill Sans MT",
        size = 4
    ) +
    theme_minimal() +
    labs(
        x = NULL,
        y = NULL,
        title = "The Khmer Rouge Legacy",
        subtitle = "Charting Cambodian life expectancy and GDP over time, compared to global averages."
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )

global_avg <- gapminder %>%
    group_by(year) %>%
    summarize(m_lifeExp = mean(lifeExp), m_gdpPercap = mean(gdpPercap)) %>%
    select(year, m_lifeExp, m_gdpPercap)

plot_global <-
    ggplot(data = global_avg, mapping = aes(x = m_gdpPercap, y = m_lifeExp)) +
    geom_point() +
    geom_path(color = "#999999") +
    scale_x_continuous(label = scales::dollar) +
    geom_text_repel(
        mapping = aes(label = year),
        nudge_y = 0.7,
        nudge_x = -120,
        segment.alpha = 0,
        family = "Gill Sans MT",
        size = 4
    ) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        caption = "Source:  Gapminder data set, 2010"
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )

cowplot::plot_grid(plot_cambodia, plot_global, rel_heights = c(0.55, 0.45), ncol=1)
Figure: Comparing Cambodia to the rest of the world

We used relative heights of 55% versus 45% for this plot.  If you squeeze the world chart down, the line flattens out and distorts the image, so we want to keep these plots close in size.
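
The values in rel_heights are relative rather than absolute, so they do not have to sum to 1.  The sketch below uses two placeholder plots built from the mtcars data set that ships with R (they are stand-ins, not the charts above) to show that c(11, 9) describes the same split as c(0.55, 0.45).

```r
library(ggplot2)
library(cowplot)

# Two small placeholder plots built from the built-in mtcars data set.
top_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
bottom_plot <- ggplot(mtcars, aes(x = wt, y = hp)) + geom_point()

# Relative heights: dividing by the total gives each row's share.
rel <- c(11, 9)
rel / sum(rel)

stacked <- plot_grid(top_plot, bottom_plot, rel_heights = rel, ncol = 1)
```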

One thing I don’t like about this chart is that the year labels still end up overlapping the lines.  The ggrepel library will have text shift away from data points, but it doesn’t appear to prevent overlapping lines in a geom_path() geom.  I tried different nudge values but nothing quite worked right.

Keeping The Same X Axis

In this next chart, we’re going to look at Rwanda, another country which experienced a well-known genocide.  This time, instead of plotting both GDP per capita and life expectancy, we’re only going to look at life expectancy changes over time.  In the top chart, I’ll show Rwanda’s figures.  In the bottom chart, I’ll show a line chart with global averages over the same time frame.  Because we’ll use the same X axis, I don’t want two separate X axes for the two charts; I want them to blend.

plot_rwanda <- ggplot(data = filter(gapminder, country == "Rwanda"), mapping = aes(x = year, y = lifeExp)) +
    geom_point() +
    geom_line(color = "#999999") +
    theme_minimal() +
    labs(
        x = NULL,
        y = NULL,
        title = "The Rwandan Genocide",
        subtitle = "Charting Rwandan life expectancy over time, compared to the global average."
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank()
    )

plot_global <-
    ggplot(data = global_avg, mapping = aes(x = year, y = m_lifeExp)) +
    geom_point() +
    geom_line(color = "#999999") +
    theme_minimal() +
    labs(
        x = NULL,
        y = NULL,
        subtitle = "Global Average",
        caption = "Source:  Gapminder data set, 2010"
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )

cowplot::plot_grid(plot_rwanda, plot_global, rel_heights = c(0.55, 0.45), ncol=1)
Figure: Seeing the Rwandan genocide in stark contrast to global averages

There are a couple of changes here.  Because I have a consistent X axis, I removed the ticks from the top graph.  I also removed the text labels, as we are now showing year explicitly instead of implicitly through the data path.

Conclusion

This is only one of the uses for cowplot, but it’s a good one.  We are also not limited to two charts; we could just as easily stack any number of charts on top of one another and define relative sizes for each chart.  We can also combine cowplot’s plot_grid() with facet_wrap() to group together a set of charts and fit it in relationship to another chart.  This would be helpful if, say, we wanted to compare one country’s change in life expectancy over time to plots of similar countries’ changes over time.

We Speak Linux

I’m pleased to announce the launch of We Speak Linux, a site dedicated to helping Windows administrators and developers become familiar with Linux.  This has been Tracy Boggiano’s pet project for several months.  Along for the ride are Brian Carrig (who still needs to update his blog), Mark Wilkinson, Anthony Nocentino, and me.  I have the good fortune to work with Tracy, Brian, and Mark on a daily basis, but we haven’t recruited Anthony yet…

Why This?  Why Now?

All five of the founders are data platform professionals specializing in SQL Server.  Three years ago, that wouldn’t exactly have sounded like the kind of group to start up a Linux-oriented site, but then things started changing.  Now, it’s been a dream of mine to have SQL Server Management Studio on Linux (e.g., this post from 2014) so that I could dump Windows and go full-time with Linux on my computers, but I figured the chances of SQL on Linux happening were nil.  Then Scott Guthrie went and announced SQL Server on Linux in March of 2016.  As soon as we heard about it, Mark and I began conspiring to get involved in the program preview.  We caught the eye of the SQL Server Customer Advisory Team (SQLCAT) and had an opportunity in early 2017 to test a workload in Linux.  Seeing how serious the SQL Server team (especially Slava Oks) was sold it for me.  Half a year later, Microsoft released SQL Server 2017 on Linux.

Since then, Tracy and Anthony have been active in the SQL Server community, introducing Windows administrators and developers to Linux and showing that the transition from Windows to Linux for SQL Server is rather straightforward (especially for developers, who generally shouldn’t care what the underlying OS does).  We’re now taking this one step further.

What Are We Doing Here?

We Speak Linux is dedicated to building a virtual user group experience for Linux.  First, we’re hosting monthly webinars on various topics of interest.  Kellyn Pot’vin-Gorman will present at our inaugural event and we’re working on filling out the schedule for the rest of the year.

Second, we’ve set up a Slack for We Speak Linux.  Joining this Slack is easy; just fill in your e-mail address and we’ll get you an automatic invitation.

Third, we have a Twitter account.

From there, we have other things planned but I don’t want to spoil everything just yet.

What Can You Do?

Are you a Windows developer or administrator interested in learning about Linux?  Check out our upcoming webinars and as we get closer to go-live for the first webinar, there will be registration details.

Are you a seasoned Linux professional and willing to help Windows developers make the leap to Linux?  Contact us; we’d love to talk to you about doing a session.  This isn’t a SQL Server-specific group, so we want a broad range of talks.

ggplot Basics: Facets

This is part six of a series on ggplot2.

Up to this point, we’ve looked at single graphs.  But sometimes, a single graph can get a little too complicated for us.  Let’s go back to our gapminder data set showing data by continent:

Figure: The relationship between wealth and longevity across the world.

I’d like to see if these relationships hold within the five different continents.  I can easily change the R code to give me five smoothed lines, one per continent:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    geom_smooth(method = "lm", se = FALSE, mapping = aes(color = continent)) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    scale_x_log10(label = scales::dollar) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    guides(color = guide_legend(title = "Continent:")) +
    theme(
        legend.position = "bottom",
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )
Figure: That definitely cleared things up.

That’s pretty ugly.  How about instead, we show each as a separate plot?  We could write the R code to show each individually, but then we’d need to know about each category.  Instead, let’s use the facet functionality in ggplot:  facet_wrap() and facet_grid().

Facet Wrap

The facet_wrap() function lays out one panel after another, wrapping onto a new row when the current row fills up.  Because we’re only displaying two variables per scatter plot (continent now determines the panel rather than the color), we can remove the separate colors and go back to a single, consistent color for each graph.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10(label = scales::dollar) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    guides(color = guide_legend(title = "Continent:")) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    ) +
    facet_wrap(facets = ~continent, ncol = 3)
Figure: Using facet_wrap(), we can easily create independent but related graphs.

Notice that we create a graph per continent by setting facets = ~continent.  The tilde there is important—it’s a one-sided formula.  You could also write c("continent") if that’s clearer to you.
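
As a quick sketch of those alternatives, the three calls below should all produce one panel per continent.  The vars() form is a third spelling which I am adding here; it assumes ggplot2 3.0.0 or later.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5)

# Three ways to ask for one panel per continent.
by_formula <- base_plot + facet_wrap(~continent, ncol = 3)
by_string <- base_plot + facet_wrap("continent", ncol = 3)
by_vars <- base_plot + facet_wrap(vars(continent), ncol = 3)
```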

I also set the number of columns, guaranteeing that we see no more than 3 columns of grids. I could alternatively set nrow, which would guarantee we see no more than a certain number of rows.

There are a couple of other interesting features in facet_wrap(). For example, we can set scales = "free" if we want to draw each panel as if the others did not exist. By default, facet_wrap() uses scales = "fixed" to ensure that everything plots on the same scale. I prefer that for this exercise because it lets us more easily see those continental clusters.
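
For comparison, here is a stripped-down sketch of the free-scale version, with the labels and theme settings left out for brevity.  Each panel picks its own X and Y ranges, which makes patterns within a continent easier to see and comparisons between continents harder.

```r
library(ggplot2)
library(gapminder)

# Same scatter plot, but each panel gets its own x and y ranges.
free_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5) +
    scale_x_log10(labels = scales::dollar) +
    facet_wrap(~continent, ncol = 3, scales = "free")
free_plot
```

If you only want one axis to vary, facet_wrap() also accepts scales = "free_x" and scales = "free_y".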

Facet Grid

The facet_grid() function builds a matrix of panels.  Unlike facet_wrap(), there is no ncol or nrow parameter. Instead, we define the left-hand side (rows) or right-hand side (columns) of a formula to populate the grid.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5) +
    scale_x_log10(label = scales::dollar) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    guides(color = guide_legend(title = "Continent:")) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    ) +
    facet_grid(facets = continent~.)
Figure: A chart with grid faceting by row.

Note that I took the smoothed line off in this case. That way, we can more easily see the data points and not the line. I’ve got one variable of interest on the left-hand side—that is, one variable which defines the rows of this grid. Because the right-hand side is “everything else,” we can share the X axis for all of these grids. This particular setup lets us contrast PPP GDP by continent fairly easily.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5) +
    scale_x_log10(label = scales::dollar) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    guides(color = guide_legend(title = "Continent:")) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    ) +
    facet_grid(facets = .~continent)
Figure: A chart with grid faceting by column.

And here’s what happens when I put continent on the right-hand side. Now we have a shared Y axis, letting us see relative life expectancy clusters by continent.
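
Newer versions of ggplot2 also let you name the two sides explicitly instead of writing a formula.  The sketch below assumes ggplot2 3.0.0 or later, where facet_grid() takes rows and cols arguments, and leaves out the labels and theme settings for brevity.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5) +
    scale_x_log10(labels = scales::dollar)

# Same layout as continent ~ . : one row of the grid per continent.
by_row <- base_plot + facet_grid(rows = vars(continent))

# Same layout as . ~ continent : one column of the grid per continent.
by_column <- base_plot + facet_grid(cols = vars(continent))
```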

So what happens if we define both sides? Then we start building out our grid:

ggplot(data = filter(gapminder, year %in% c(1982, 2007)), mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5) +
    scale_x_log10(label = scales::dollar) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    guides(color = guide_legend(title = "Continent:")) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    ) +
    facet_grid(facets = year~continent)
Figure: A chart with grid faceting by row and by column.

In this example, I am looking at the years 1982 and 2007 and comparing life expectancy to income per continent—that is, four separate variables in one plot. It’s getting a bit too busy on this chart, but we can make out some trends, like a big boost in life expectancy across the board, but particularly in Asia.

Conclusion

Faceting is one way to introduce one or more “extra” variables into a plot.  By breaking data out into multiple, connected plots, we can make relationships clearer.  Doing so runs the risk of information overload, however:  if I try to fit 20 or 30 graphs on the same page, I’m probably going to be doing more confusing than elucidating.

In the next post, I’ll look at another way of arranging graphs using an external library.

ggplot Basics: Themes And Legends

This is part five of a series on ggplot2.

Today, we are going to spend some time on themes and legends in ggplot2.  This is where we can add a lot of polish to our graphs.

Legends

The guides() function gives us some control over how legends appear.  Let’s start with a graph which includes a single legend:

library(tidyverse)
library(gapminder)
library(ggrepel)

oddities <- gapminder %>%
    filter(gdpPercap > 75000 & lifeExp < 70) %>%
    group_by(country) %>%
    summarize(maxLifeExp = max(lifeExp)) %>%
    inner_join(gapminder, by = c("country" = "country", "maxLifeExp" = "lifeExp"))

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10(label = scales::dollar) +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = "Mean Life Expectancy",
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    annotate(
        geom = "text",
        x = 85000,
        y = 48.3,
        label = "High-GDP countries with\nunexpectedly low mean\nlife expectancy.",
        size = 3.5
    ) +
    annotate(
        geom = "rect",
        xmin = 75000,
        xmax = 130000,
        ymin = 53,
        ymax = 70,
        fill = "Red",
        alpha = 0.2
    ) +
    geom_text_repel(
        data = oddities,
        mapping = aes(x = gdpPercap, y = maxLifeExp, label = country),
        size = 2.3, segment.color = NA, nudge_x = 0
    )
Figure: Starting where we left off…

By default, the legend is on the right-hand side and is named after the variable.  That label is fine here, but oftentimes, you’ll want something a bit nicer.

First up, let’s use the guides() function to fuss with the guide.  Inside guides(), you can work with any individual legend on a plot.  Our continent legend is based on the color of data points—we defined that up in geom_point()—so we want to modify the guide associated with color.  To modify this, we use the guide_legend() function.  The guide_legend() function lets us set details on the legend, like how many rows or columns, what the legend title should be, label ordering, and positioning.

In our case, I’m going to add a colon to our title.  If I wanted to change the number of columns or number of rows used to display the legend, I could set ncol or nrow here, respectively.  But this legend looks alright as a single column–making it two columns doesn’t make it look better.
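
As a sketch of what that would look like, the stripped-down plot below retitles the color legend and asks for two columns of keys.  The annotations and labels from the full chart are left out for brevity, and the choice of two columns is only for illustration.

```r
library(ggplot2)
library(gapminder)

legend_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    scale_x_log10(labels = scales::dollar) +
    # Retitle the color legend and lay its five keys out in two columns.
    guides(color = guide_legend(title = "Continent:", ncol = 2))
legend_plot
```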

I would next like to show the continent list at the bottom of the graph rather than on the right-hand side.  To do this, we need to introduce the theme() function.  The theme() function is jam-packed with parameters, but we’re going to start with legend.position:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10(label = scales::dollar) +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = "Mean Life Expectancy",
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    annotate(
        geom = "text",
        x = 85000,
        y = 48.3,
        label = "High-GDP countries with\nunexpectedly low mean\nlife expectancy.",
        size = 3.5
    ) +
    annotate(
        geom = "rect",
        xmin = 75000,
        xmax = 130000,
        ymin = 53,
        ymax = 70,
        fill = "Red",
        alpha = 0.2
    ) +
    geom_text_repel(
        data = oddities,
        mapping = aes(x = gdpPercap, y = maxLifeExp, label = country),
        size = 2.3, segment.color = NA, nudge_x = 0
    ) +
    guides(color = guide_legend(title = "Continent:")) +
    theme(legend.position = "bottom")

We have moved the legend to the bottom and changed the title slightly.

There’s a lot more we can do with the theme() function around legends.  We can set background, spacing, alignment, direction, and justification for the title, the keys, and the boxes.  In this case, I’m going to leave well enough alone and move on to overall themes.
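For a flavor of those other legend settings, here is a minimal sketch of a theme() call that touches direction, title styling, and background.  The specific values are arbitrary choices for illustration rather than recommendations, and the chart in this post does not use them.

```r
library(ggplot2)

# Illustrative legend tweaks; every value below is an arbitrary example.
legend_tweaks <- theme(
    legend.position = "bottom",
    # Lay the keys out left to right.
    legend.direction = "horizontal",
    # Make the legend title bold.
    legend.title = element_text(face = "bold"),
    # Draw a light, borderless box behind the legend.
    legend.background = element_rect(fill = "grey95", color = NA)
)

# Add it to a plot the same way as an inline theme() call:
# some_plot + legend_tweaks
```

Storing the theme() result in a variable like this also makes it easy to reuse the same legend styling across several charts.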

Themes

There are several themes built into ggplot2, including theme_grey() [the default], theme_bw(), theme_classic(), and theme_minimal().  Of these four, my preference is for theme_minimal(), which is a minimalist approach to visualization.  The background is white rather than grey, there aren’t any borders or boxes, and it really makes your data the star of the show.

As an important note, you must put the theme before any modifications you want to make.  So in our case, theme_minimal() must go before my theme() function call; otherwise, theme_minimal() will override my choices and show the legend on the right-hand side.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10(label = scales::dollar) +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = "Mean Life Expectancy",
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    annotate(
        geom = "text",
        x = 85000,
        y = 48.3,
        label = "High-GDP countries with\nunexpectedly low mean\nlife expectancy.",
        size = 3.5
    ) +
    annotate(
        geom = "rect",
        xmin = 75000,
        xmax = 130000,
        ymin = 53,
        ymax = 70,
        fill = "Red",
        alpha = 0.2
    ) +
    geom_text_repel(
        data = oddities,
        mapping = aes(x = gdpPercap, y = maxLifeExp, label = country),
        size = 2.3, segment.color = NA, nudge_x = 0
    ) +
    theme_minimal() +
    guides(color = guide_legend(title = "Continent:")) +
    theme(legend.position = "bottom")

This minimalist theme helps us focus on the visual rather than the accouterments.
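To see why the ordering of theme_minimal() and theme() matters, here is a small sketch that adds the two pieces together in both orders, without any plot attached, and inspects the resulting legend position.  The variable names are mine; the behavior relies on theme_minimal() being a complete theme, which replaces whatever theme settings came before it.

```r
library(ggplot2)

# Complete theme first, tweak second: the tweak survives.
theme_then_tweak <- theme_minimal() + theme(legend.position = "bottom")

# Tweak first, complete theme second: theme_minimal() replaces the tweak.
tweak_then_theme <- theme(legend.position = "bottom") + theme_minimal()

print(theme_then_tweak$legend.position)
print(tweak_then_theme$legend.position)
```

The first value should come back as "bottom" and the second as "right", which is the default position that theme_minimal() carries with it.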

The ggthemes library gives us a couple dozen more themes, as well as some color and shape scales.  Let’s switch color palette and theme, using a colorblind-safe palette and the 538 theme.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    ggthemes::scale_color_colorblind() +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10(label = scales::dollar) +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = "Mean Life Expectancy",
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    annotate(
        geom = "text",
        x = 85000,
        y = 48.3,
        label = "High-GDP countries with\nunexpectedly low mean\nlife expectancy.",
        size = 3.5
    ) +
    annotate(
        geom = "rect",
        xmin = 75000,
        xmax = 130000,
        ymin = 53,
        ymax = 70,
        fill = "Red",
        alpha = 0.2
    ) +
    geom_text_repel(
        data = oddities,
        mapping = aes(x = gdpPercap, y = maxLifeExp, label = country),
        size = 2.3, segment.color = NA, nudge_x = 0
    ) +
    ggthemes::theme_fivethirtyeight() +
    guides(color = guide_legend(title = "Continent:")) +
    theme(legend.position = "bottom")

Reskinning our chart

It’s pretty easy to swap out themes and scales, and there are some nice themes in here.  Some of my favorites are theme_economist(), theme_wsj(), theme_fivethirtyeight(), and theme_few().

Custom Modifications

You are not limited to using defaults in your graphs.  Let’s go back to the minimal theme but change the fonts a bit.  I want to make the following changes:

  1. Use Gill Sans fonts instead of the default
  2. Increase the title font size a little bit
  3. Decrease the X axis font size a little bit
  4. Remove the Y axis label; the subtitle makes it clear what the Y axis contains

Fonts?  On Windows?

If you’re following along on a Windows box, you will inevitably hit one of my favorite errors:  “font family not found in Windows font database.”

It turns out that the way R on Windows works with fonts is a bit different than on macOS or Linux.  If you stick to the default fonts, you’re okay, but as soon as you want to start doing anything fancy, you get stuck in font purgatory.

There are a few ways to try to solve this problem.  I’ve tried most of them with mixed success.  Most of them involve loading the extrafont library.  Then, import your fonts and load the Windows fonts.  Note that font import takes a while—it took 5-10 minutes on my machines.

install.packages("extrafont")
library(extrafont)
font_import()
loadfonts(device="win") #load Windows-specific fonts

This is definitely a place where R on Linux/Mac is superior to R on Windows.
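One way to soften the pain is to check whether the font is registered before asking for it, and fall back to a default otherwise.  The helper below is a hypothetical sketch, not something from extrafont itself: pick_font_family() is a name I made up, and it assumes extrafont::fonts() returns the font families you have already imported.

```r
# Hypothetical helper: return `preferred` if it is registered, else `fallback`.
# `available` can be passed in directly; otherwise it is looked up via extrafont.
pick_font_family <- function(preferred, fallback = "sans", available = NULL) {
    if (is.null(available)) {
        available <- if (requireNamespace("extrafont", quietly = TRUE)) {
            # Families imported earlier with font_import(); may be empty.
            extrafont::fonts()
        } else {
            character(0)
        }
    }
    if (preferred %in% available) preferred else fallback
}

# Falls back to the generic "sans" family when Gill Sans MT is not registered.
chart_font <- pick_font_family("Gill Sans MT")
```

You could then pass chart_font wherever the code below hard-codes family = "Gill Sans MT", so the chart still renders on a machine without the font.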

The Changes

With that sidebar out of the way, let’s look at our new graph:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    geom_smooth(method = "lm", se = FALSE, color = "#777777") +
    scale_x_log10(label = scales::dollar) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        title = "Wealth And Longevity",
        subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
        caption = "Source:  Gapminder data set, 2010",
        color = "Continent"
    ) +
    annotate(
        geom = "text",
        x = 85000,
        y = 48.3,
        label = "High-GDP countries with\nunexpectedly low mean\nlife expectancy.",
        size = 3.5,
        family = "Gill Sans MT"
    ) +
    annotate(
        geom = "rect",
        xmin = 75000,
        xmax = 130000,
        ymin = 53,
        ymax = 70,
        fill = "Red",
        alpha = 0.2
    ) +
    geom_text_repel(
        data = oddities,
        mapping = aes(x = gdpPercap, y = maxLifeExp, label = country),
        size = 3, segment.color = NA, nudge_x = 0, family = "Gill Sans MT"
    ) +
    guides(color = guide_legend(title = "Continent:")) +
    theme(
        legend.position = "bottom",
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )

A final version of the graph.

I made a few changes here. You can see that I added family = “Gill Sans MT” to several spots. This changes the font from the default sans-serif font to the Gill Sans MT font family. This is a smaller sans-serif font, so I bumped up the size of the title and subtitle in the theme() function to set them off a bit relative to the X axis font size. I also changed the geom_smooth line color to gray, so that it’s a little easier to focus on the distribution of dots rather than the line itself.

At this point, we have a publication-quality graph.  Well done!

Creating Another Publication-Worthy Graph

Let’s go back to the time series graph from before:

time_frame <- Sys.Date() - 0:31
df <- data.frame(
    date = time_frame,
    price = runif(32)
)
annotations <- data.frame(
    date = c(Sys.Date() - 24, Sys.Date() - 3),
    remark = c("Problem reported", "False alarm reported")
)

ggplot(df, aes(date, price)) +
    geom_line() +
    scale_x_date(date_breaks = "8 days", date_minor_breaks = "2 day") +
    geom_vline(xintercept = as.numeric(annotations$date), color = "Red", linetype = "dotdash") +
    geom_text(
        data = annotations,
        mapping = aes(x = date, y = 0, label = remark),
        color = "Red"
    )

A half-finished graph

We can take what we know and turn this into a fully finished graph.

ggplot(df, aes(date, price)) +
    geom_line() +
    scale_x_date(date_breaks = "8 days", date_minor_breaks = "2 day") +
    geom_vline(xintercept = as.numeric(annotations$date), color = "Red", linetype = "dotdash") +
    geom_text_repel(
        data = annotations,
        mapping = aes(x = date, y = 0, label = remark),
        color = "Red",
        nudge_x = -2
    ) +
    theme_minimal() +
    labs(
        x = NULL,
        y = "Price",
        title = "Widget Price Changes",
        caption = "Source:  Vital Corporate Data Set"
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.caption = element_text(size = 9),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)
    )

A nicer version of the graph

There are a few changes here.  I used the same theme and title scheme as before, so these two graphs could fit together as part of the same report.  I switched from geom_text to ggrepel’s geom_text_repel and used the nudge_x parameter to shift the text over so that it did not overlap with the vertical line.  Just making a few simple changes is enough to turn a graph from half-finished to something with a lot more polish.

ggplot Basics: Labels And Annotations

This is part four of a series on ggplot2.

Last time around, we looked at how to use scales and coordinates to clean up charts.  Today, we’re going to dig into labels and annotations, two vital parts of creating aesthetically pleasing graphs.

Labels

In ggplot2, we use the labs function to modify labels.  By labels, I’m including the labels for the X and Y axes, titles and subtitles, captions, and the header for a legend.

Let’s start with an image that we’ve already seen before:

if(!require(tidyverse)) {
  install.packages("tidyverse", repos = "http://cran.us.r-project.org")
  library(tidyverse)
}

if(!require(gapminder)) {
  install.packages("gapminder", repos = "http://cran.us.r-project.org")
  library(gapminder)
}

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(color = continent)) +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  geom_smooth(method = "lm", se = FALSE)

An example of an image that we’ve created.

Aside from our scale problem (that we’ll fix…again), I don’t like having “log(gdpPercap)” and “lifeExp” be the X and Y axis labels.  I guess the term “continent” is okay but I’d prefer it be capitalized.  We should also create a title so people know what this visual represents, and I’d like to reference that the source is the gapminder data set.  We can do pretty much all of this in one extra function call.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(color = continent)) +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(label = scales::dollar) +
  labs(
    x = "GDP (PPP, normalized to 2005 USD)",
    y = "Mean Life Expectancy",
    title = "Wealth And Longevity",
    subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
    caption = "Source:  Gapminder data set, 2010",
    color = "Continent"
  )

A chart with title, subtitle, labeled axes, and labeled legend.

And just like that, we have a nicer-looking visual.

So what else can we do?  If you set x = NULL or y = NULL, then the X or Y axis, respectively, will no longer have a label.  This makes sense when laying out a bar or column chart, like our chart of data by continent:

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, avg_lifeExp), y = avg_lifeExp)) +
  geom_col() +
  labs(
    x = NULL,
    y = "Mean Life Expectancy"
  )

A chart with no X axis label.

It’s clear that we’re showing continents, so I do not need to tell you that.  I also changed the Y axis label to be something better than “avg_lifeExp.”
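The lifeExp_by_continent_1952 data frame comes from an earlier post in the series and is not defined here.  If you are jumping in at this point, the sketch below is my assumption of how it is built, based only on its name and the avg_lifeExp column used above: the mean life expectancy per continent for 1952.

```r
library(dplyr)
library(gapminder)

# Assumed definition: mean life expectancy per continent for the year 1952.
lifeExp_by_continent_1952 <- gapminder %>%
    filter(year == 1952) %>%
    group_by(continent) %>%
    summarize(avg_lifeExp = mean(lifeExp))

print(lifeExp_by_continent_1952)
```

This yields one row per continent, which is what the reorder(continent, avg_lifeExp) call in the chart needs in order to sort the columns.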

There isn’t much else that you can do with the label itself.  There’s plenty you can do with themes, and we’ll cover that in detail in an upcoming post.  But at least we now have the ability to create nicer-looking labels.  Now let’s look at annotations.

Annotations

Annotations are useful for marking out important comments in your visual.  For example, going back to our wealth and longevity chart, there was a group of Asian countries with extremely high GDP but relatively low average life expectancy.  I’d like to call out that section of the visual and will use an annotation to do so.  To do this, I use the annotate() function.  In this case, I’m going to create a text annotation as well as a rectangle annotation so you can see exactly the points I mean.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(color = continent)) +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(label = scales::dollar) +
  labs(
    x = "GDP (PPP, normalized to 2005 USD)",
    y = "Mean Life Expectancy",
    title = "Wealth And Longevity",
    subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
    caption = "Source:  Gapminder data set, 2010",
    color = "Continent"
  ) +
  annotate(
    geom = "text",
    x = 85000,
    y = 48.3,
    label = "High-GDP countries with\nunexpectedly low mean\nlife expectancy.",
    size = 3.5
  ) +
  annotate(
    geom = "rect",
    xmin = 75000,
    xmax = 130000,
    ymin = 53,
    ymax = 70,
    alpha = 0.2
  )

Adding an annotation to call out a section of our visual.

We’ve added about 15 lines of code (because I’m keeping things nice and readable for the annotations) but it’s still just two function calls.  Also, notice that the text geom uses x and y parameters, whereas the rectangle uses xmin+xmax and ymin+ymax.  Finally, I want to point out that the x and y values are at the same scale as on our graph—it’s not pixels over or anything weird like that.

More Geoms:  Text, HLine, VLine

I skipped four geometric objects early on:  text, rect, hline, and vline.  I skipped them because they fit better here, with labeling and annotating data.  We can use the text geom to write text onto our visual, rect will let us draw a rectangle, and the hline and vline geoms will draw horizontal and vertical lines, respectively.  In this post, I’m going to look at a variant on text, as well as horizontal and vertical lines.  I’ll skip geom_rect, but it behaves similarly to other geoms that we’ve seen so far.

So let’s start with the text geom.  What if I want to label each country in that high-GDP range above?  I could go look it up manually and annotate each point, but that’s a brittle solution which really only works on this chart and this data (sort of like my annotation…).  For something a little more robust, let’s use geom_text.  Specifically, I’m going to use geom_text_repel(), which is a function that comes with ggrepel.  The ggrepel package is an addon for ggplot2 which figures out ways to add text onto a graph without overlap.  Visuals with overlapping text look amateur, and we’re trying to avoid looking more amateur than we already do…

Specifically, I want to show the name of each country in the red box, but I only want to name each country once, and I want to put the label somewhere near the point with the highest life expectancy within that block.  To do that, I’m going to use geom_text_repel but I also need to build up a data set with the distinct countries in that range.

if(!require(ggrepel)) {
  install.packages("ggrepel", repos = "http://cran.us.r-project.org")
  library(ggrepel)
}

oddities <- gapminder %>%
  filter(gdpPercap > 75000 & lifeExp < 70) %>%
  group_by(country) %>%
  summarize(maxLifeExp = max(lifeExp)) %>%
  inner_join(gapminder, by = c("country" = "country", "maxLifeExp" = "lifeExp"))

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(color = continent)) +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(label = scales::dollar) +
  labs(
    x = "GDP (PPP, normalized to 2005 USD)",
    y = "Mean Life Expectancy",
    title = "Wealth And Longevity",
    subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
    caption = "Source:  Gapminder data set, 2010",
    color = "Continent"
  ) +
  annotate(
    geom = "text",
    x = 85000,
    y = 48.3,
    label = "High-GDP countries with\nunexpectedly low mean\nlife expectancy.",
    size = 3.5
  ) +
  annotate(
    geom = "rect",
    xmin = 75000,
    xmax = 130000,
    ymin = 53,
    ymax = 70,
    fill = "Red",
    alpha = 0.2
  ) +
  geom_text_repel(
    data = oddities,
    mapping = aes(x = gdpPercap, y = maxLifeExp, label = country),
    size = 2.3, segment.color = NA, nudge_x = 0
  )

Adding a label for countries in the odd zone.

We’re getting more and more code, but our visual is also a bit more complex.  The first bit, after installing/loading ggrepel, is to build up the data frame of odd countries.  It turns out that there’s just one odd country in this set:  Kuwait, which had extremely high GDP from the 1950s through 1970s, but low life expectancy.  Kuwait is a special case in this data set for a couple of reasons:  their population was extremely low in the 1950s and there was limited ownership of the sole interesting resource in the country.  The first factor means that the mean GDP (which is our per capita calculation) is through the roof; the second factor means that most people were not direct recipients of that benefit.

Regardless of the explanation, we now have annotations and labels for those extreme outliers.
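If you want to check the claim that Kuwait is the only country in the box, a quick sketch is to apply the same filter used for oddities and reduce it to distinct country names.  The odd_countries name is mine and is only there for illustration.

```r
library(dplyr)
library(gapminder)

# Same filter as the oddities data frame, reduced to distinct country names.
odd_countries <- gapminder %>%
    filter(gdpPercap > 75000 & lifeExp < 70) %>%
    distinct(country) %>%
    pull(country) %>%
    as.character()

print(odd_countries)
```

Because country is a factor in the gapminder data set, the as.character() call at the end keeps the result a plain character vector rather than a factor carrying every country as a level.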

We can also draw vertical and horizontal lines on the graph using the geom_vline and geom_hline functions, respectively.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(color = continent)) +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(label = scales::dollar) +
  labs(
    x = "GDP (PPP, normalized to 2005 USD)",
    y = "Mean Life Expectancy",
    title = "Wealth And Longevity",
    subtitle = "Charting the relationship between a country's prosperity and its residents' life expectancy.",
    caption = "Source:  Gapminder data set, 2010",
    color = "Continent"
  ) +
  annotate(
    geom = "text",
    x = 85000,
    y = 48.3,
    label = "High-GDP countries with\nunexpectedly low mean\nlife expectancy.",
    size = 3.5
  ) +
  annotate(
    geom = "rect",
    xmin = 75000,
    xmax = 130000,
    ymin = 53,
    ymax = 70,
    fill = "Red",
    alpha = 0.2
  ) +
  geom_text_repel(
    data = oddities,
    mapping = aes(x = gdpPercap, y = maxLifeExp, label = country),
    size = 2.3, segment.color = NA, nudge_x = 0
  ) +
  geom_vline(xintercept = 35000, color = "red", linetype = "dotdash") +
  geom_hline(yintercept = 75, color = "brown", linetype = "twodash")

Adding horizontal and vertical lines to a graph.

In this case, I decided to show off a few features of hline and vline, filling in the color and linetype parameters.  Those are not required and have sensible defaults of black and solid, respectively.  You can also make the lines thicker by setting the size parameter (named linewidth as of ggplot2 3.4.0) to something greater than the default of 0.5, but this is already garish enough.  Horizontal and vertical lines can be useful, particularly in time series data.  For example, let’s go back to the datetime example we had in a previous post.  Suppose that we have a set of data but also a separate data frame with days where something interesting happened.  We can annotate those interesting things on our plot pretty easily:

time_frame <- Sys.Date() - 0:31
df <- data.frame(
  date = time_frame,
  price = runif(32)
)
annotations <- data.frame(
  date = c(Sys.Date() - 24, Sys.Date() - 3),
  remark = c("Problem reported", "False alarm reported")
)

ggplot(df, aes(date, price)) +
  geom_line() +
  scale_x_date(date_breaks = "8 days", date_minor_breaks = "2 day") +
  geom_vline(xintercept = as.numeric(annotations$date), color = "Red", linetype = "dotdash") +
  geom_text(
    data = annotations,
    mapping = aes(x = date, y = 0, label = remark),
    color = "Red"
  )

That’s the last time I trust a man nicknamed False Alarm Fred.

In this case, we had two separate incidents occur, one a problem and one a false alarm.  We plotted a vertical line at the appropriate date and also wrote out the remark text.  Note that to map the date, I needed to call the as.numeric function to transform the date into a number.  That’s a little weird but understandable when you figure that you’re trying to plot on a Cartesian space and that means real numbers rather than dates.
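The as.numeric conversion is less mysterious once you see what it returns.  As a small sketch, R stores a Date as the number of days since January 1, 1970, so the conversion simply exposes that count; the variable names below are only for illustration.

```r
# A Date is stored internally as the number of days since 1970-01-01.
epoch_day <- as.numeric(as.Date("1970-01-01"))
nine_days_later <- as.numeric(as.Date("1970-01-10"))

print(epoch_day)
print(nine_days_later)

# Going back from a number to a Date requires stating the origin.
round_trip <- as.Date(nine_days_later, origin = "1970-01-01")
print(round_trip)
```

Here epoch_day is 0 and nine_days_later is 9, so the xintercept values passed to geom_vline are just positions along that day-count axis.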

Conclusion

In today’s post, we looked at using labels and annotations to spruce up visuals.  We saw how to create a visual with proper labels, a title, and even a caption with our data source.  We also learned how to draw text on a plot, annotate sections of a plot, and draw horizontal and vertical lines.

In the next post, we will look at how to view different subsets of chart data using facets.  Stay tuned!