ggplot Basics: Scales And Coordinates

This is part three of a series on ggplot2.

In yesterday’s post, I built a number of charts in ggplot2.  When I plotted the log of a variable, I mentioned that the way I did it wasn’t the best.  Today, we’re going to look at the right way to do it.

Scale

The first thing we’re going to look at is scaling our data.  Here’s a plot showing the relationship between GDP per capita and mean life expectancy:

Plotting mean life expectancy versus the log of per-capita GDP.

One problem with this visual is that we are making our users think a little too much.  People have trouble thinking in logarithmic terms.  If I tell you that the natural log of a value is 8.29, you probably won’t know that the value is 3983.83 without busting out a calculator.  But that’s what I’m making people do with this chart.  So let’s fix that with a scale.

if(!require(tidyverse)) {
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    library(tidyverse)
}
 
if(!require(gapminder)) {
    install.packages("gapminder", repos = "http://cran.us.r-project.org")
    library(gapminder)
}

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10()

Using the built-in scale function instead of having us calculate the logs.

By adding one line of code, we changed the scale on the X axis from continuous to logarithmic in base 10.  That gives us numbers on the X axis that we can immediately understand:  1e4, or $10,000.  But, uh, maybe I want to see $10,000 instead of 1e+04?  Fortunately, there is a label parameter on the scale that lets us set a label.  The scales package in R (part of the tidyverse) gives us a set of pre-packaged labels, including USD and other currency formats.  This is what the call looks like:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10(labels = scales::dollar)

A useful X axis; how droll.

And now we have a version where my users don’t have to think hard about what those values mean.

Going Deeper With Scale

To understand the ggplot scale better, let’s take a look at what functions are available to us.

The quick summary is that there are two parts of most scale functions.  The first part describes what we want to scale, and the second part describes how we want to scale it.

First, the whats:

  1. alpha — Using alpha transparency levels to differentiate categories
  2. color — Using a color scale as a way to differentiate categories
  3. fill — Using a color fill as a way of describing a variable
  4. linetype — Using the line type (e.g., solid line, dotted line, dashed line) to differentiate categories
  5. shape — Using a shape (e.g., circle, triangle, square) to differentiate categories
  6. size — Using the size of a shape to differentiate categories
  7. x — Change the scale of the X axis
  8. y — Change the scale of the Y axis

Next, the hows, which I’ll break up into two categories.  The first category is the “differentiation” hows, which handle alpha, color, fill, linetype, shape, and size:

  1. Continuous
  2. Discrete
  3. Brewer
  4. Distiller
  5. Gradient / Gradient2 / Gradientn
  6. Grey
  7. Hue
  8. Identity
  9. Manual

And here are the X-Y hows:

  1. Continuous
  2. Discrete
  3. Log10
  4. Reverse
  5. Sqrt
  6. Date / Time / Datetime

There are a few scale functions which don’t fit this pattern (scale_radius) and a couple which have “default” how values (scale_alpha, scale_size, scale_shape).  Also, not all whats intersect with hows:  for example, there is no scale_shape_continuous or scale_size_hue because those combinations don’t make sense.
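
To make the what/how naming pattern concrete, here is a hedged sketch that combines a few of these scales in a single plot.  The specific hex colors and size range are placeholder values I chose for illustration, not anything prescribed by ggplot2.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(mapping = aes(color = continent, size = pop), alpha = 0.3) +
    scale_x_log10(labels = scales::dollar) +                    # what = x, how = log10
    scale_color_manual(values = c("Africa" = "#1b9e77",         # what = color, how = manual
                                  "Americas" = "#d95f02",
                                  "Asia" = "#7570b3",
                                  "Europe" = "#e7298a",
                                  "Oceania" = "#66a61e")) +
    scale_size_continuous(range = c(1, 6))                      # what = size, how = continuous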

Now let’s dig into these a bit more and see what we can find.

X and Y Scales Galore

We’ve already seen scale_x_log10(), which converts the X axis to a base-10 logarithmic scale.  It turns out that this is just scale_x_continuous() with a log10 transformation applied.  So we can re-write it as:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_continuous(trans = "log10", labels = scales::dollar)

There are approximately 15 transformations and you can build your own if you’d like.  For the most part, however, you’re probably going to use the base scale or one of the most common transformations, which have their own functions:  scale_x_log10, scale_x_sqrt, and scale_x_reverse.
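
As a hedged illustration of building your own transformation, here is what a hand-rolled base-2 log transformation might look like using scales::trans_new().  The name log2_custom is just for this example; scales already ships a log2 transformation, so you would not normally need to write this yourself.

library(scales)

# A hand-rolled base-2 log transformation.  This is purely illustrative, since
# scales::log2_trans() already exists.
log2_custom <- trans_new(
    name = "log2_custom",
    transform = function(x) log2(x),
    inverse = function(x) 2 ^ x
)

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_continuous(trans = log2_custom, labels = scales::dollar)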

We can also handle dates and times in ggplot2.  Looking at the example for scaling dates, let’s start with the following code:

time_frame <- Sys.Date() - 0:31
df <- data.frame(
    date = time_frame,
    price = runif(32)
)
ggplot(df, aes(date, price)) +
    geom_line()

By default, ggplot2 handles dates sensibly.

This graph works, and you can see that it labels each week.  If we slap on scale_x_date(), this doesn’t change—that tells us that ggplot2 will use the default date scale for a date type; the scale_x_date function is there if we want to make modifications.

For example, suppose we have an event with a periodicity of eight days—like you send a crew out to a site for eight days of work and then send out the next crew and ship the last one back.  We can use the date_breaks parameter to show date breaks every eight days instead of the default:

ggplot(df, aes(date, price)) +
    geom_line() +
    scale_x_date(date_breaks = "8 days")

Make our own periodicity.

On this graph, you can see white lines that split our dates but don’t have labels; these are called minor breaks.  By default, there is one minor break halfway between each pair of major breaks.  Let’s say that instead, we want to show a minor break every 2 days.  We can use the date_minor_breaks parameter to set those.

ggplot(df, aes(date, price)) +
    geom_line() +
    scale_x_date(date_breaks = "8 days", date_minor_breaks = "2 day")

Periods within periods.

Finally, let’s say that we want to look at the middle two segments of our 32-day period.  We can use the limits parameter to define how much of the space we’d like to see:

ggplot(df, aes(date, price)) +
    geom_line() +
    scale_x_date(date_breaks = "8 days", date_minor_breaks = "2 day", limits = c(Sys.Date() - 24, Sys.Date() - 8))

Focusing in on a particular range.

Brewing Colors

Another area of intrigue is coloration.  ggplot2 will give you some colors by default, but you may not want to use them.  You can specify your own colors if you’d like, or you can ask Color Brewer for help.  For example, suppose I want to segment my gapminder data by continent, displaying each continent as a different color.  I can use the scale_color_brewer function to generate an appropriate set of colors for me, and it adds just one more line of code.

This function has two important parameters:  the type of data and the palette you wish to use.  By default, ggplot2 assumes that you’re sending sequential data.  You can also tell it that you are graphing divergent data (commonly seen in two-party electoral maps as the percentage margin of victory for each candidate) or that you have qualitative data (typically unordered categorical data).  In this case, I’m going to show all three even though the data doesn’t really fit two of them.

First up, here’s a sequential palette of various greens, starting from the lightest green and going darker based on continent name.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(mapping = aes(color = continent)) +
    scale_color_brewer(type = "seq", palette = "Greens") +
    geom_smooth(method = "lm", se = FALSE)

A sequential color map.

Next up, here’s what it looks like if you use a divergent color scheme, where names closer to A have shades of orange and names closer to Z have shades of purple.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(mapping = aes(color = continent)) +
    scale_color_brewer(type = "div", palette = "PuOr") +
    geom_smooth(method = "lm", se = FALSE)

Divergent colors, ranging from orange to purple.

Finally, we have a qualitative color scheme, which actually matches our data.  The five continents aren’t really continuous, so we’d want five different and unique colors to show our results.  Note that I’ve re-introduced alpha values here because these are solid colors and I want to be able to see some amount of interplay:

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    geom_smooth(method = "lm", se = FALSE)

Qualitative coloration fits our data set the best.

Note that all of this above uses the scale_color_brewer function because we’re colorizing points.  If you want to colorize a bar graph or some other 2D structure, you’ll want to use scale_fill_brewer to colorize the filled-in portion and scale_color_brewer if for some reason you’d like the outline to be a different color.

For example, here is a bar chart of life expectancy by continent in 1952.  I’m setting the color to continent and have set the overall fill to white so you can see the coloration.

lifeExp_by_continent_1952 <- gapminder %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp)) %>%
  select(continent, avg_lifeExp)

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, desc(avg_lifeExp)), y = avg_lifeExp)) +
  geom_col(mapping = aes(color = continent), fill = "White") +
  scale_color_brewer(type = "seq", palette = "Greens")

The color parameter on a 2D object like a bar only colors the outline of the object.

That doesn’t look like what we intended.  This is more like it:

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, desc(avg_lifeExp)), y = avg_lifeExp)) +
  geom_col(mapping = aes(fill = reorder(continent, desc(avg_lifeExp)))) +
  scale_fill_brewer(type = "seq", palette = "Greens", direction = -1)

Building a color scheme to fit our chart.

Note that I re-used the continent order to define colors in terms of mean life expectancy rather than alphabetically.  I also set the direction parameter on scale_fill_brewer to -1, which reverses the color order.  By default, Color Brewer palettes go from light to dark, but here I want them to go from dark to light, so I reversed the direction.

Shapes And Sizes

You can plot data according to shape and size as well.  As far as shape goes, there are only a few options.  By default, we can have our scatter plot points show continents as different shapes using the following code:

ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(shape = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar)

Using shapes to differentiate continents.

We can also choose whether to use solid or hollow shapes with the solid flag on scale_shape:

ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(shape = continent)) +
  scale_shape(solid = FALSE) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar)

Using hollowed-out shapes to differentiate continents.

You can also set the size of a point using the size attribute, and can use scale_size to control the size.  Here’s an example where we increase in size based on continent:

ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(size = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar)

Using size to differentiate continents. Not really a good idea at all, but you can do it.

Shapes tend to show up in black-and-white graphs of categorical data—like our continents—and sizes tend to show up with continuous variables.  In fact, when you try to run the code above, you get a warning:  “Using size for a discrete variable is not advised.”  It’s a bad practice, but it is something that you can do if you really want to.

Coordinates

The other thing I want to cover today is coordinate systems.  The ggplot2 documentation shows seven coordinate functions.  There are good reasons to use each, but I’m only going to demonstrate one.  By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space.  This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the xlim and ylim parameters.
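
Here is a quick sketch of that zooming behavior; the specific limits are arbitrary values I picked for illustration.  Unlike setting limits on a scale, coord_cartesian() only narrows the viewing window, so the smoother is still fit against the full data set.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    coord_cartesian(xlim = c(0, 10000), ylim = c(40, 85))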

The special coordinate system I want to point out is coord_flip, which flips the X and Y axes.  This allows us, for example, to turn a column chart into a bar chart.  Taking our life expectancy by continent data, I can create a bar chart, whereas before we’ve been looking at column charts.  I can use coord_flip to switch the X and Y axes:

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, avg_lifeExp), y = avg_lifeExp)) +
  geom_col() +
  coord_flip()

Creating a bar chart instead of a column chart.

Now we have a bar chart.  With coord_flip(), we can easily create bar charts or Cleveland dot plots.

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, avg_lifeExp), y = avg_lifeExp)) +
  geom_point(size = 4) +
  coord_flip()

A sample Cleveland dot plot.

Conclusion

In today’s post, we looked at some of the more common scale and coordinate functions in ggplot2.  There are quite a few that I did not cover, but I think this gives us a pretty fair idea of what we can create from this library.   In the next post, I will look at labels and annotations.


ggplot Basics: Mappings And Geoms

This is part two of a series on ggplot2.

In today’s post, we will look at some of the basics of ggplot.  As mentioned in the previous post, ggplot has a number of layers.  In today’s post, we will look at two of these layers:  the basic mapping layer and the geometric object (geom) layer.

Loading Libraries

The first step is to load the libraries that we’ll use today.  I’m using the tidyverse library, which includes a number of useful packages including ggplot2.  I’m also loading the gapminder library, which has interesting periodic, cross-country data covering the years 1952 through 2007.

if(!require(tidyverse)) {
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    library(tidyverse)
}

if(!require(gapminder)) {
    install.packages("gapminder", repos = "http://cran.us.r-project.org")
    library(gapminder)
}

Once I have these libraries loaded, I can start messing around with the data.

Let’s first take a quick look at what’s in the gapminder data frame, using the head and glimpse functions:
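
Here is a minimal sketch of those two calls, assuming the tidyverse and gapminder libraries from above are loaded:

head(gapminder)      # the first six rows of the data frame
glimpse(gapminder)   # column names, column types, and a preview of the values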

A quick glimpse at the gapminder data set.

We can see that there are three descriptive variables—country, continent and year—and three measures of interest—life expectancy, population, and GDP per capita.  The gapminder data set collects data for countries around the world from the period 1952 through 2007.

Mapping Data

The first question I want to ask is, what was the average life expectancy by continent in 1952?  I’m imagining a column chart to display this data (for a brief review of column charts, check out my post on various types of visuals).

The first thing we want to do is to get our data into the right format. We’ll see some of ggplot2’s ability to reshape data soon, but I want to start by feeding it the final data set, as that makes it easier for us to follow.

lifeExp_by_continent_1952 <- gapminder %>%
    filter(year == 1952) %>%
    group_by(continent) %>%
    summarize(avg_lifeExp = mean(lifeExp)) %>%
    select(continent, avg_lifeExp)

Now that we have our source data prepared, we can get to work on plotting. The first thing we need is a mapping. Here's the mapping that we're going to use:

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = continent, y = avg_lifeExp))

We have two parameters that we're passing into the ggplot function: data and mapping. We can specify the data frame that we're using as the data parameter; doing so means that we do not need to keep specifying it down the line. The mapping, as you'll recall, lets us represent parts of the graph. Specifically, we're going to define an aesthetic which lays out what the X and Y axes contain: the X axis will list the individual continents, and the Y axis will cover average life expectancy.

If we run that first line of code, we already have a visual:

What we get when we specify a mapping with X and Y values.

Yeah, it’s not a very useful visual, but it’s a start.

Adding Geometric Objects

The next step is to add a geometric object.  I mentioned that I want a column for each continent, and that column’s value represents the average life expectancy.  This leads to our first geom:  the column.

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = continent, y = avg_lifeExp)) +
    geom_col()

We can display a simple column chart in two lines of code.

This gives us a column chart. The order of the columns is alphabetical, which could be the way that we want to display the data, but probably isn't. We probably want to reorder the X axis by average life expectancy descending, and that's what the reorder function lets us do.

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, desc(avg_lifeExp)), y = avg_lifeExp)) +
    geom_col()

Reordering the column chart by mean life expectancy helps us make better sense of the data.

This might not be absolutely necessary for this particular visual, but it’s a good principle to follow and definitely helps us when there are more categories or several categories which are very close in mean life expectancy.

More Geometric Objects

We’re going to look at a few more geoms.  If you want to see even more, check out the ggplot2 cheat sheet or sape’s geom reference.

Scatter Plots And Smoothers

Let’s say that I want to test a conjecture that higher GDP per capita (measured here in USD) correlates with higher life expectancy.  I can plot out GDP per capita versus life expectancy pretty easily.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point()

Comparing GDP to average life expectancy.

Well, that’s not extremely clear…  But I made a mistake that should be clear to people who have done data analytics on economic data:  money typically should be expressed as a logarithmic function.  There’s a good way to do this in ggplot2, but I’m going to cover the bad method today and save the good method for the next post in the series.  The bad method is to modify the X-axis variable and call the log function:

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point()

Comparing the log of GDP to mean life expectancy.

Now we see a much clearer relationship between the log of GDP per capita and average life expectancy.  It’s not a perfect relationship, but there’s definitely a positive line that we could draw.  So let’s draw that positive line!

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

Adding a regression line to our scatter plot.

We have used a new geom here, geom_smooth.  The geom_smooth function creates a smoothed conditional mean.  Basically, we’re drawing some line as a result of a function based on this input data.  Notice that there are two parameters that I set:  method and se.  The method parameter tells the function which method to use.  There are five methods available, including using a Generalized Additive Model (gam), Locally Weighted Scatterplot Smoothing (loess), and three varieties of Linear Models (lm, glm, and rlm).  The se parameter controls whether we want to see the standard error bar.  Here’s our graph with the standard error bar turned on:

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE)

Showing the standard error bar on our smoothed conditional mean geom.

This visual also represents the first time that we’ve created something complex:  a chart with dots as well as a line.  It was really easy to create because we can lay out the two layers independently of one another:  I can have geom_point() without geom_smooth() or vice versa, so if I need to work on one layer, I can comment out the other and hide it until I’m ready.  This also allowed us to step through the visual iteratively.

Let’s turn off standard errors again and look at the scatter plot.  One trick we can use to see the line more clearly is to change the alpha channel for our scatter plot dots.  We can use the alpha parameter on geom_point to do just this.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE)

Turning down the alpha channel on our scatter plot makes it easier to see overlap.

Now we can see the line more clearly without losing the scatter plot.  This has a second beneficial effect for us:  there was some overplotting of dots, where several country-year combos had roughly the same GDP and life expectancy.  By toning down the alpha channel a bit, we can see the overlap much more clearly.

Zooming in a bit, let’s filter down to one country, Germany.  We could create a new data frame with just the Germany data, but we don’t need to do that.  We can simply use the filter function in dplyr and get data for Germany:

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point()

Filtering down to just one country.

Note that I switched back to normal format for GDP per capita so we can see dollar amounts.

Other Charts

Creating a line chart is pretty easy as well.  Let’s graph GDP per capita in Germany from 1952 to 2007.

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) +
    geom_line()

A line chart of Germany’s GDP over time.

We can easily switch it over to an area chart with geom_area:

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) +
    geom_area(alpha = 0.4)

An area chart of Germany’s GDP over time.

Note that I’ve set the alpha on geom_area, mostly so that the amount of black doesn’t overwhelm the eyes.

We can create a step chart as well.  This is helpful in gauging the magnitude of changes from period to period a little more clearly than on a line chart:

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) +
    geom_step()

A step chart of Germany’s GDP over time.

We can also compare what the step output looks like versus the line output.  In this case, I’m coloring the step output as red and leaving the line as black.

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) +
    geom_step(color="Red") +
    geom_line()

Combining line and step charts. Note how the area compares between the two, making the step changes look sharper than the line changes.

Taking another tack, let’s see what the spread for GDP per capita is across continents in the year 1997.  We will once again use the log of GDP, and will create a box and whiskers plot using geom_boxplot.

ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = continent, y = log(gdpPercap))) +
    geom_boxplot()

A box plot of GDP in 1997.

Conclusion

In today’s post, we looked at several of the geometric objects available within ggplot2.  We’re able to create simplistic but functional graphs with just two or three lines of code.  Starting with the next post, we’ll begin to improve some of these charts by looking at scales and coordinates.

The Grammar of Graphics

This is part one of a series on ggplot2.

I’m starting a new series on using ggplot2 to create high-quality visuals. But in order to understand why ggplot2 behaves the way it does, we need to understand a little bit about the grammar of graphics.  Leland Wilkinson published The Grammar of Graphics in 1999, with a revised edition in 2005.  By 2006, Hadley Wickham had created ggplot (as mentioned in his presentation A grammar of graphics:  past, present, and future) as an implementation of the grammar of graphics in R.  In 2010, Wickham published A Layered Grammar of Graphics, which explains the reasoning behind ggplot2.

In this first post of the series, I want to give you an idea of why we should think about the grammar of graphics.  From there, we’ll go into detail with ggplot2, starting simple and building up to more complex plots.  By the end of the series, I want to build high-quality, publication-worthy visuals.  With that flow in mind, let’s get started!

What Is The Grammar of Graphics?

First, my confession.  I haven’t read Wilkinson’s book and probably never will.  That’s not at all a knock on the book itself, but rather an indication that it is not for everybody, not even for everyone interested in data visualization.

Instead, we will start with Wickham’s paper on ggplot2.  This gives us the basic motivation behind the grammar of graphics by covering what a grammar does for us:  “A grammar provides a strong foundation for understanding a diverse range of graphics. A grammar may also help guide us on what a well-formed or correct graphic looks like, but there will still be many grammatically correct but nonsensical graphics. This is easy to see by analogy to the English language: good grammar is just the first step in creating a good sentence” (3).

With a language, we have different language components like nouns (which can be subjects, direct objects, or indirect objects), verbs, adjectives, adverbs, etc.  We put together combinations of those individual components to form complete sentences and transmit ideas.  Our particular word choice and language component usage will affect the likelihood of success in idea transmission, but to an extent, we can work iteratively on a sentence, switching words or adding phrases to get the point across the way we desire.

With graphics, we can do the same thing.  Instead of thinking of “a graph” as something which exists in and of itself, we should think of different objects that we combine into its final product:  a graph.

Implementing The Grammar

In the ggplot2 grammar, we have different layers of objects.  In some particular order, we have:

  1. The data itself, and a mapping explaining what portions of the data we want to represent parts of our graph.  This mapping is made up of things we see on the graph:  the aesthetics.  Aesthetic elements include the x axis, y axis, color, fill color, and so on.
  2. The statistical transformation we want to use.  For example, there are stats for boxplots, jitter, qq plots, and summarization (page 11).  Stats help you transform an input data frame into an output data frame that your plot can use, like generating a density function from an input data frame using stat_density().
  3. The geometric object (aka, geom) we want to draw.  This could be a histogram, a bar or column chart, a line chart, a radar chart, or whatever.  These relate closely to statistics.
  4. Scales and coordinates, which give us the axes and legends.
  5. Accompaniments to the visual.  These include data labels and annotations.  This is how we can mark specific points on the graph or give the graph a nice title.
  6. The ability to break our visual into facets, that is, splitting into multiple graphs.  If we have multiple graphs, we can see how different pieces of the data interact.

The key insight in ggplot2 is that these different layers are independent of one another:  you can change the geometric object from a line chart to a bar chart without needing to change the title, for example.  This lets you program graphs iteratively, starting with very simple graphs and adding on more polish as you go.

As a follow-on to this, you can choose more than one geometric object, for example.  So if you want to draw a column chart with a line chart in front of it, that’s two geometric objects and not one special line+column chart.  This lets you construct graphics at the level of complexity that you need.
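
As a hedged sketch of how those layers show up in code (using the gapminder data set that the rest of this series leans on; the breaks and title are arbitrary choices of mine), each line below adds one independent layer:

library(tidyverse)
library(gapminder)

ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +  # data plus aesthetic mapping
    geom_point(alpha = 0.1) +                                     # one geom:  points
    geom_smooth(method = "lm", se = FALSE) +                      # a second geom, layered on top
    scale_x_continuous(breaks = c(1952, 1977, 2002)) +            # a scale for the X axis
    labs(title = "Life expectancy over time") +                   # labels and annotations
    facet_wrap(~ continent)                                       # facets:  one panel per continent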

Conclusion

Even though the Wickham paper is nearing 8 years old by this point and the ggplot2 library has expanded considerably in the meantime, it remains a good introduction to the grammar of graphics and gives the motivation behind ggplot2.

Over the rest of this series, we will dig into ggplot2 in some detail, generating some low-quality images at first but building up to better and better things.

 

Passing VARBINARY Models Into ML Services

This is the final part of a three-part series.  In part one, we built an R model using ML Services.  In part two, we saved an R model using ML Services and also showed how to deal with multiple model types.  In this final part of the series, we’re going to try getting results for each model we pass in.

Where We Were

If you want to pick up at this point, here’s a quick script to get you back to where we were:

DROP TABLE IF EXISTS #ItemSales;
CREATE TABLE #ItemSales
(
	ItemID INT,
	SaleDate DATE,
	NumberSold INT,
	ShopperQuestions INT,
	EmployeeRecommendations INT,
	InchesOfPrintAdvertisement INT
);

INSERT INTO #ItemSales
(
	ItemID,
	SaleDate,
	NumberSold,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
)
VALUES
	(1, '2017-12-01', 7, NULL, NULL, NULL),
	(1, '2017-12-02', 4, NULL, NULL, NULL),
	(1, '2017-12-03', 4, NULL, NULL, NULL),
	(1, '2017-12-04', 1, NULL, NULL, NULL),
	(1, '2017-12-05', 3, NULL, NULL, NULL),
	(1, '2017-12-06', 3, NULL, NULL, NULL),
	(1, '2017-12-07', 5, NULL, NULL, NULL),
	(1, '2017-12-08', 6, NULL, NULL, NULL),
	(1, '2017-12-09', 2, NULL, NULL, NULL),
	(1, '2017-12-10', 2, NULL, NULL, NULL),
	(1, '2017-12-11', 1, NULL, NULL, NULL),
	(1, '2017-12-12', 4, NULL, NULL, NULL),
	(1, '2017-12-13', 3, NULL, NULL, NULL),
	(1, '2017-12-14', 2, NULL, NULL, NULL),
	(1, '2017-12-15', 3, NULL, NULL, NULL),
	(1, '2017-12-16', 5, NULL, NULL, NULL),
	(1, '2017-12-17', 2, NULL, NULL, NULL),
	(1, '2017-12-18', 5, NULL, NULL, NULL),
	(1, '2017-12-19', 6, NULL, NULL, NULL),
	(1, '2017-12-20', 4, NULL, NULL, NULL),
	(1, '2017-12-21', 5, NULL, NULL, NULL),
	(1, '2017-12-22', 2, NULL, NULL, NULL),
	(1, '2017-12-23', 1, NULL, NULL, NULL),
	(1, '2017-12-24', 3, NULL, NULL, NULL),
	(1, '2017-12-25', 2, NULL, NULL, NULL),
	(1, '2017-12-26', 2, NULL, NULL, NULL),
	(1, '2017-12-27', 2, NULL, NULL, NULL),
	(1, '2017-12-28', 7, NULL, NULL, NULL),
	(1, '2017-12-29', 3, NULL, NULL, NULL),
	(1, '2017-12-30', 5, NULL, NULL, NULL),
	(2, '2017-12-01', 52, 9, 6, 14),
	(2, '2017-12-02', 49, 7, 5, 15),
	(2, '2017-12-03', 53, 13, 6, 14),
	(2, '2017-12-04', 48, 8, 6, 13),
	(2, '2017-12-05', 66, 8, 9, 16),
	(2, '2017-12-06', 58, 8, 8, 15),
	(2, '2017-12-07', 70, 8, 10, 16),
	(2, '2017-12-08', 68, 8, 10, 16),
	(2, '2017-12-09', 43, 12, 3, 14),
	(2, '2017-12-10', 41, 13, 2, 15),
	(2, '2017-12-11', 25, 3, 1, 14),
	(2, '2017-12-12', 42, 2, 4, 15),
	(2, '2017-12-13', 32, 8, 2, 12),
	(2, '2017-12-14', 61, 11, 8, 15),
	(2, '2017-12-15', 58, 14, 6, 16),
	(2, '2017-12-16', 67, 10, 9, 15),
	(2, '2017-12-17', 57, 8, 7, 15),
	(2, '2017-12-18', 49, 8, 6, 13),
	(2, '2017-12-19', 46, 9, 5, 13),
	(2, '2017-12-20', 63, 5, 9, 15),
	(2, '2017-12-21', 45, 9, 4, 15),
	(2, '2017-12-22', 69, 8, 9, 17),
	(2, '2017-12-23', 70, 9, 11, 12),
	(2, '2017-12-24', 70, 2, 12, 14),
	(2, '2017-12-25', 55, 11, 7, 13),
	(2, '2017-12-26', 50, 5, 6, 16),
	(2, '2017-12-27', 64, 11, 9, 13),
	(2, '2017-12-28', 48, 5, 5, 15),
	(2, '2017-12-29', 62, 9, 8, 16),
	(2, '2017-12-30', 50, 4, 6, 15);

DROP TABLE IF EXISTS #Model;
CREATE TABLE #Model
(
	ItemID INT,
	Model VARBINARY(MAX)
);
GO
DECLARE
	@ItemID INT = 1,
	@Model VARBINARY(MAX);

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'
BuildPoisson = function(sales) {
	lambda <- mean(sales$NumberSold)
	class(lambda) <- append("Poisson Distribution", class(lambda))
	return(serialize(lambda, connection = NULL))
}

BuildLinearRegression = function(sales) {
	model <- lm(formula = NumberSold ~ ShopperQuestions + EmployeeRecommendations + InchesOfPrintAdvertisement, data = sales)
	class(model) <- append("Linear Regression", class(model))
	return(serialize(model, connection = NULL))
}

if (mean(Sales$NumberSold) < 6) {
    Model <- BuildPoisson(Sales)
} else {
    Model <- BuildLinearRegression(Sales)
}
',
	@input_data_1 = N'SELECT
	SaleDate,
	NumberSold,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
FROM #ItemSales
WHERE
	ItemID = @ItemID',
	@input_data_1_name = N'Sales',
	@params = N'@ItemID INT, @Model VARBINARY(MAX) OUTPUT',
	@ItemID = @ItemID,
	@Model = @Model OUTPUT;

DELETE
FROM #Model
WHERE ItemID = @ItemID;

INSERT INTO #Model
(
	ItemID,
	Model
)
VALUES
	(@ItemID, @Model);
GO

DECLARE
	@ItemID INT = 2,
	@Model VARBINARY(MAX);

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'
BuildPoisson = function(sales) {
	lambda <- mean(sales$NumberSold)
	class(lambda) <- append("Poisson Distribution", class(lambda))
	return(serialize(lambda, connection = NULL))
}

BuildLinearRegression = function(sales) {
	model <- lm(formula = NumberSold ~ ShopperQuestions + EmployeeRecommendations + InchesOfPrintAdvertisement, data = sales)
	class(model) <- append("Linear Regression", class(model))
	return(serialize(model, connection = NULL))
}

if (mean(Sales$NumberSold) < 6) {
    Model <- BuildPoisson(Sales)
} else {
    Model <- BuildLinearRegression(Sales)
}
',
	@input_data_1 = N'SELECT
	SaleDate,
	NumberSold,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
FROM #ItemSales
WHERE
	ItemID = @ItemID',
	@input_data_1_name = N'Sales',
	@params = N'@ItemID INT, @Model VARBINARY(MAX) OUTPUT',
	@ItemID = @ItemID,
	@Model = @Model OUTPUT;

DELETE
FROM #Model
WHERE ItemID = @ItemID;

INSERT INTO #Model
(
	ItemID,
	Model
)
VALUES
	(@ItemID, @Model);
GO

This will give us two models in the #Model table, one for our fancy watch (ItemID = 1) and one for our less fancy pen (ItemID = 2).

In the previous post, I showed how to pass in data for a single product and get some results.  Today, I want to expand that and show how to generate predictions for multiple products.

Setting The Stage

First, let’s load some data into the #InputData table:

DROP TABLE IF EXISTS #InputData;
CREATE TABLE #InputData
(
	ItemID INT,
	PredictionDate DATE,
	ShopperQuestions INT,
	EmployeeRecommendations INT,
	InchesOfPrintAdvertisement INT
);

INSERT INTO #InputData
(
	ItemID,
	PredictionDate,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
)
VALUES
	(1, '2018-01-01', NULL, NULL, NULL),
	(1, '2018-01-02', NULL, NULL, NULL),
	(1, '2018-01-03', NULL, NULL, NULL),
	(1, '2018-01-04', NULL, NULL, NULL),
	(1, '2018-01-05', NULL, NULL, NULL),
	(1, '2018-01-06', NULL, NULL, NULL),
	(1, '2018-01-07', NULL, NULL, NULL),
	(2, '2018-01-01', 8, 5, 14),
	(2, '2018-01-02', 12, 6, 14),
	(2, '2018-01-03', 10, 5, 14),
	(2, '2018-01-04', 8, 6, 14),
	(2, '2018-01-05', 7, 5, 14),
	(2, '2018-01-06', 9, 6, 14),
	(2, '2018-01-07', 15, 5, 14);

Notice  that I have seven records for each ItemID, so I want predictions per item for each of those days.  This is going to change my prediction function a bit.  Previously, I passed in the model (which we store as VARBINARY(MAX)) as a parameter, but now I have two models.

A First Test:  Join Models And InputData

The first answer to come to mind is to join the #Model and #InputData temp tables and pass into R a data frame containing the item ID, each prediction date, explanatory features (if required), and the model we want to use.  Inside R, I can get the distinct items and models and apply a function to generate predictions.

But it’s not quite that easy.  Let’s see what happens when I try to put together a simple data frame:

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'
OutputDataSet <- InputData
',
	@input_data_1 = N'
SELECT
	ItemID,
	Model
FROM #Model',
	@input_data_1_name = N'InputData'
	WITH RESULT SETS
	(
		(
			ItemID INT,
			Model VARBINARY(MAX)
		)
	);

I have my model as an input and want to spit it out at the end as well. But when I try that, I get an error:

Msg 39017, Level 16, State 3, Line 239
Input data query returns column #1 of type ‘varbinary(max)’ which is not supported by the runtime for ‘R’ script. Unsupported types are binary, varbinary, timestamp, datetime2, datetimeoffset, time, text, ntext, image, hierarchyid, xml, sql_variant and user-defined type.

So there goes that plan—I can output a VARBINARY(MAX) model, but I cannot input one.

A Hex On R Services

I can’t pass in binary data, but I can pass in a hex-based textual representation of that binary data, and SQL Server will do that for us. Before we try to do anything fancy with the model, let’s see what we have in R:

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'
print(nrow(InputData));
',
	@input_data_1 = N'
SELECT
	ItemID,
	CONVERT(VARCHAR(MAX), Model, 2) AS Model
FROM #Model',
	@input_data_1_name = N'InputData';

Note that the 2 parameter on the CONVERT function is critical.  When converting binary values, setting the style parameter to 1 will give us a hex string starting with “0x” but setting it to 2 will prevent SQL Server from prepending “0x” to our hex stream.  That will be important in a moment.

Anyhow, we get our two rows of input data, and now we have to convert that hex string back to binary.  That’s one thing that the wkb package does.  Let’s take a moment and install the wkb and tidyverse packages, just in case you don’t already have them:

EXEC sys.sp_execute_external_script
	@language = N'R',
	@script = N'install.packages("wkb")
install.packages("tidyverse")'
GO

Now I can use the wkb::hex2raw function to convert an incoming hex string into a raw format that R can use.  And this gets us back to where I wanted to be:  accepting a model as an input and doing something with that model:

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'library(dplyr)
OutputDataSet <- InputData %>%
					mutate(ModelBinary = wkb::hex2raw(as.character(Model))) %>%
					select(ItemID, ModelBinary)
',
	@input_data_1 = N'
SELECT
	ItemID,
	CONVERT(VARCHAR(MAX), Model, 2) AS Model
FROM #Model',
	@input_data_1_name = N'InputData'
	WITH RESULT SETS
	(
		(
			ItemID INT,
			Model VARBINARY(MAX)
		)
	);

A Basket Of Models

Now that we have figured out the mechanics, it’s time to put together the pieces of the puzzle.  I want to join #Model to #InputData and make R do some heavy lifting for me. Let’s first see the code and then I’ll explain it step by step.

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'library(dplyr)
PredictPoisson = function(item_id, model, InputData) {
	lambda <- model
	predictions <- rpois(nrow(InputData), lambda)
	return(data.frame(item_id, InputData$PredictionDate, predictions))
}

PredictLinear = function(item_id, model, InputData) {
	predictions <- round(predict(object = model, newdata = InputData))
	return(data.frame(item_id, InputData$PredictionDate, predictions))
}

MakePrediction = function(item, InputData) {
	model <- unserialize(wkb::hex2raw(as.character(item[2])))
	item_id <- as.numeric(item[1])
	input_data <- InputData %>%
					filter(ItemID == item_id)
	if (class(model)[1] == "Poisson Distribution") {
		print(sprintf("Using the Poisson Distribution for ItemID = %i", item_id))
		return(PredictPoisson(item_id, model, input_data))
	} else {
		print(sprintf("Using a Linear Regression for ItemID = %i", item_id))
		return(PredictLinear(item_id, model, input_data))
	}
}

items <- InputData %>%
			distinct(ItemID, Model) %>%
			select(ItemID, Model)

OutputDataSet <- do.call(rbind, apply(items, 1, MakePrediction, InputData))
',
	@input_data_1 = N'
SELECT
	m.ItemID,
	i.PredictionDate,
	i.ShopperQuestions,
	i.EmployeeRecommendations,
	i.InchesOfPrintAdvertisement,
	CONVERT(VARCHAR(MAX), m.Model, 2) AS Model
FROM #Model m
	INNER JOIN #InputData i
		ON i.ItemID  = m.ItemID',
	@input_data_1_name = N'InputData'
	WITH RESULT SETS
	(
		(
			ItemID INT,
			PredictionDate DATE,
			NumberPredicted INT
		)
	);

Let’s start with the bottom, as that’s most familiar to us.  The input data set input_data_1 is just a join of the #Model and #InputData tables.  We’re going to have 14 total rows, 7 for ItemID = 1 and 7 for ItemID = 2.  Each of those rows will have the model, so we’ll want to pull out the distinct models.  In an ideal world, we’d be able to pass multiple input sets into R (and it kind of seems like that was the intent, given the name input_data_1), but we can’t do that today.  At the end, I’m defining what I expect the result set to look like:  an item ID, a prediction date, and a number predicted.  That way I can get the predicted sales per item per day.  Now let’s talk about the R code.

Scrolling up to the top of the script, I load the dplyr library so I can use dplyr functions like the pipe.  The PredictPoisson and PredictLinear functions are very similar to the previous form, but I added in a new parameter:  item_id.  My plan is to take the incoming item ID and slap it on the data frame I create.  But because there are multiple items, I can’t simply set OutputDataSet inside the predict function like I did last time.  Instead, I’m building a separate data frame and returning that, making the calling function combine everything together.

That calling function is MakePrediction.  MakePrediction takes two parameters:  an item and the InputData data frame.  It’s expecting item to be a list with two elements:  ItemID as the first element and the hex string model as the second element.  I decided to define model and item_id inside the function to make things easier to read.  Then, I pulled out the rows from InputData which matter to the item in question.  Finally, I have my model type check:  if the class of the model is Poisson Distribution, then I want to print out that I’m going to use Poisson for that item and call the PredictPoisson  function, returning its resulting data frame to MakePrediction’s caller.  If the type is not Poisson, we know it is a linear regression and act accordingly.

Those are the three functions we’ve created.  After that, we get to the calling code.  First up, I create a data frame called items which is the distinct set of ItemIDs and Models.  This is our workaround for not having the ability to pull in multiple data frames, and it works well enough for this case.

From there, we splice together three R functions to avoid looping through our items data frame.  Let’s start from the inside and move out.  First, we have the apply function (learn more about apply from Sharon Machlis or Pete Werner).  I’m using the apply function to call the MakePrediction function for each row in my items data frame, passing in InputData along the way.  Because each call to MakePrediction returns a separate data frame, I need to call the rbind function to concatenate the data frames together.  Typically, rbind requires that you list each data frame that you want to combine together.  The do.call function figures that part out for me, calling rbind with the appropriate set of data frames.  And from there, I get my results:


Generating predictions for multiple items at once.
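
If the do.call(rbind, apply(...)) combination feels opaque, here is a tiny standalone illustration with made-up items rather than the models above; the ItemID and Multiplier columns are purely hypothetical.

items <- data.frame(ItemID = c(1, 2), Multiplier = c(10, 100))

MakeRow <- function(item) {
    # apply() hands us each row as a vector, much like MakePrediction receives above.
    data.frame(ItemID = as.numeric(item[1]),
               Result = as.numeric(item[1]) * as.numeric(item[2]))
}

# apply() returns a list with one data frame per row; do.call() hands that whole
# list to rbind() in a single call, stitching the frames together.
do.call(rbind, apply(items, 1, MakeRow))
#   ItemID Result
# 1      1     10
# 2      2    200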

Conclusion

In this post, we learned two very important things.  First, you cannot use VARBINARY or BINARY as an input to an R script in SQL Server Machine Learning Services.  You can, however, convert that binary to hex, pass it in as a string, and convert it back to binary using the wkb library.

Second, using a bit of functional programming mojo, we can perform the equivalent of several foreach loops in one line of code—and in R, these set-based operations tend to be faster than their iterative equivalents.

This wraps up my series on modeling in Machine Learning Services.  The code in this series isn’t up to production quality, of course (particularly around error handling), but I hope it gives you enough to see how you can integrate SQL and R.

Saving An R Model Using ML Services

Last time, we looked at building a simple Poisson model to forecast sales.  Today, we’re going to expand upon this, building out a model that we can store in the database.  We’ll also introduce a second type of model for a more frequently-purchased item.

Storing A Model

Let’s say that we want to take our beautiful Poisson model and save it for posterity.  Let’s re-create that data set once again, though I’m adding a couple of things we’ll need soon enough.

CREATE TABLE #ItemSales
(
	ItemID INT,
	SaleDate DATE,
	NumberSold INT,
	ShopperQuestions INT,
	EmployeeRecommendations INT,
	InchesOfPrintAdvertisement INT
);

INSERT INTO #ItemSales
(
	ItemID,
	SaleDate,
	NumberSold
)
VALUES
	(1, '2017-12-01', 7),
	(1, '2017-12-02', 4),
	(1, '2017-12-03', 4),
	(1, '2017-12-04', 1),
	(1, '2017-12-05', 3),
	(1, '2017-12-06', 3),
	(1, '2017-12-07', 5),
	(1, '2017-12-08', 6),
	(1, '2017-12-09', 2),
	(1, '2017-12-10', 2),
	(1, '2017-12-11', 1),
	(1, '2017-12-12', 4),
	(1, '2017-12-13', 3),
	(1, '2017-12-14', 2),
	(1, '2017-12-15', 3),
	(1, '2017-12-16', 5),
	(1, '2017-12-17', 2),
	(1, '2017-12-18', 5),
	(1, '2017-12-19', 6),
	(1, '2017-12-20', 4),
	(1, '2017-12-21', 5),
	(1, '2017-12-22', 2),
	(1, '2017-12-23', 1),
	(1, '2017-12-24', 3),
	(1, '2017-12-25', 2),
	(1, '2017-12-26', 2),
	(1, '2017-12-27', 2),
	(1, '2017-12-28', 7),
	(1, '2017-12-29', 3),
	(1, '2017-12-30', 5);

I’ve added an Item ID to represent which product we’re looking at, and 1 will be our fancy watch.  I’ve also added a few measures that we will not talk about in front of the watch—it’s not polite.

We used sp_execute_external_script to build a model and generate some number of days worth of predictions.  Now I’d like to break that up into two operations:  training a model and generating predictions.  The biggest reason I might want to do this is if model generation is very expensive relative to prediction generation.  It’s obviously not very expensive to build a random number generator following a Poisson distribution, but imagine we had a complicated neural net that took hours or days to train—we’d want to use that model over and over, as long as the model came close enough to representing the underlying reality.

So now we’re going to create a model.  We’ll obviously need a table to store those models, so let’s create one now:

CREATE TABLE #Model
(
	ItemID INT,
	Model VARBINARY(MAX)
);

Now we’re going to build an R script which takes our input data set and builds a resulting model.  We want to serialize the model so that it gets converted to VARBINARY(MAX).  That way, we can safely store the model in SQL Server, regardless of whether it’s a Poisson distribution, a neural net, or anything else.

DECLARE
	@ItemID INT = 1,
	@Model VARBINARY(MAX);

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'lambda <- mean(Sales$NumberSold)
class(lambda) <- append("Poisson Distribution", class(lambda))
Model <- serialize(lambda, connection = NULL)
',
	@input_data_1 = N'SELECT
	SaleDate,
	NumberSold
FROM #ItemSales
WHERE
	ItemID = @ItemID',
	@input_data_1_name = N'Sales',
	@params = N'@ItemID INT, @Model VARBINARY(MAX) OUTPUT',
	@ItemID = @ItemID,
	@Model = @Model OUTPUT;

DELETE
FROM #Model
WHERE ItemID = @ItemID;

INSERT INTO #Model 
(
	ItemID,
	Model
)
VALUES
	(@ItemID, @Model);
GO

Our model is incredibly simple:  we just need lambda, so we’re going to serialize that parameter and return it to the user.  Note that I append “Poisson Distribution” to the class before serializing.  That way I can note what the underlying distribution is, so when I start to incorporate multiple models, it’ll be easy to differentiate them.

Using The Model

Now that we have a model in place, let’s write a function which accepts the model and generates predictions off of it.

DECLARE
	@ItemID INT = 1,
	@Model VARBINARY(MAX);
	
SELECT
	@Model = Model
FROM #Model m
WHERE
	m.ItemID = @ItemID;	

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'lambda <- unserialize(Model)
predictions <- rpois(PredictionLookaheadInDays, lambda)
dates <- format(as.POSIXlt(seq(from = StartDate, by = "day", length.out = PredictionLookaheadInDays)))
OutputDataSet <- data.frame(dates, predictions)
',
	@params = N'@Model VARBINARY(MAX), @PredictionLookaheadInDays INT, @StartDate DATE',
	@Model = @Model,
	@PredictionLookaheadInDays = 7,
	@StartDate = '2017-01-01'
	WITH RESULT SETS
	(
		(
			SaleDate DATE,
			NumberPredicted INT
		)
	);

Now we have a procedure call which gets us back predictions based on our model.  I don’t need to pass in the input data anymore, and can instead simply give it a model.

A New Product And A New Type

Now let’s introduce a second item.  Instead of this being a fancy watch, it is instead a moderately fancy pen.  We sell dozens and dozens of these moderately fancy pens per day, so that’s way more than what we could safely predict with a Poisson distribution.  We’re going to need some more data.

Fortunately, we know that there are three factors which help determine the number of pens sold:  the number of shoppers who come inquire about the pen, the number of times our sales clerk recommends this pen, and the number of inches of print advertising in our local newspaper (because the Internet is a fad, people).  Anyhow, armed with our vital analytical measures, we can move forward and insert product number 2 into our table:

INSERT INTO #ItemSales 
(
	ItemID,
	SaleDate,
	NumberSold,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
)
VALUES
	(2, '2017-12-01', 52, 9, 6, 14),
	(2, '2017-12-02', 49, 7, 5, 15),
	(2, '2017-12-03', 53, 13, 6, 14),
	(2, '2017-12-04', 48, 8, 6, 13),
	(2, '2017-12-05', 66, 8, 9, 16),
	(2, '2017-12-06', 58, 8, 8, 15),
	(2, '2017-12-07', 70, 8, 10, 16),
	(2, '2017-12-08', 68, 8, 10, 16),
	(2, '2017-12-09', 43, 12, 3, 14),
	(2, '2017-12-10', 41, 13, 2, 15),
	(2, '2017-12-11', 25, 3, 1, 14),
	(2, '2017-12-12', 42, 2, 4, 15),
	(2, '2017-12-13', 32, 8, 2, 12),
	(2, '2017-12-14', 61, 11, 8, 15),
	(2, '2017-12-15', 58, 14, 6, 16),
	(2, '2017-12-16', 67, 10, 9, 15),
	(2, '2017-12-17', 57, 8, 7, 15),
	(2, '2017-12-18', 49, 8, 6, 13),
	(2, '2017-12-19', 46, 9, 5, 13),
	(2, '2017-12-20', 63, 5, 9, 15),
	(2, '2017-12-21', 45, 9, 4, 15),
	(2, '2017-12-22', 69, 8, 9, 17),
	(2, '2017-12-23', 70, 9, 11, 12),
	(2, '2017-12-24', 70, 2, 12, 14),
	(2, '2017-12-25', 55, 11, 7, 13),
	(2, '2017-12-26', 50, 5, 6, 16),
	(2, '2017-12-27', 64, 11, 9, 13),
	(2, '2017-12-28', 48, 5, 5, 15),
	(2, '2017-12-29', 62, 9, 8, 16),
	(2, '2017-12-30', 50, 4, 6, 15);

We now have two different types of products to model:  simple products which fit a Poisson and more complicated products which we’re going to model with a linear regression.

A New Model

So now we’d like to have two models:  a Poisson distribution and a linear model based on regressing three numeric variables against our sales data.  We’ll need some way to figure out whether to use a Poisson model or a linear regression model.  For a real system, you’d want to split the data into training and test data sets, train each model, and compare them against the test data set to see which fits it closest.  But for our case, the separator will be simple:  if the mean number of items sold per day is less than 6, then we will go with the Poisson distribution.  If the mean is greater than or equal to six, then we will build a linear regression.  The following code does this for us:

DECLARE
	@ItemID INT = 1,
	@Model VARBINARY(MAX);

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'
BuildPoisson = function(sales) {
	lambda <- mean(sales$NumberSold)
	class(lambda) <- append("Poisson Distribution", class(lambda))
	return(serialize(lambda, connection = NULL))
}

BuildLinearRegression = function(sales) {
	model <- lm(formula = NumberSold ~ ShopperQuestions + EmployeeRecommendations + InchesOfPrintAdvertisement, data = sales)
	class(model) <- append("Linear Regression", class(model))
	return(serialize(model, connection = NULL))
}

if (mean(Sales$NumberSold) < 6) {
    Model <- BuildPoisson(Sales)
} else {
    Model <- BuildLinearRegression(Sales)
}
',
	@input_data_1 = N'SELECT
	SaleDate,
	NumberSold,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
FROM #ItemSales
WHERE
	ItemID = @ItemID',
	@input_data_1_name = N'Sales',
	@params = N'@ItemID INT, @Model VARBINARY(MAX) OUTPUT',
	@ItemID = @ItemID,
	@Model = @Model OUTPUT;

DELETE
FROM #Model
WHERE ItemID = @ItemID;

INSERT INTO #Model 
(
	ItemID,
	Model
)
VALUES
	(@ItemID, @Model);
GO

If you run this once for ItemID 1 and once for ItemID 2, you’ll have two models in the #Model temp table.  Not surprisingly, the Poisson distribution’s binary model size is about 2 orders of magnitude smaller than the linear regression’s (106 bytes versus 16,510 bytes in my scenario).  For larger models, you can use the COMPRESS() and DECOMPRESS() functions in SQL Server to shrink model size when writing to disk.

Now that we have the model serialized, we need a slightly smarter deserialization function.

DECLARE
	@ItemID INT = 2,
	@Model VARBINARY(MAX);
	
SELECT
	@Model = Model
FROM #Model m
WHERE
	m.ItemID = @ItemID;	

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'
PredictPoisson = function(model, InputData) {
	lambda <- model
	predictions <- rpois(nrow(InputData), lambda)
	return(data.frame(InputData$PredictionDate, predictions))
}

PredictLinear = function(model, InputData) {
	predictions <- round(predict(object = model, newdata = InputData))
	return(data.frame(InputData$PredictionDate, predictions))
}

model <- unserialize(Model)
if (class(model)[1] == "Poisson Distribution") {
	print("Using the Poisson Distribution")
	OutputDataSet <- PredictPoisson(model, InputData)
} else {
	print("Using a Linear Regression")
	OutputDataSet <- PredictLinear(model, InputData)
}',
	@input_data_1 = N'
SELECT
	PredictionDate,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
FROM #InputData',
	@input_data_1_name = N'InputData',
	@params = N'@Model VARBINARY(MAX)',
	@Model = @Model
	WITH RESULT SETS
	(
		(
			PredictionDate DATE,
			NumberPredicted INT
		)
	);

Instead of passing in the prediction lookahead in days and the start date, let’s pass in a set of the days we want to predict.  For the linear regression, we will need to include a few more features, namely number of shopper questions, number of employee recommendations, and inches of print advertisement.  Let’s say that we have some way of estimating the first two (way to assume away the problem, me!) and we fix the print advertisement to 14″ per day because that’s our preferred print size.

When we run this, we get back the prediction dates and predicted values for our respective models.  Try the code with @ItemID = 1 and @ItemID = 2 and you can see that we do make use of the correct model for each item.

Conclusion

In today’s post, we expanded from using one type of model to two types of models.  We’re still focused on predicting one item at a time, though.  Next time, in the thrilling conclusion, we’ll see how to predict for both of these items at once.

Building An R Model Using ML Services

This will be part one of a three-part series.  During the course of this, we will look at building a simple model in R, extending our process to cover two models, and then making predictions on already-existing models.  In this first post, I will build a simple model.

Telling A Story

Let’s say that you work at an upscale boutique which sells a few items.  Your boss wants to figure out how many of each item to stock over the course of the week and has given you some data to play with.  The first item on the list is a fancy watch.  Here’s a table of how often that watch has sold over the past 30 days:

CREATE TABLE #ItemSales
(
	SaleDate DATE,
	NumberSold INT
);

INSERT INTO #ItemSales
(
	SaleDate,
	NumberSold
)
VALUES
	('2017-12-01', 7),
	('2017-12-02', 4),
	('2017-12-03', 4),
	('2017-12-04', 1),
	('2017-12-05', 3),
	('2017-12-06', 3),
	('2017-12-07', 5),
	('2017-12-08', 6),
	('2017-12-09', 2),
	('2017-12-10', 2),
	('2017-12-11', 1),
	('2017-12-12', 4),
	('2017-12-13', 3),
	('2017-12-14', 2),
	('2017-12-15', 3),
	('2017-12-16', 5),
	('2017-12-17', 2),
	('2017-12-18', 5),
	('2017-12-19', 6),
	('2017-12-20', 4),
	('2017-12-21', 5),
	('2017-12-22', 2),
	('2017-12-23', 1),
	('2017-12-24', 3),
	('2017-12-25', 2),
	('2017-12-26', 2),
	('2017-12-27', 2),
	('2017-12-28', 7),
	('2017-12-29', 3),
	('2017-12-30', 5);

Glancing at the numbers, daily sales range from 1 to 7, so there isn’t very much variance in day-to-day sales.  We don’t see that much variance in our week-to-week sales either, with weekly sales counts of 27, 20, 30, and 19 over our four full weeks.  Basically, we’re looking at selling somewhere around 3-4 items per day.  We might be able to figure out what’s behind the differences in sales (like why we had 7 sales on the 28th of December but only 1 on the 23rd), but this is actually a pretty good case for my favorite distribution.
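A few lines of R reproduce that arithmetic and also hint at why this is such a good case:  the daily mean and variance land very close to one another.

# Daily sales from the table above.
number_sold <- c(7, 4, 4, 1, 3, 3, 5, 6, 2, 2, 1, 4, 3, 2, 3,
                 5, 2, 5, 6, 4, 5, 2, 1, 3, 2, 2, 2, 7, 3, 5)

# Totals for the four full weeks: 27, 20, 30, 19.
sapply(split(number_sold[1:28], rep(1:4, each = 7)), sum)

# The mean (about 3.47) and variance (about 3.02) are nearly equal,
# which is exactly the property the next section relies on.
mean(number_sold)
var(number_sold)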

The Poisson Distribution

I am a huge fan of the Poisson distribution.  It is special in that its one parameter (lambda) represents both the mean and the variance of the distribution.  In the limit, as lambda grows large, a Poisson distribution approaches a normal distribution.  But it’s most useful in helping us model infrequently-occurring events, like selling 3-4 watches per day.

Estimating a Poisson is also easy in R:  lambda is simply the mean of your sample counts, and there is a function called rpois() which takes two parameters:  the number of random values you want to generate and the value of lambda.
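As a quick sketch, using the 30 days of sales above (which total 104 units):

# Lambda is just the sample mean of daily sales: 104 units over 30 days.
lambda <- 104 / 30
# Draw a week of simulated daily sales counts from that distribution.
rpois(7, lambda)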

So what I want to do is take my data from SQL Server, feed it into R, and get back a prediction for the next seven days.

Executing An External Stored Procedure

If you have not installed SQL Server ML Services, I’m not going to cover the installation process here; you can find documentation and good examples out there already.

Here’s the stored procedure call.  I’m passing in the date I want predictions to start and the number of days to predict, and I will get back a data frame with the date and our Poisson model’s prediction for the number of watches sold:

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'
# Lambda for the Poisson is just the mean of the observed daily sales.
lambda <- mean(Sales$NumberSold)
# Draw one simulated daily sales count per day in the forecast window.
predictions <- rpois(PredictionLookaheadInDays, lambda)
# Build the matching sequence of prediction dates, starting from StartDate.
dates <- format(as.POSIXlt(seq(from = StartDate, by = "day", length.out = PredictionLookaheadInDays)))
OutputDataSet <- data.frame(dates, predictions)
',
	@input_data_1 = N'SELECT
	SaleDate,
	NumberSold
FROM #ItemSales',
	@input_data_1_name = N'Sales',
	@params = N'@PredictionLookaheadInDays INT, @StartDate DATE',
	@PredictionLookaheadInDays = 7,
	@StartDate = '2018-01-01'
	WITH RESULT SETS
	(
		(
			SaleDate DATE,
			NumberPredicted INT
		)
	);

I’d like to emphasize that this model is not very good at predicting how many watches we’ll sell in a single day, but it does a much better job of predicting how many we’ll sell in a week.  We’re generating random numbers pulled from a Poisson distribution, so the longer the timeframe we look at, the closer our model will be to reality (assuming that reality in this case follows a Poisson distribution with the same lambda).
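A quick simulation shows that effect, under the assumption that reality really does follow a Poisson with the same lambda:  relative to its mean, a weekly total wanders far less than a single day’s draw.

set.seed(1234)
lambda <- 104 / 30                   # sample mean from the sales table
daily  <- rpois(10000, lambda)       # 10,000 simulated single days
weekly <- rpois(10000, 7 * lambda)   # 10,000 simulated weeks (a sum of independent Poissons is Poisson)

# Coefficient of variation: standard deviation relative to the mean.
sd(daily) / mean(daily)     # about 0.54
sd(weekly) / mean(weekly)   # about 0.20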

Conclusion

This wraps up part one of the series.  So far, we’ve generated some results following a Poisson.  Next, I’m going to build a model that we can store in the database; that way, I don’t need to keep passing in all of my data.

The Feasel Challenge

This is part eight of a series on dashboard visualization.

Well, at this point we’ve nearly crossed the finish line.  The last part of this series covers The Feasel Challenge, which has given rise to the single most important dashboard I have:  the Victory Burrito dashboard.  My challenge is to eat a Victory Burrito in 50 different cities around the world.  A Victory Burrito is defined as a burrito that I eat after a speaking engagement.  I’d like to track my Victory Burrito consumption and see how close I am to beating this challenge.  Specifically, I have four measures that I’d like to track:

  1. How close I am to 50 distinct cities
  2. Where I have had a victory burrito (city, state, country)
  3. Breakdown of victory burritos by chain/restaurant
  4. How many victory burritos I’ve had per year (not unique cities)

Because the data doesn’t update that frequently (maybe once a week), I can create a strategic dashboard which refreshes periodically, like whenever I feel like updating the underlying data source.  Knowing that it’s a strategic dashboard helps me define how I want the thing to look and what sorts of visuals fit the motif.

Cousin Joey caught wind of what I was doing and offered to create a dashboard for me.  Now, Cousin Joey doesn’t know much about dashboards but was just learning about Power BI and wanted to do something cool, so I decided to let Cousin Joey run wild.  Here is what he came back with:

Bad Dashboard

Cousin Joey is now banned from creating dashboards.

This is close to the worst dashboard possible, and it took Cousin Joey more time than you’d think to come up with something this awful.  Let me enumerate the problems:

  1. There is a totally unnecessary background image which is a bit distracting.
  2. The burrito tile on the sides is awful and needs to go.
  3. We don’t need to know that Cousin Joey (why does he refer to himself as Cousin Joey?  Wouldn’t he just think of himself as Joey?) made this dashboard.
  4. We really don’t even need a title, and if we did, it wouldn’t sound like a third-rate wrestling competition.
  5. My eye gets twitchy when I see that pie chart, and the donut chart isn’t doing much better.  Those are awful choices to display this information.
  6. The waterfall chart is also a bad use of a good visual—burritos eaten by year is a count which is always positive, and years are independent of one another.  Waterfall charts are good for things like profit and loss metrics over a time frame, but not good in this case.
  7. The word cloud is never a good choice for a dashboard.
  8. This dashboard fails to answer the most important of my four questions:  how many unique cities have I hit so far?

Here is a better version of that dashboard:

Better Dashboard

Made by Cousin Joey?  Nope.

So let’s talk about the changes.  In the top-left corner, I have a gauge.  As I’ve mentioned, gauges are very risky, but in this case, it tells me my progress toward the goal.  My official goal is 50, but I wanted to leave room to go above and beyond, so the gauge tops out at 60.  I’m currently at 34 cities, though that number should increase by 3 over the course of the next two months.

Then, moving to the right, I have a line chart and a bar chart.  The line chart shows a count of burritos per year.  I chose a line chart for this because I’m thinking about time series data and want to watch the ups and downs.  This shows that I slacked off a little bit in 2017 in the burrito category (though to be fair, you try finding a Mexican restaurant in Vienna!), so it gives me a little motivation for 2018.

To the right of it, I have a bar chart of burritos eaten by country.  The US dominates this, but I expect (or at least hope) that I’ll get a chance to add a few more countries to the list this year.

In the bottom right corner, I have a treemap which shows burritos by state/province and city, so I can break it down further.  As the number of countries increases, I can squish this treemap in, or even replace it with a different kind of visual.

Finally, in the bottom left, I have a breakdown of burritos by restaurant.  It’s not that I like Chipotle more than anything else; it’s that you can find a Chipotle nearby in most American cities.  I do plan to make some effort in increasing my Victory Burrito restaurant choices this upcoming year, and that chart gives me some motivation to do so.  One thing that I might think about doing is bundling up the single-serving entries into one category and labeling it as “Others” or something.  If I start getting a space crunch, that will buy me some room.

I’m not going to claim that this dashboard is perfect.  It does, however, answer my four important questions and helps me tell a story of a man traveling around in search of a good burrito.  If you have thoughts on how to make the dashboard even better, let me know in the comments.  And if you want Cousin Joey’s contact information for dashboard consulting, I’m sure I can dig it up…