Saving An R Model Using ML Services

Last time, we looked at building a simple Poisson model to forecast sales.  Today, we’re going to expand upon this, building out a model that we can store in the database.  We’ll also introduce a second type of model for a more frequently-purchased item.

Storing A Model

Let’s say that we want to take our beautiful Poisson model and save it for posterity.  Let’s re-create that data set once again, though I’m adding a couple of things we’ll need soon enough.

CREATE TABLE #ItemSales
(
	ItemID INT,
	SaleDate DATE,
	NumberSold INT,
	ShopperQuestions INT,
	EmployeeRecommendations INT,
	InchesOfPrintAdvertisement INT
);

INSERT INTO #ItemSales
(
	ItemID,
	SaleDate,
	NumberSold
)
VALUES
	(1, '2017-12-01', 7),
	(1, '2017-12-02', 4),
	(1, '2017-12-03', 4),
	(1, '2017-12-04', 1),
	(1, '2017-12-05', 3),
	(1, '2017-12-06', 3),
	(1, '2017-12-07', 5),
	(1, '2017-12-08', 6),
	(1, '2017-12-09', 2),
	(1, '2017-12-10', 2),
	(1, '2017-12-11', 1),
	(1, '2017-12-12', 4),
	(1, '2017-12-13', 3),
	(1, '2017-12-14', 2),
	(1, '2017-12-15', 3),
	(1, '2017-12-16', 5),
	(1, '2017-12-17', 2),
	(1, '2017-12-18', 5),
	(1, '2017-12-19', 6),
	(1, '2017-12-20', 4),
	(1, '2017-12-21', 5),
	(1, '2017-12-22', 2),
	(1, '2017-12-23', 1),
	(1, '2017-12-24', 3),
	(1, '2017-12-25', 2),
	(1, '2017-12-26', 2),
	(1, '2017-12-27', 2),
	(1, '2017-12-28', 7),
	(1, '2017-12-29', 3),
	(1, '2017-12-30', 5);

I’ve added an Item ID to represent which product we’re looking at, and 1 will be our fancy watch.  I’ve also added a few measures that we will not talk about in front of the watch—it’s not polite.

We used sp_execute_external_script to build a model and generate some number of days' worth of predictions.  Now I’d like to break that up into two operations:  training a model and generating predictions.  The biggest reason I might want to do this is if model generation is very expensive relative to prediction generation.  It’s obviously not very expensive to build a random number generator following a Poisson distribution, but imagine we had a complicated neural net that took hours or days to train—we’d want to use that model over and over, as long as the model came close enough to representing the underlying reality.

So now we’re going to create a model.  We’ll obviously need a table to store those models, so let’s create one now:

CREATE TABLE #Model
(
	ItemID INT,
	Model VARBINARY(MAX)
);

Now we’re going to build an R script which takes our input data set and builds a resulting model.  We want to serialize the model so that it gets converted to VARBINARY(MAX).  That way, we can safely store the model in SQL Server, regardless of whether it’s a Poisson distribution, a neural net, or anything else.

DECLARE
	@ItemID INT = 1,
	@Model VARBINARY(MAX);

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'lambda <- mean(Sales$NumberSold)
class(lambda) <- append("Poisson Distribution", class(lambda))
Model <- serialize(lambda, connection = NULL)
',
	@input_data_1 = N'SELECT
	SaleDate,
	NumberSold
FROM #ItemSales
WHERE
	ItemID = @ItemID',
	@input_data_1_name = N'Sales',
	@params = N'@ItemID INT, @Model VARBINARY(MAX) OUTPUT',
	@ItemID = @ItemID,
	@Model = @Model OUTPUT;

DELETE
FROM #Model
WHERE ItemID = @ItemID;

INSERT INTO #Model 
(
	ItemID,
	Model
)
VALUES
	(@ItemID, @Model);
GO

Our model is incredibly simple:  we just need lambda, so we’re going to serialize that parameter and return it to the user.  Note that I tag the object with “Poisson Distribution” before serializing; the append() call puts it in the first position of the class vector.  That way I can note what the underlying distribution is, so when I start to incorporate multiple models, it’ll be easy to differentiate them.
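To see why this tagging is safe, here is a quick standalone R sketch (run outside of SQL Server, with made-up sales numbers) showing that the class tag survives a serialize/unserialize round trip:

```r
# Build a toy "model": lambda for a Poisson distribution
lambda <- mean(c(7, 4, 4, 1, 3))

# Tag the object; append() puts "Poisson Distribution" first in the class vector
class(lambda) <- append("Poisson Distribution", class(lambda))

# Round-trip through serialization, as we do when storing the model in SQL Server
raw_model <- serialize(lambda, connection = NULL)
restored <- unserialize(raw_model)

class(restored)[1]  # "Poisson Distribution"
```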

Using The Model

Now that we have a model in place, let’s write a script which accepts the model and generates predictions from it.

DECLARE
	@ItemID INT = 1,
	@Model VARBINARY(MAX);
	
SELECT
	@Model = Model
FROM #Model m
WHERE
	m.ItemID = @ItemID;	

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'lambda <- unserialize(Model)
predictions <- rpois(PredictionLookaheadInDays, lambda)
dates <- format(as.POSIXlt(seq(from = StartDate, by = "day", length.out = PredictionLookaheadInDays)))
OutputDataSet <- data.frame(dates, predictions)
',
	@params = N'@Model VARBINARY(MAX), @PredictionLookaheadInDays INT, @StartDate DATE',
	@Model = @Model,
	@PredictionLookaheadInDays = 7,
	@StartDate = '2018-01-01'
	WITH RESULT SETS
	(
		(
			SaleDate DATE,
			NumberPredicted INT
		)
	);

Now we have a procedure call which gets us back predictions based on our model.  I don’t need to pass in the input data anymore, and can instead simply give it a model.
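Once this pattern settles in, you might wrap the prediction call in a stored procedure so callers only have to supply an item and a forecast window.  Here is a minimal sketch; the dbo.Model permanent table and the procedure name are my own assumptions (temp tables do not survive the session, so a real system would store models somewhere permanent):

```sql
CREATE PROCEDURE dbo.GetSalesPredictions
	@ItemID INT,
	@PredictionLookaheadInDays INT = 7,
	@StartDate DATE
AS
BEGIN
	DECLARE @Model VARBINARY(MAX);

	-- Assumes models now live in a permanent dbo.Model table
	SELECT
		@Model = m.Model
	FROM dbo.Model m
	WHERE
		m.ItemID = @ItemID;

	EXEC sp_execute_external_script
		@language = N'R',
		@script = N'lambda <- unserialize(Model)
predictions <- rpois(PredictionLookaheadInDays, lambda)
dates <- format(as.POSIXlt(seq(from = StartDate, by = "day", length.out = PredictionLookaheadInDays)))
OutputDataSet <- data.frame(dates, predictions)
',
		@params = N'@Model VARBINARY(MAX), @PredictionLookaheadInDays INT, @StartDate DATE',
		@Model = @Model,
		@PredictionLookaheadInDays = @PredictionLookaheadInDays,
		@StartDate = @StartDate
		WITH RESULT SETS
		(
			(
				SaleDate DATE,
				NumberPredicted INT
			)
		);
END;
```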

A New Product And A New Type

Now let’s introduce a second item.  Instead of this being a fancy watch, it is instead a moderately fancy pen.  We sell dozens and dozens of these moderately fancy pens per day, so that’s way more than what we could safely predict with a Poisson distribution.  We’re going to need some more data.

Fortunately, we know that there are three factors which help determine the number of pens sold:  the number of shoppers who come inquire about the pen, the number of times our sales clerk recommends this pen, and the number of inches of print advertising in our local newspaper (because the Internet is a fad, people).  Anyhow, armed with our vital analytical measures, we can move forward and insert product number 2 into our table:

INSERT INTO #ItemSales 
(
	ItemID,
	SaleDate,
	NumberSold,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
)
VALUES
	(2, '2017-12-01', 52, 9, 6, 14),
	(2, '2017-12-02', 49, 7, 5, 15),
	(2, '2017-12-03', 53, 13, 6, 14),
	(2, '2017-12-04', 48, 8, 6, 13),
	(2, '2017-12-05', 66, 8, 9, 16),
	(2, '2017-12-06', 58, 8, 8, 15),
	(2, '2017-12-07', 70, 8, 10, 16),
	(2, '2017-12-08', 68, 8, 10, 16),
	(2, '2017-12-09', 43, 12, 3, 14),
	(2, '2017-12-10', 41, 13, 2, 15),
	(2, '2017-12-11', 25, 3, 1, 14),
	(2, '2017-12-12', 42, 2, 4, 15),
	(2, '2017-12-13', 32, 8, 2, 12),
	(2, '2017-12-14', 61, 11, 8, 15),
	(2, '2017-12-15', 58, 14, 6, 16),
	(2, '2017-12-16', 67, 10, 9, 15),
	(2, '2017-12-17', 57, 8, 7, 15),
	(2, '2017-12-18', 49, 8, 6, 13),
	(2, '2017-12-19', 46, 9, 5, 13),
	(2, '2017-12-20', 63, 5, 9, 15),
	(2, '2017-12-21', 45, 9, 4, 15),
	(2, '2017-12-22', 69, 8, 9, 17),
	(2, '2017-12-23', 70, 9, 11, 12),
	(2, '2017-12-24', 70, 2, 12, 14),
	(2, '2017-12-25', 55, 11, 7, 13),
	(2, '2017-12-26', 50, 5, 6, 16),
	(2, '2017-12-27', 64, 11, 9, 13),
	(2, '2017-12-28', 48, 5, 5, 15),
	(2, '2017-12-29', 62, 9, 8, 16),
	(2, '2017-12-30', 50, 4, 6, 15);

We now have two different types of products to model:  simple products which fit a Poisson and more complicated products which we’re going to model with a linear regression.

A New Model

So now we’d like to have two models:  a Poisson distribution and a linear model based on regressing three numeric variables against our sales data.  We’ll need some way to figure out whether to use a Poisson model or a linear regression model.  For a real system, you’d want to split the data into training and test data sets, train each model, and compare the two against the test data set to see which fits more closely.  But for our case, the separator will be simple:  if the mean number of items sold per day is less than six, then we will go with the Poisson distribution.  If the mean is greater than or equal to six, then we will build a linear regression.  The following code does this for us:

DECLARE
	@ItemID INT = 1,
	@Model VARBINARY(MAX);

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'
BuildPoisson <- function(sales) {
	lambda <- mean(sales$NumberSold)
	class(lambda) <- append("Poisson Distribution", class(lambda))
	return(serialize(lambda, connection = NULL))
}

BuildLinearRegression <- function(sales) {
	model <- lm(formula = NumberSold ~ ShopperQuestions + EmployeeRecommendations + InchesOfPrintAdvertisement, data = sales)
	class(model) <- append("Linear Regression", class(model))
	return(serialize(model, connection = NULL))
}

if (mean(Sales$NumberSold) < 6) {
    Model <- BuildPoisson(Sales)
} else {
    Model <- BuildLinearRegression(Sales)
}
',
	@input_data_1 = N'SELECT
	SaleDate,
	NumberSold,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
FROM #ItemSales
WHERE
	ItemID = @ItemID',
	@input_data_1_name = N'Sales',
	@params = N'@ItemID INT, @Model VARBINARY(MAX) OUTPUT',
	@ItemID = @ItemID,
	@Model = @Model OUTPUT;

DELETE
FROM #Model
WHERE ItemID = @ItemID;

INSERT INTO #Model 
(
	ItemID,
	Model
)
VALUES
	(@ItemID, @Model);
GO

If you run this once for ItemID 1 and once for ItemID 2, you’ll have two models in the #Model temp table.  Not surprisingly, the Poisson distribution’s binary model size is about two orders of magnitude smaller than the linear regression’s (106 bytes versus 16,510 bytes in my scenario).  For larger models, you can use the COMPRESS() and DECOMPRESS() functions in SQL Server to shrink model size when writing to disk.
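As a sketch, assuming the same #Model table and the @ItemID and @Model variables from the scripts above, compressing on write and decompressing on read would look like this (COMPRESS uses the Gzip algorithm, and DECOMPRESS returns VARBINARY(MAX)):

```sql
-- Store the compressed model instead of the raw serialized bytes
INSERT INTO #Model
(
	ItemID,
	Model
)
VALUES
	(@ItemID, COMPRESS(@Model));

-- Read it back, decompressing before handing it to R
SELECT
	@Model = DECOMPRESS(m.Model)
FROM #Model m
WHERE
	m.ItemID = @ItemID;
```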

Now that we have the model serialized, we need a slightly smarter deserialization function.
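The prediction script below expects a #InputData temp table holding the dates we want to predict, along with our feature estimates.  Here is one way to populate it; the shopper question and employee recommendation values are made-up estimates, and the print advertisement is fixed at 14 inches per day:

```sql
CREATE TABLE #InputData
(
	PredictionDate DATE,
	ShopperQuestions INT,
	EmployeeRecommendations INT,
	InchesOfPrintAdvertisement INT
);

INSERT INTO #InputData
(
	PredictionDate,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
)
VALUES
	('2018-01-01', 9, 6, 14),
	('2018-01-02', 8, 5, 14),
	('2018-01-03', 7, 6, 14),
	('2018-01-04', 9, 7, 14),
	('2018-01-05', 10, 8, 14),
	('2018-01-06', 11, 9, 14),
	('2018-01-07', 8, 6, 14);
```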

DECLARE
	@ItemID INT = 2,
	@Model VARBINARY(MAX);
	
SELECT
	@Model = Model
FROM #Model m
WHERE
	m.ItemID = @ItemID;	

EXEC sp_execute_external_script
	@language = N'R',
	@script = N'
PredictPoisson <- function(model, InputData) {
	lambda <- model
	predictions <- rpois(nrow(InputData), lambda)
	return(data.frame(InputData$PredictionDate, predictions))
}

PredictLinear <- function(model, InputData) {
	predictions <- round(predict(object = model, newdata = InputData))
	return(data.frame(InputData$PredictionDate, predictions))
}

model <- unserialize(Model)
if (class(model)[1] == "Poisson Distribution") {
	print("Using the Poisson Distribution")
	OutputDataSet <- PredictPoisson(model, InputData)
} else {
	print("Using a Linear Regression")
	OutputDataSet <- PredictLinear(model, InputData)
}',
	@input_data_1 = N'
SELECT
	PredictionDate,
	ShopperQuestions,
	EmployeeRecommendations,
	InchesOfPrintAdvertisement
FROM #InputData',
	@input_data_1_name = N'InputData',
	@params = N'@Model VARBINARY(MAX)',
	@Model = @Model
	WITH RESULT SETS
	(
		(
			PredictionDate DATE,
			NumberPredicted INT
		)
	);

Instead of passing in the prediction lookahead in days and the start date, let’s pass in a set of the days we want to predict.  For the linear regression, we will need to include a few more features, namely number of shopper questions, number of employee recommendations, and inches of print advertisement.  Let’s say that we have some way of estimating the first two (way to assume away the problem, me!) and we fix the print advertisement to 14″ per day because that’s our preferred print size.

When we run this, we get back the prediction dates and predicted values for our respective models.  Try the code with @ItemID = 1 and @ItemID = 2 and you can see that we do make use of the correct model for each item.

Conclusion

In today’s post, we expanded from using one type of model to two types of models.  We’re still focused on predicting one item at a time, though.  Next time, in the thrilling conclusion, we’ll see how to predict for both of these items at once.