This will be part one of a three-part series. During the course of this, we will look at building a simple model in R, extending our process to cover two models, and then making predictions on already-existing models. In this first post, I will build a simple model.
Telling A Story
Let’s say that you work at an upscale boutique which sells a few items. Your boss wants to figure out how many of each item to stock over the course of the week and has given you some data to play with. The first item on the list is a fancy watch. Here’s a table of how often that watch has sold over the past 30 days:
CREATE TABLE #ItemSales ( SaleDate DATE, NumberSold INT ); INSERT INTO #ItemSales ( SaleDate, NumberSold ) VALUES ('2017-12-01', 7), ('2017-12-02', 4), ('2017-12-03', 4), ('2017-12-04', 1), ('2017-12-05', 3), ('2017-12-06', 3), ('2017-12-07', 5), ('2017-12-08', 6), ('2017-12-09', 2), ('2017-12-10', 2), ('2017-12-11', 1), ('2017-12-12', 4), ('2017-12-13', 3), ('2017-12-14', 2), ('2017-12-15', 3), ('2017-12-16', 5), ('2017-12-17', 2), ('2017-12-18', 5), ('2017-12-19', 6), ('2017-12-20', 4), ('2017-12-21', 5), ('2017-12-22', 2), ('2017-12-23', 1), ('2017-12-24', 3), ('2017-12-25', 2), ('2017-12-26', 2), ('2017-12-27', 2), ('2017-12-28', 7), ('2017-12-29', 3), ('2017-12-30', 5);
Glancing at the numbers, we range from 1 to 7, so there’s not very much variance in day-to-day sales. We don’t get that much of a variance in our week to week sales either, with weekly sales counts of 27, 20, 30, and 19 over our four full weeks. Basically, we’re looking at selling somewhere around 3-4 items per day. We might be able to figure out what’s behind the differences in sales (like why we had 7 sales on the 28th of December but only 1 on the 23rd), but this is actually a pretty good case to use my favorite distribution.
The Poisson Distribution
I am a huge fan of the Poisson distribution. It is special in that its one parameter (lambda) represents both the mean and the variance of the distribution. At the limit, a Poisson distribution becomes normal. But it’s most useful in helping us pattern infrequently-occurring events. For example, selling 3-4 watches per day.
Estimating a Poisson is also easy in R: lambda is simply the mean of your sample counts, and there is a function called
rpois() which takes two parameters: the number of events you want to generate and the value of lambda.
So what I want to do is take my data from SQL Server, feed it into R, and return back a prediction for the next seven days.
Executing An External Stored Procedure
Here’s the stored procedure call. I’m passing in when I want the process to start predicting and the number of days for prediction, and I will get back a data frame with the date and our Poisson model’s prediction for the number of watches sold:
EXEC sp_execute_external_script @language = N'R', @script = N'lambda <- mean(Sales$NumberSold) predictions <- rpois(PredictionLookaheadInDays, lambda) dates <- format(as.POSIXlt(seq(from = StartDate, by = "day", length.out = PredictionLookaheadInDays))) OutputDataSet <- data.frame(dates, predictions) ', @input_data_1 = N'SELECT SaleDate, NumberSold FROM #ItemSales', @input_data_1_name = N'Sales', @params = N'@PredictionLookaheadInDays INT, @StartDate DATE', @PredictionLookaheadInDays = 7, @StartDate = '2017-01-01' WITH RESULT SETS ( ( SaleDate DATE, NumberPredicted INT ) );
I’d like to emphasize that this model is not very good for predicting how many watches we’ll sell in a day, but does a much better job at predicting how many we’ll sell in a week. We’re generating random numbers pulled from a Poisson distribution, so the longer the timeframe we look at, the closer our model will be to reality (assuming that the underlying distribution of reality in this case follows a Poisson with the same lambda).
This wraps up part one of the series. So far, we’ve generated some results following a Poisson. Next, I’m going to build a model that we can store in the database; that way, I don’t need to keep passing in all of my data.