In this post, I am going to look at using SQL Server 2016 temporal tables as slowly changing dimension tables. We will walk through the setup and then look at what I consider to be the biggest disappointment with temporal tables; I will then show a workaround for this issue.
Act 1: The Setup
Temporal tables were introduced in SQL Server 2016. They take a lot of the busy work out of creating history tables, handling things like data movement and building interfaces to query historical data for you. This makes them very interesting from a T-SQL developer’s perspective, as it potentially allows us to keep our primary table smaller while still retaining historical data for user query.
In my example, I am going to build a table to predict quantity sold for a series of products. In the real version of these tables, I have some fancy R code generating product models, and I store them in a table called ProductModel. I want to have the latest version of the model readily available to me, but I would also like to store historical models for a product; that way, if I see a persistent problem with the model, I can load an older version of the model from SQL Server back into R and analyze the model further.
For this post, I’m going to do something a bit simpler; I’ll create a ProductModel table with a product ID and a model number. This model number represents my R model and allows me to simplify the problem considerably by skipping the whole model creation part.
Here’s my ProductModel table:
CREATE TABLE [dbo].[ProductModel] ( ProductID INT NOT NULL, ModelNumber INT NOT NULL, StartDateGMT DATETIME2(3) GENERATED ALWAYS AS ROW START CONSTRAINT [DF_ProductModel_StartDateGMT] DEFAULT (SYSUTCDATETIME()), EndDateGMT DATETIME2(3) GENERATED ALWAYS AS ROW END CONSTRAINT [DF_ProductModel_EndDateGMT] DEFAULT(SYSUTCDATETIME()), PERIOD FOR SYSTEM_TIME (StartDateGMT, EndDateGMT), CONSTRAINT [PK_ProductModel] PRIMARY KEY CLUSTERED ( ProductID )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE) ) WITH(SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory)); GO
Note that I am creating a history table called
dbo.ProductModelHistory, and the valid date range for each model will be from StartDateGMT until EndDateGMT.
Next up, I want to create a table to store my predictions. In this table, I store the product ID, the date when I made the prediction, and how much I predicted we would sell of the product. Note that this table does not have a model number: the people who care about the predictions don’t have any idea what an R model is and don’t care how we store that model.
CREATE TABLE [dbo].[QuantitySoldPrediction] ( ProductID INT NOT NULL, DatePredictionMade DATE NOT NULL, PredictedQuantitySold INT NOT NULL, CONSTRAINT [PK_QuantitySoldPrediction] PRIMARY KEY CLUSTERED ( ProductID, DatePredictionMade )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE) ); GO
Now I want to populate these tables. Because I want to load historical data, I need to set system versioning off temporarily.
--Insert some historical data ALTER TABLE dbo.ProductModel SET (SYSTEM_VERSIONING = OFF); INSERT INTO dbo.ProductModelHistory ( ProductID, ModelNumber, StartDateGMT, EndDateGMT ) VALUES (1, 1, '2017-06-01', '2017-06-14'), (1, 2, '2017-06-14', '2017-06-28'), (1, 3, '2017-06-28', '2017-07-03 14:29:21'), (2, 1, '2017-07-01', '2017-07-04 08:00:00'); ALTER TABLE dbo.ProductModel SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory)); INSERT INTO dbo.ProductModel ( ProductID, ModelNumber ) VALUES (1, 4), (2, 2), (3, 1); --Insert predictions, some for the historical data and some for now. INSERT INTO dbo.QuantitySoldPrediction ( ProductID, DatePredictionMade, PredictedQuantitySold ) VALUES (1, '2017-06-02', 8), (1, '2017-06-15', 11), (1, '2017-06-19', 9), (1, '2017-06-29', 21), (1, '2017-07-02', 23), (2, '2017-07-02', 2), (2, '2017-07-03', 1), (1, GETUTCDATE(), 44), (2, GETUTCDATE(), 6), (3, GETUTCDATE(), 16);
At this point, we have the following:
- Three products in our ProductModel table
- Four historical ProductModel records. Note that there is a gap between historical end dates and the start dates in the ProductModel table; that’s okay for what we’re doing.
- Ten predictions, including seven historical predictions and three current predictions.
Act Two: The Falling Action
Now that I have my data, I want to do something simple: I want to show the model number associated with each prediction. This involves looking at the ProductModel and ProductModelHistory tables based on StartDateGMT.
The documentation for temporal tables shows how you can use
FOR SYSTEM_TIME AS OF to look at time frames. For example:
SELECT pm.ProductID, pm.ModelNumber, pm.StartDateGMT, pm.EndDateGMT FROM dbo.ProductModel FOR SYSTEM_TIME AS OF '2017-07-02 08:00:00' pm;
This brings us back the following result:
We can also use a variable, allowing us to look at what the table looked like at a user-defined point in time:
DECLARE @InterestingTime DATETIME = '2017-07-02 08:00:00'; SELECT pm.ProductID, pm.ModelNumber, pm.StartDateGMT, pm.EndDateGMT FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm;
This gives us back the same result. Thinking about this syntax, what I want to do is join to
dbo.QuantitySoldPrediction table. Let’s start simple:
DECLARE @InterestingTime DATETIME = '2017-07-02 08:00:00'; SELECT pm.ProductID, pm.ModelNumber, pm.StartDateGMT, pm.EndDateGMT FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm INNER JOIN dbo.QuantitySoldPrediction qsp ON pm.ProductID = qsp.ProductID;
This query succeeds but returns results we don’t really want:
This brings back all 9 records tied to products 1 and 2 (because product 3 didn’t exist on July 2nd at 8 AM UTC). But it gives us the same start and end date, so that’s not right. What I really want to do is replace
DatePredictionMade, so let’s try that:
This returns a syntax error. It would appear that at the time
FOR SYSTEM_TIME is resolved,
QuantitySoldPrediction does not yet exist. This stops us dead in our tracks.
Act Three: The Denouement
We don’t have an easy way of solving the problem, but we do have a complicated way. Under the covers,
dbo.ProductModelHistory is just a table. That means we can still query it like any other table. What we want is the latest single product model date that fits within each quantity sold prediction’s date. To solve this, we pull one of my favorite rabbits out of the hat: the APPLY operator.
SELECT q.ProductID, q.DatePredictionMade, q.PredictedQuantitySold, pm.ModelNumber FROM dbo.QuantitySoldPrediction q CROSS APPLY ( SELECT TOP(1) pmbuild.ModelNumber FROM ( SELECT pm.ModelNumber, pm.StartDateGMT FROM dbo.ProductModel pm WHERE pm.ProductID = q.ProductID UNION ALL SELECT TOP(1) pmh.ModelNumber, pmh.StartDateGMT FROM dbo.ProductModelHistory pmh WHERE pmh.ProductID = q.ProductID AND pmh.StartDateGMT <= q.DatePredictionMade ORDER BY pmh.StartDateGMT DESC ) pmbuild ORDER BY pmbuild.StartDateGMT ASC ) pm;
This query is quite a bit longer than I’d like, as I need to build a sub-query to union together current and historical data, and then grab the top row for each quantity sold prediction. On the plus side, the results are what I want:
This is one of two reasons that make me think that temporal tables do not make for great slowly changing dimension implementations. The other reason is that you can’t insert into the history table as long as you have
SYSTEM_VERSIONING on. These tables can be quite useful for storing fact table history, as the
FOR SYSTEM_TIME AS OF point-in-time snapshot makes a lot more sense in that case. And as we saw above, if you want to try hard enough when writing your queries to retrieve data, you can make them work. In my case, I had one query like this I needed to write, so the benefits still outweighed the costs.