Temporal Tables As SCD2 Tables

In this post, I am going to look at using SQL Server 2016 temporal tables as slowly changing dimension tables.  We will walk through the setup and then look at what I consider to be the biggest disappointment with temporal tables; I will then show a workaround for this issue.

Act 1:  The Setup

Temporal tables were introduced in SQL Server 2016.  They take a lot of the busy work out of creating history tables, handling things like data movement and building interfaces to query historical data for you.  This makes them very interesting from a T-SQL developer’s perspective, as it potentially allows us to keep our primary table smaller while still retaining historical data for user query.

In my example, I am going to build a table to predict quantity sold for a series of products.  In the real version of these tables, I have some fancy R code generating product models, and I store them in a table called ProductModel.  I want to have the latest version of the model readily available to me, but I would also like to store historical models for a product; that way, if I see a persistent problem with the model, I can load an older version of the model from SQL Server back into R and analyze the model further.

For this post, I’m going to do something a bit simpler; I’ll create a ProductModel table with a product ID and a model number.  This model number represents my R model and allows me to simplify the problem considerably by skipping the whole model creation part.

Here’s my ProductModel table:

CREATE TABLE [dbo].[ProductModel]
(
	ProductID INT NOT NULL,
	ModelNumber INT NOT NULL,
	StartDateGMT DATETIME2(3) GENERATED ALWAYS AS ROW START CONSTRAINT [DF_ProductModel_StartDateGMT] DEFAULT (SYSUTCDATETIME()),
	EndDateGMT DATETIME2(3) GENERATED ALWAYS AS ROW END CONSTRAINT [DF_ProductModel_EndDateGMT] DEFAULT(SYSUTCDATETIME()),
	PERIOD FOR SYSTEM_TIME (StartDateGMT, EndDateGMT),
	CONSTRAINT [PK_ProductModel] PRIMARY KEY CLUSTERED
	(
		ProductID
	)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE)
) WITH(SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory));
GO

Note that I am creating a history table called dbo.ProductModelHistory, and the valid date range for each model will be from StartDateGMT until EndDateGMT.

Next up, I want to create a table to store my predictions.  In this table, I store the product ID, the date when I made the prediction, and how much I predicted we would sell of the product.  Note that this table does not have a model number:  the people who care about the predictions don’t have any idea what an R model is and don’t care how we store that model.

CREATE TABLE [dbo].[QuantitySoldPrediction]
(
	ProductID INT NOT NULL,
	DatePredictionMade DATE NOT NULL,
	PredictedQuantitySold INT NOT NULL,
	CONSTRAINT [PK_QuantitySoldPrediction] PRIMARY KEY CLUSTERED
	(
		ProductID,
		DatePredictionMade
	)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE)
);
GO

Now I want to populate these tables. Because I want to load historical data, I need to set system versioning off temporarily.

--Insert some historical data
ALTER TABLE dbo.ProductModel SET (SYSTEM_VERSIONING = OFF);

INSERT INTO dbo.ProductModelHistory
(
	ProductID,
	ModelNumber,
	StartDateGMT,
	EndDateGMT
)
VALUES
(1, 1, '2017-06-01', '2017-06-14'),
(1, 2, '2017-06-14', '2017-06-28'),
(1, 3, '2017-06-28', '2017-07-03 14:29:21'),
(2, 1, '2017-07-01', '2017-07-04 08:00:00');

ALTER TABLE dbo.ProductModel SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory));

INSERT INTO dbo.ProductModel
(
	ProductID,
	ModelNumber
)
VALUES
(1, 4),
(2, 2),
(3, 1);

--Insert predictions, some for the historical data and some for now.
INSERT INTO dbo.QuantitySoldPrediction
(
	ProductID,
	DatePredictionMade,
	PredictedQuantitySold
)
VALUES
(1, '2017-06-02', 8),
(1, '2017-06-15', 11),
(1, '2017-06-19', 9),
(1, '2017-06-29', 21),
(1, '2017-07-02', 23),
(2, '2017-07-02', 2),
(2, '2017-07-03', 1),
(1, GETUTCDATE(), 44),
(2, GETUTCDATE(), 6),
(3, GETUTCDATE(), 16);

At this point, we have the following:

  • Three products in our ProductModel table
  • Four historical ProductModel records.  Note that there is a gap between historical end dates and the start dates in the ProductModel table; that’s okay for what we’re doing.
  • Ten predictions, including seven historical predictions and three current predictions.

Act Two:  The Falling Action

Now that I have my data, I want to do something simple:  I want to show the model number associated with each prediction.  This involves looking at the ProductModel and ProductModelHistory tables based on StartDateGMT.

The documentation for temporal tables shows how you can use FOR SYSTEM_TIME AS OF to look at time frames.  For example:

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF '2017-07-02 08:00:00' pm;

This brings us back the following result:

ProductModelTemporal

We can also use a variable, allowing us to look at what the table looked like at a user-defined point in time:

DECLARE
	@InterestingTime DATETIME = '2017-07-02 08:00:00';

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm;

This gives us back the same result. Thinking about this syntax, what I want to do is join to dbo.QuantitySoldPrediction table. Let’s start simple:

DECLARE
	@InterestingTime DATETIME = '2017-07-02 08:00:00';

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm
	INNER JOIN dbo.QuantitySoldPrediction qsp
		ON pm.ProductID = qsp.ProductID;

This query succeeds but returns results we don’t really want:

ProductModelTemporalSameDate

This brings back all 9 records tied to products 1 and 2 (because product 3 didn’t exist on July 2nd at 8 AM UTC). But it gives us the same start and end date, so that’s not right. What I really want to do is replace @InterestingTime with qsp‘s DatePredictionMade, so let’s try that:

ProductModelTemporalInvalid

This returns a syntax error. It would appear that at the time FOR SYSTEM_TIME is resolved, QuantitySoldPrediction does not yet exist. This stops us dead in our tracks.

Act Three: The Denouement

We don’t have an easy way of solving the problem, but we do have a complicated way. Under the covers, dbo.ProductModelHistory is just a table. That means we can still query it like any other table. What we want is the latest single product model date that fits within each quantity sold prediction’s date. To solve this, we pull one of my favorite rabbits out of the hat: the APPLY operator.

SELECT
	q.ProductID,
	q.DatePredictionMade,
	q.PredictedQuantitySold,
	pm.ModelNumber
FROM dbo.QuantitySoldPrediction q
	CROSS APPLY
	(
		SELECT TOP(1)
			pmbuild.ModelNumber
		FROM
		(
			SELECT
				pm.ModelNumber,
				pm.StartDateGMT
			FROM dbo.ProductModel pm
			WHERE
				pm.ProductID = q.ProductID

			UNION ALL

			SELECT TOP(1)
				pmh.ModelNumber,
				pmh.StartDateGMT
			FROM dbo.ProductModelHistory pmh
			WHERE
				pmh.ProductID = q.ProductID
				AND pmh.StartDateGMT <= q.DatePredictionMade
			ORDER BY
				pmh.StartDateGMT DESC
		) pmbuild
		ORDER BY
			pmbuild.StartDateGMT ASC
	) pm;

This query is quite a bit longer than I’d like, as I need to build a sub-query to union together current and historical data, and then grab the top row for each quantity sold prediction. On the plus side, the results are what I want:

ProductModelTemporalVictory

This is one of two reasons that make me think that temporal tables do not make for great slowly changing dimension implementations.  The other reason is that you can’t insert into the history table as long as you have SYSTEM_VERSIONING on.  These tables can be quite useful for storing fact table history, as the FOR SYSTEM_TIME AS OF point-in-time snapshot makes a lot more sense in that case.  And as we saw above, if you want to try hard enough when writing your queries to retrieve data, you can make them work.  In my case, I had one query like this I needed to write, so the benefits still outweighed the costs.

Advertisements

2 thoughts on “Temporal Tables As SCD2 Tables

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s