Compress Those Models

I’m currently working on a project to generate a model in SQL Server R Services.  There are some nice demos on how to store a model (like this one that Microsoft has involving ski resort room rental), but there’s one thing that most of these leave out:  scale.

When you serialize a model to the database, it’s going to take up some amount of space.  In our case, we’re using a few different types of models.  Neural nets are relatively svelte, at roughly 40K per model.  On the other end, decision trees can take 2-4 MB per model.  When you’re storing a few models in SQL Server, size isn’t generally much of an issue.  Even if you’re using temporal tables to store model history, it’s probably not a big problem.  But when you’re storing a lot of data, it’s a big problem.

How much is a lot?  Well, let’s start with this:


These are some numbers for an early beta.  We have 91,130 products modeled and a model history of 194,830 products.  Each model, on its own, averages out to 177K in our ProductModel table and 212K in ProductModelHistory.  As mentioned, different types of models take up different amounts of space (and I didn’t elaborate upon all the types of models we’re using), so we wouldn’t expect the aggregates to be exactly the same.  I included a contrast:  we’ve generated 16.4 million quantity sold predictions during this time, but the total disk space is under 1 GB and 2/3 of that is a unique nonclustered index.  The 372MB is columnstore.  I love columnstore…


So right now, we’re burning roughly 200K per model.  My stated goal is to be able to store several years worth of data for 10 million products.  Let’s say that I need 10 million products in ProductModel and 1 billion rows in ProductModelHistory.  That means that we’d end up with 1.86 TB of data in the ProductModel table and 186 TB in ProductModelHistory.  This seems…excessive.

As a result, I decided to try using the COMPRESS() function in SQL Server 2016.  The COMPRESS function simply uses GZip compression.  Yeah, there are compression algorithms which tend to be more compact (e.g., bz2 or 7z), but GZip is relatively CPU efficient and I can wrap my SQL statements with COMPRESS() and DECOMPRESS() and not have to change any calling code; I just need to update the two stored procedures I use to insert and then retrieve product models.

With compression in place, my data on disk looks much less unreasonable:


That’s a drop from 200K per model down to about 45K per model, or an overall 4.7x compression ratio.  I noticed that the type of model was instrumental in the amount of compression.  Our neural nets compressed roughly 2x, going from approximately 40K per model down to 20K per model.  Meanwhile, those decision trees went from 2-4MB down to about 200K per model.

At 45K per model, storing 10 million models is about 429 GB, and 1 billion rows is just shy of 43 TB.  For my needs, that saves approximately 140 TB of disk space.  Your mileage will probably vary, but if you’re generating a large number of models, think about compression.

Feasel’s Law

I’ve had this idea on my mind for a while and figured I’d might as well write it down…

Feasel’s Law – Any sufficiently advanced data retrieval process will eventually have a SQL interface.

This is a bit tongue-in-cheek, but it plays out pretty well; typically, a major sign of database maturity is its reimplementation (at least in part) of SQL.  Examples:

  • Hadoop started out as writing map-reduce jobs in Java.  Yahoo then created Pig to work with ETL processes.  After that, Facebook created Hive, an implementation of SQL for Hadoop.
  • Spark started out using Java and Scala, and then Python.  Not too long after, we had Shark and then Spark SQL.
  • HBase is a NoSQL database on top of Hadoop.  Now we have Phoenix, which is a SQL interface to HBase.
  • The primary methods of operation with Cassandra are SQL-like statements.
  • Riak TS has a SQL interface.  Incidentally, I love the author’s comment that SQL isn’t dead yet…
  • The easiest way to access CosmosDB data?  SQL.  That interface isn’t fully baked yet—it doesn’t allow grouping, for example—but it’s getting there.

One of the few strong counter-examples is MongoDB, which doesn’t have a SQL interface but does have a SQL translation guide.  DynamoDB also does not offer a SQL interface, though there are third-party interfaces and management tools which give you the same effect.

Otherwise, if you’re using a database that was created after 1990, has any significant user base, and is a mature platform, chances are it’s got a native SQL interface.

Marketplaces In Computing, Part 2

Last time around, I looked at the tragedy of the commons as it relates to server resources.  Today, I want to introduce two patterns for optimal resource utilization given different sets of actors with independent goals.

The Microservice Solution


The first use pattern is the easier one:  remove the contention by having independent teams own their own hardware.  Each team gets its own servers, its own databases, and its own disk and network capacity.  Teams develop on their own servers and communicate via proscribed APIs.  Teams have budgets to purchase hardware and have a couple of options:  they can rent servers from a cloud provider like AWS or Azure, they could rent hardware from an internal operations team, or they could purchase and operate their own hardware.  Most likely, organizations won’t like option #3, so teams are probably looking at some mixture of options 1 and 2.


If a team depends upon another team’s work—for example, suppose that we have a core service like inventory lookup, and other teams call the inventory lookup service—then the team owning the service has incentives to prevent system over-subscription.  That team will be much more likely to throttle API calls, improve call efficiency by optimizing queries or introducing better caching, or find some other way to make best use of their limited resources.

This situation removes the commons by isolating teams onto their own resources and making the teams responsible for these resources.


There are a couple downsides to this arrangement.  First, doing your work through APIs could be a major latency and resource drag.  In a not-so-extreme case, think about two tables within a common service boundary.  The ProductInventory table tells us the quantity of each product we sell.  The ProductAdvertisement table tells us which products we are currently advertising.  Suppose we want to do a simple join of Product to ProductAdvertisement to ProductInventory and get the products where we have fewer than 5 in stock and are paying at least $1 per click on our digital marketing campaign.

In a standard SQL query running against a modern relational database, the query optimizer use statistics to figure out which table to start from:  ProductAdvertisement or ProductInventory.  If we have relatively few products with fewer than 5 in stock, we drive from ProductInventory.  If we have relatively few advertisements with a $1 CPC, we drive from ProductAdvertisement.  If neither of those is sensitive on its own but the inner join of those two is tiny, the optimizer might start with ProductInventory and join to ProductAdvertisement first, taking the relatively small number of results and joining to Product to flesh out those product details.  Regardless of which is the case, the important thing here is that the database engine has the opportunity to optimize this query because the database engine has direct access to all three tables.

If we put up an API guarding ProductInventory, we can now run into an issue where our advertising team needs to drive from the ProductAdvertisement table.  Suppose that there are tens of thousands of products with at least $1 CPC but only a few with fewer than 5 in stock.  We don’t know which products have fewer than 5 in stock and unless the inventory API has a “get all products with fewer than X in stock” method, we’re going to need to get the full list of products with ads with $1+ CPC and work our way through the inventory API until we find those products with fewer than 5 in stock.  So instead of a single, reasonably efficient database query, we could end up with upwards of tens of thousands of database lookups.  Each lookup on its own is efficient but the sum total is rather inefficient in comparison, both in terms of the amount of time spent pushing data for hundreds of thousands of products across the network and then filtering in code (or another database) and in terms of total resource usage.

The other problem, aside from resource latency, is that the throttling team does not always have the best signals available to determine what to throttle and why.  If our inventory team sees the advertising team slamming the inventory API, the inventory team may institute code which starts returning 429 Too Many Requests response codes, forcing the advertising team to slow down their calls.  This fixes the over-subscription problem but this might be a chainsaw solution to a scalpel problem.  In other words, suppose the advertisements team has two different operations:  a low-value operation with a number of requests, and a high-value operation with a number of requests.  The inventory team doesn’t necessarily know which operation is which, so without coordination between the teams, the inventory team might accidentally block high-value operations while letting low-value operations through.  Or they may cooperate and block low-value operations, but do so much blocking that they starve the low-value operation instead of simply throttling it back.  Neither of those answers is great.

The Price Signal

Instead of having different teams own their own hardware and try to live in silos, my preferred solution is to institute price signals.  What follows is an ideal (some might say overkill) implementation.

In this setup, Operations owns the servers.  Like the farmer in my previous example, ops teams want their property to remain as useful as possible.  Operations wants to make their servers enticing and prohibit over-subscription.  To do this, they price resource utilization.  On a server with a very low load, teams can use the server for pennies per hour; when the server is at 100% CPU utilization, it might spike to $20 or $30 per hour to use that server.  There are three important factors here:

  1. All teams have real-time (or very close to real-time) knowledge of the spot price of each server.
  2. Operations may set the price as they see fit.  That might lead to out-of-equilibrium prices, but there’s a factor that counteracts that quite well:
  3. The prices are actual dollar prices.

Operations is no longer a cost center within the company; it now has the opportunity to charge other teams for resource utilization.  If Operations does a good job keeping servers running efficiently and prices their products well, they end up earning the money to expand and improve; if they struggle, they shape up or ship out.  That’s because teams have alternatives.

Viable Alternatives

Suppose the Operations team can’t manage a server to save their lives.  Product teams are free to go back to options #1 or #3:  they can use cloud offerings and set up their own services there, or they can purchase their own hardware and build internal operations teams within the product team.  These real forms of competition force the Operations team to perform at least well enough to keep their customers.  If I’m going to pay more for servers from Operations than I am from AWS, I had better get something in return.  Sure, lock-in value is real and will play a role in keeping me on the Operations servers, but ops needs to provide additional value:  lower-latency connections, the ability to perform queries without going through APIs (one of the big downsides to the microservices issue above), people on staff when things go bump in the night, etc.

These viable alternatives will keep the prices that Operations charge fairly reasonable; if they’re charging $3 per gigabyte of storage per month, I’ll laugh at them and store in S3 or Azure Blob Storage for a hundredth of the price.  If they offer 5 cents per gigabyte per month on local flash storage arrays, I’ll play ball.

Taking Pricing Into Account

FA Hayek explained the importance of price as a signal in his paper, The Use of Knowledge in Society.  Price is the mechanism people use to share disparate, situational, and sometimes contradictory or at least paradoxical information that cannot otherwise be aggregated or collected in a reasonable time or in a reasonable fashion.  Tying this work to our situation, we want to explain resource utilization to disparate teams at various points in time.  We can return a bunch of numbers back and hope for the best, but if I tell the inventory team that they’re using 48% of a database server’s CPU resources and that the current CPU utilization is 89%, what does that mean?  Does that mean they can increase their load?  That they should decrease their load?  That things are doing just fine?

By contrast, we tell the inventory team that right now, the spot price of this server is $5 per CPU hour, when the normal price is 20 cents per CPU hour.  This is a signal that the server is under heavy load and maybe I should cut back on those giant warehouse-style queries burning up 16 cores.

When teams know that the price has jumped like this, they now have a number of options available:

  1. Prioritize resource utilization.  Are there any low-priority operations going on right now?  Maybe it’d be wise to reschedule those for later, when the server is cheap again.
  2. Refocus efforts elsewhere.  If there’s a server which regularly gets too expensive, maybe it’d be wise to relocate to someplace else, where the price is cheaper.  This can spread the load among servers and make resource utilization more efficient.  As a reminder, that server doesn’t need to be on-prem where Operations owns it; it’s a pretty big world out there with plenty of groups willing to rent some space.
  3. Tune expensive operations.  When dollars and cents come into play, it’s easy to go to a product owner with an ROI.  If my advertising team just got hit with a $5000 bill for a month’s worth of processing on this server and I know I can cut resource usage down to a tenth, I’m saving $4500 per month.  If my next-best alternative does not bring my team at least that much (plus the incremental cost of resource utilization) in revenue, it’s a lot harder for product owners to keep engineering teams from doing tech debt cleanup and resource optimization work.
  4. Burn the money.  Sometimes, a team just has to take it on the chin; all of this work is important and the team needs to get that work done.

Getting There From Here

Okay, so now that I’ve spent some time going over what the end game looks like, how do we possibly get there from here?  I’ll assume that “here” is like most companies I’ve worked at:  there’s a fairly limited understanding of what’s causing server heartache and a limited amount of responsibility that product teams take.

Here are the steps as I see them:

  1. Implement resource tracking.  Start with resource tracking as a whole if you don’t already have it.  Cover per-minute (or per some other time period) measures of CPU load, memory utilization, disk queue length, network bandwidth, and disk utilization.  Once those are in place, start getting resource usage by team.  In SQL Server, that might mean tagging by application name.
  2. Figure out pricing.  Without solid knowledge of exactly where to begin, there are still two pieces of interesting information:  what other suppliers are charging and break-even cost for staff + servers + whatnot.  Unless you’ve got junk servers and terrible ops staff, you should be able to charge at least a little bit more than AWS/Azure/Google/whatever server prices.  And if your ops team is any good, you can charge a good bit more because you’re doing things like waking up at night when the server falls over.
  3. Figure out budgeting.  This is something that has to come from the top, and it has to make sense.  Your higher-value teams probably will get bigger budgets.  You may not know at the start what the budgets “should” be for teams, but at least you can start with current resource utilization shares.
  4. Figure out the market.  You’ll need an API to show current server price.  Teams can call the API and figure out the current rate.  Ideally, you’re also tracking per-team utilization and pricing like Azure or AWS does to limit surprise.

Once this is in place, it gives teams a way of throttling their own utilization.  There’s still a chance for over-subscription, though, so let’s talk about one more strategy:  auctions.


Thus far, we’ve talked about this option as a specific decision that teams make.  But when it comes to automated processes, companies like Google have proven that auctions work best.  In the Google advertisement example, there is a limited resource—the number of advertisement slots on a particular search’s results—and different people compete for those slots.  They compete by setting a maximum cost per click, and Google takes that (plus some other factors) and builds a ranking, giving the highest score the best slot, followed by the next-highest score, etc. until all the slots are filled.

So let’s apply that to our circumstance here.  Instead of simply having teams work through their resource capacity issues—a long-term solution but one which requires human intervention—we could auction off resource utilization.  Suppose the current spot price for a server is 5 cents per CPU hour because there’s little load.  Each team has an auction price for each process—maybe we’re willing to pay $10 per CPU hour for the most critical requests, followed by $1 per hour for our mid-importance requests, followed by 50 cents per hour for our least-important requests.  Other teams have their requests, too, and somebody comes in and floods the server with requests.  As resource utilization spikes, the cost of the server jumps up to 75 cents per CPU hour, and our 50-cent job stops automatically.  It jumps again to $4 per CPU hour and our $1 job shuts off.

That other team is running their stuff for a really long time, long enough that it’s important to run the mid-importance request.  Our team’s internal API knows this and therefore automatically sets the bid rate up to $5 temporarily, setting it back down to $1 once we’ve done enough work to satisfy our team’s processing requirements.

Implementing this strategy requires a bit more sophistication, as well as an understanding on the part of the product teams of what happens when the spot price goes above the auction price—that jobs can stop, and it’s up to product teams to spin them down nicely.

Another Spin:  Funbucks

Okay, so most companies don’t like the idea of giving product teams cash and having them transfer real money to an Operations team.  So instead of doing this, you can still have a market system.  It isn’t quite as powerful because there are fewer options available—you might not be able to threaten abandoning the current set of servers for AWS—but it can still work.  Each team still has a budget, but the budget is in an internal scrip.  If you run out of that internal scrip, it’s up to higher-ups to step in.  This makes it a weaker solution, but still workable.

So, You’re Not Serious…Right?

Of course I am, doubting title.  I’m so serious that I’ll even point out cases where what I’m talking about is already in place!

First of all, AWS offers spot pricing on EC2 instances.  These prices tend to be less than the on-demand price and can be a great deal for companies which can afford to run processes at off hours.  You can write code to check the spot price and, if the spot price is low enough, snag an instance and do some work.

As a great example of this, Databricks offers their Community Edition free of charge and uses AWS spot instances to host these.  That keeps prices down for Databricks because they have a hard cap on how high they’re willing to go—I’ve had cases where I’ve tried to spin up a CE cluster and gotten a failure indicating that the spot price was too high and that I should try again later.

For the graybeards in the audience, you’ll also appreciate this next example:  mainframe time slicing.  This was a common strategy for pricing computer utilization and is very similar to what I’m describing.


We’ve spent the past couple of days looking at how development teams can end up in a tragedy of the commons, and different techniques we can use to extricate ourselves from it.  The main purpose of these posts is to show that there are several options available, including creating markets internally.  We still haven’t talked about agorics yet, but let’s save that for another day.

Marketplaces In Computing, Part 1

I tend to see resource over-subscription problems frequently at work.  We have a set of product teams, and each team has a manager, a product owner, and a set of employees.  These teams share computing resources, though:  they use the same servers, access the same databases, and use the same networks.  This leads to a tragedy of the commons scenario.

Tragedy of the Commons

The tragedy of the commons is a classic concept in economics:  the basic idea is that there exists some common area accessible to all and owned (directly) by none.  Each member of a group is free to make use of this common area.  Let’s get a little more specific and talk about a common grazing field that a group of shepherd share.  Each shepherd has a total desired utilization of the field:  shepherd A has x sheep which will consume f(x) grass, where 0 <= f(x) <= 1.  Shepherd B has y sheep consuming f(y) grass, and shepherd C has z sheep, consuming f(z) grass.

If f(x) + f(y) + f(z) < 1, then the three shepherds can share the commons.  But let’s say that f(x) = 0.5, f(y) = 0.4, and f(z) = 0.3.  That adds up to 1.2, but we only have enough grass for 1.0.  This means that at least one of the three shepherds will end up losing out on this arrangement—if shepherds A and B get there early, their sheep are going to consume 90% of the available vegetation, leaving shepherd C with 10% instead of his needed 30%.

The tragedy of the commons goes one step further:  because there is overgrazing on this land, eventually the vegetation dies out and the land goes fallow for some time, leaving the shepherds without that place to graze.

There are different spins on the tragedy of the commons, but this is the scenario I want to walk through today.

Solutions to the Tragedy

There are three common solutions to a tragedy of the commons scenario:  private ownership, a governing body, or commons rules.

Private Ownership

This is the easiest solution to understand.  Instead of the three shepherds grazing freely on the commons, suppose that farmer D purchases the land with the intent of subletting it for grazing.  Farmer D will then charge the shepherds some amount to graze on the land, with the goal being to earn as much as he can from use of the land.  Right now, the shepherds are oversubscribed to the land at a cost of 0.  Once the farmer starts charging the shepherds, then we see movement.

As a simple example (without loss of generality), let’s say that the farmer knows that he can’t allow more than 80% of the land to be grazed; if he goes above that mark, we get into overgrazing territory and run the risk of the land going fallow for some time.  Thus, the farmer wants to set prices such that no more than 80% of the land gets grazed and the farmer maximizes his profits.

We won’t set up the equilibrium equation here, but if you really want to go down that route, you’re welcome to.  Let’s say that the equilibrium price is $10 per acre, and at $10 per acre, shepherd B realizes that he can go graze somewhere else for less.  That leaves shepherds A and C, whose total use adds up to 80%.  As an aside, we could just as easily have imagined a scenario where all three shepherds still use the plot but at least one uses less than the f(x, $0) amount would indicate, so that the sum was still somewhere around 80% utilization.

Central Planning

In contrast to the private property solution, central planning involves an individual or committee laying down edicts on use patterns.  In this case, we have a centralplanner telling each shepherd how much of the land he is allowed to use.  We can work from the assumption that the central owner also knows that overgrazing happens at more than 80% of utilization, so the central planner won’t allow shepherds to graze beyond this.  How specifically the central planner allocates the resource is beyond the scope of discussion here, as are important issues like public choice theory.

Commons Rules

The third method of commons resource planning comes from Elinor Ostrom’s work (with this being a pretty good summary in a short blog post).  The gist of it is that local groups tend to formulate rules for behavior on the commons.  This is group decision making in an evolutionary context, in contrast to spontaneous order (the finding of equilibrium using market prices) or enforced order (the central planner making choices).

All three of these mechanisms work in different contexts at different levels.  Historically, much of the literature has argued in favor of centralization closer to the point of decision and decentralization further out—for example, look at Coase’s The Nature of the Firm.  In his work, Coase works to explain why firms exist, and his argument is that it is more efficient for firms to internalize some transaction costs.  His work also tries to explain when firms tend to be hierarchical in nature (closer to central planning) versus market-oriented or commons-oriented—though a careful reading of Coase shows that even within a firm, spontaneous orders can emerge and can be efficient…but now I’m going down a completely different track.

Let’s Get Back To Computers

Okay, so we have three working theories of how to manage resources in a commons scenario.  Let’s shift our commons scenario around just a little bit:  now our common resource is a database server.  We have a certain number of CPU cores, a certain amount of RAM, a certain amount of disk, a certain disk throughput, and a certain network bandwidth.  Instead of shepherds grazing, we have product teams using the database.

Each product team makes use of the database server differently, where product team A makes x calls and uses f(x) resources, product team B makes y calls and uses f(y) resources, and product team C makes z calls and uses f(z) resources, again, where 0 <= f(x) / f(y) / f(z) <= 1.  We hope that f(x) + f(y) + f(z) < 1; if that’s the case, we don’t have a resource allocation problem:  nobody has saturated the database server and all is well.  But realistically, most of us end up with resource contention, where every team pushes more and more product out and eventually team desires overwhelm server resources.  In this case, the equivalent of overgrazing is resource saturation, which leads to performance degradation across the board:  when CPU is at 100% or disk queue keeps growing and growing, everybody suffers, not just the marginal over-user.

As a quick note, “1” in this case is a composite measure of multiple resources:  CPU, memory, disk capacity, disk throughput, and network throughput.  More realistically, f(x) describes a vector, each element of which corresponds to a specific resource and each element lies between 0 and 1.  For the sake of simplicity, I’m just going to refer to this as one number, but it’s certainly possible for a team to kill a server’s CPU while hardly touching disk (ask me how I know).

Thinking About Implications

In my experience, there are two factors which have led to servers being increasingly more like a commons:  Operations’s loss of power and Agile product teams.

Gone are the days when an Operations team had a serious chance of spiking a project due to limited resources.  Instead, they’re expected to make do with what they have and find ways to say yes.  Maybe it’s always been like this—any curmudgeonly sysadmins want to chime in?—but it definitely seem to be the case now.

On the other side, the rise of Agile product teams has turned server resources into a commons problem for two reasons.  First, far too many people interpret “Agile” as “I don’t need to plan ahead.”  So they don’t plan ahead, instead just assuming that resource planning is some combination of premature optimization and something to worry about when we know we have a problem.

Second, Agile teams tend to have their own management structure:  teams run on their own with their own product owners, their own tech leads, and their own hierarchies and practices.  Those tech leads report to managers in IT/Engineering, sure, but they tend to be fairly independent.  There often isn’t a layer above the Agile team circumscribing their resource utilization, and the incentives are stacked for teams to behave with the servers like our example shepherds did with the common grazing field:  they focus on their own needs, often to the exclusion of the other shepherds and often to everybody’s detriment.

Thinking About Solutions

Given the server resource problem above, how would each of the three solutions we’ve covered apply to this problem?  Let’s look at them one by one, but in a slightly different order.

Central Planning

This is the easiest answer to understand.  Here, you have an architect or top-level manager or somebody with authority to determine who gets what share of the resources.  When there’s resource contention, the central planner says what’s less important and what needs to run.  The central planner often has the authority to tell product teams what to do, including having different teams spend sprint time to make their applications more efficient.  This is the first solution people tend to think about, and for good reason:  it’s what we’re used to, and it works reasonably well in many cases.

The problem with this solution (again, excluding public choice issues, incomplete information issues, and a litany of other, similar issues central planning has) is that typically, it’s management by crisis.  Lower-level teams act until they end up causing an issue, at which point the central planner steps in to determine who has a better claim on the resources.  Sometimes these people proactively monitor and allocate resources, but typically, they have other, more important things to do and if the server’s not on fire, well, something else is, so this can wait.

Commons Rules

In the “rules of the commons” solution, the different product teams work together with Operations to come up with rules of the game.  The different teams can figure out what to do when there is resource over-subscription, and there are built-in mechanisms to reduce over-utilization.  For example, Operations may put different teams into different resource groups, throttling the biggest users during a high-utilization moment.  Operations can also publish resource utilization by team, showing which teams are causing problems.

Let’s give two concrete examples of this from work.  First, each team a separate application name that they use in SQL Server connection strings.  That way, our Database Operations group can use Resource Governor to throttle CPU, memory, and I/O utilization by team.  This reduces some of the contention in some circumstances, but it doesn’t solve all of the issues:  assuming that there is no cheating on the parts of teams, there are still parts of the application we don’t want to throttle—particularly, those parts of the application which drive the user interface.  It’s okay to throttle back-end batch operations, but if we’re slowing down the user experience, we are in for trouble.

Another example of how these rules can work is in Splunk logging.  We have a limited amount of logging we can do to Splunk per day, based on our license.  Every day, there is a report which shows how much of our Splunk logging we’ve used up in the prior day, broken down by team.  That way, if we go over the limit, the teams with the biggest logs know who they are and can figure out what’s going on.

The downside to commons rules is similar to that of the central planner:  teams don’t think about resource utilization until it’s too late and we’re struggling under load.  Sure, Operations can name and shame, but the incentives are stacked in such a way that nobody cares until there’s a crisis.  If you’re well under your logging limit, why work on minimizing logging when there are so many other things on the backlog?

Next Time:  The Price Signal

Despite the downsides I listed, the two systems above are not failures when it comes to IT/Engineering organizations.  They are both viable options, and I’m not saying to ditch what you currently do.  In the next post, I will cover the price option and spend an inordinate number of words writing about marketplaces in computing.

Temporal Tables As SCD2 Tables

In this post, I am going to look at using SQL Server 2016 temporal tables as slowly changing dimension tables.  We will walk through the setup and then look at what I consider to be the biggest disappointment with temporal tables; I will then show a workaround for this issue.

Act 1:  The Setup

Temporal tables were introduced in SQL Server 2016.  They take a lot of the busy work out of creating history tables, handling things like data movement and building interfaces to query historical data for you.  This makes them very interesting from a T-SQL developer’s perspective, as it potentially allows us to keep our primary table smaller while still retaining historical data for user query.

In my example, I am going to build a table to predict quantity sold for a series of products.  In the real version of these tables, I have some fancy R code generating product models, and I store them in a table called ProductModel.  I want to have the latest version of the model readily available to me, but I would also like to store historical models for a product; that way, if I see a persistent problem with the model, I can load an older version of the model from SQL Server back into R and analyze the model further.

For this post, I’m going to do something a bit simpler; I’ll create a ProductModel table with a product ID and a model number.  This model number represents my R model and allows me to simplify the problem considerably by skipping the whole model creation part.

Here’s my ProductModel table:

CREATE TABLE [dbo].[ProductModel]
	ModelNumber INT NOT NULL,

Note that I am creating a history table called dbo.ProductModelHistory, and the valid date range for each model will be from StartDateGMT until EndDateGMT.

Next up, I want to create a table to store my predictions.  In this table, I store the product ID, the date when I made the prediction, and how much I predicted we would sell of the product.  Note that this table does not have a model number:  the people who care about the predictions don’t have any idea what an R model is and don’t care how we store that model.

CREATE TABLE [dbo].[QuantitySoldPrediction]
	DatePredictionMade DATE NOT NULL,
	PredictedQuantitySold INT NOT NULL,

Now I want to populate these tables. Because I want to load historical data, I need to set system versioning off temporarily.

--Insert some historical data

INSERT INTO dbo.ProductModelHistory
(1, 1, '2017-06-01', '2017-06-14'),
(1, 2, '2017-06-14', '2017-06-28'),
(1, 3, '2017-06-28', '2017-07-03 14:29:21'),
(2, 1, '2017-07-01', '2017-07-04 08:00:00');

ALTER TABLE dbo.ProductModel SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory));

INSERT INTO dbo.ProductModel
(1, 4),
(2, 2),
(3, 1);

--Insert predictions, some for the historical data and some for now.
INSERT INTO dbo.QuantitySoldPrediction
(1, '2017-06-02', 8),
(1, '2017-06-15', 11),
(1, '2017-06-19', 9),
(1, '2017-06-29', 21),
(1, '2017-07-02', 23),
(2, '2017-07-02', 2),
(2, '2017-07-03', 1),
(1, GETUTCDATE(), 44),
(2, GETUTCDATE(), 6),
(3, GETUTCDATE(), 16);

At this point, we have the following:

  • Three products in our ProductModel table
  • Four historical ProductModel records.  Note that there is a gap between historical end dates and the start dates in the ProductModel table; that’s okay for what we’re doing.
  • Ten predictions, including seven historical predictions and three current predictions.

Act Two:  The Falling Action

Now that I have my data, I want to do something simple:  I want to show the model number associated with each prediction.  This involves looking at the ProductModel and ProductModelHistory tables based on StartDateGMT.

The documentation for temporal tables shows how you can use FOR SYSTEM_TIME AS OF to look at time frames.  For example:

FROM dbo.ProductModel FOR SYSTEM_TIME AS OF '2017-07-02 08:00:00' pm;

This brings us back the following result:


We can also use a variable, allowing us to look at what the table looked like at a user-defined point in time:

	@InterestingTime DATETIME = '2017-07-02 08:00:00';

FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm;

This gives us back the same result. Thinking about this syntax, what I want to do is join to dbo.QuantitySoldPrediction table. Let’s start simple:

	@InterestingTime DATETIME = '2017-07-02 08:00:00';

FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm
	INNER JOIN dbo.QuantitySoldPrediction qsp
		ON pm.ProductID = qsp.ProductID;

This query succeeds but returns results we don’t really want:


This brings back all 9 records tied to products 1 and 2 (because product 3 didn’t exist on July 2nd at 8 AM UTC). But it gives us the same start and end date, so that’s not right. What I really want to do is replace @InterestingTime with qsp‘s DatePredictionMade, so let’s try that:


This returns a syntax error. It would appear that at the time FOR SYSTEM_TIME is resolved, QuantitySoldPrediction does not yet exist. This stops us dead in our tracks.

Act Three: The Denouement

We don’t have an easy way of solving the problem, but we do have a complicated way. Under the covers, dbo.ProductModelHistory is just a table. That means we can still query it like any other table. What we want is the latest single product model date that fits within each quantity sold prediction’s date. To solve this, we pull one of my favorite rabbits out of the hat: the APPLY operator.

FROM dbo.QuantitySoldPrediction q
			FROM dbo.ProductModel pm
				pm.ProductID = q.ProductID


			FROM dbo.ProductModelHistory pmh
				pmh.ProductID = q.ProductID
				AND pmh.StartDateGMT <= q.DatePredictionMade
				pmh.StartDateGMT DESC
		) pmbuild
			pmbuild.StartDateGMT ASC
	) pm;

This query is quite a bit longer than I’d like, as I need to build a sub-query to union together current and historical data, and then grab the top row for each quantity sold prediction. On the plus side, the results are what I want:


This is one of two reasons that make me think that temporal tables do not make for great slowly changing dimension implementations.  The other reason is that you can’t insert into the history table as long as you have SYSTEM_VERSIONING on.  These tables can be quite useful for storing fact table history, as the FOR SYSTEM_TIME AS OF point-in-time snapshot makes a lot more sense in that case.  And as we saw above, if you want to try hard enough when writing your queries to retrieve data, you can make them work.  In my case, I had one query like this I needed to write, so the benefits still outweighed the costs.