I’m currently working on a project to generate a model in SQL Server R Services. There are some nice demos on how to store a model (like this one that Microsoft has involving ski resort room rental), but there’s one thing that most of these leave out: scale.
When you serialize a model to the database, it’s going to take up some amount of space. In our case, we’re using a few different types of models. Neural nets are relatively svelte, at roughly 40K per model. On the other end, decision trees can take 2-4 MB per model. When you’re storing a few models in SQL Server, size isn’t generally much of an issue. Even if you’re using temporal tables to store model history, it’s probably not a big problem. But when you’re storing a lot of data, it’s a big problem.
How much is a lot? Well, let’s start with this:
These are some numbers for an early beta. We have 91,130 products modeled and a model history of 194,830 products. Each model, on its own, averages out to 177K in our ProductModel table and 212K in ProductModelHistory. As mentioned, different types of models take up different amounts of space (and I didn’t elaborate upon all the types of models we’re using), so we wouldn’t expect the aggregates to be exactly the same. I included a contrast: we’ve generated 16.4 million quantity sold predictions during this time, but the total disk space is under 1 GB and 2/3 of that is a unique nonclustered index. The 372MB is columnstore. I love columnstore…
So right now, we’re burning roughly 200K per model. My stated goal is to be able to store several years worth of data for 10 million products. Let’s say that I need 10 million products in ProductModel and 1 billion rows in ProductModelHistory. That means that we’d end up with 1.86 TB of data in the ProductModel table and 186 TB in ProductModelHistory. This seems…excessive.
As a result, I decided to try using the COMPRESS() function in SQL Server 2016. The COMPRESS function simply uses GZip compression. Yeah, there are compression algorithms which tend to be more compact (e.g., bz2 or 7z), but GZip is relatively CPU efficient and I can wrap my SQL statements with COMPRESS() and DECOMPRESS() and not have to change any calling code; I just need to update the two stored procedures I use to insert and then retrieve product models.
With compression in place, my data on disk looks much less unreasonable:
That’s a drop from 200K per model down to about 45K per model, or an overall 4.7x compression ratio. I noticed that the type of model was instrumental in the amount of compression. Our neural nets compressed roughly 2x, going from approximately 40K per model down to 20K per model. Meanwhile, those decision trees went from 2-4MB down to about 200K per model.
At 45K per model, storing 10 million models is about 429 GB, and 1 billion rows is just shy of 43 TB. For my needs, that saves approximately 140 TB of disk space. Your mileage will probably vary, but if you’re generating a large number of models, think about compression.