Not too long ago, I ended up taking the DP-203 certification exam for sundry reasons. On that exam, they ask a lot about Azure Synapse Analytics, including indexing, distribution, and partitioning strategies. Because these can be a bit different from on-premises SQL Server, I wanted to cover what options are available and when you might choose them. Let’s start with distributions, as that’s the biggest change in thought process.
Distributions and MPP
One of the big differences between Azure Synapse Analytics dedicated SQL pools and on-premises SQL Server instances (or Azure SQL Database or Azure SQL Managed Instance for that matter) is that dedicated SQL pools break your data up into 60 distributions, allowing you to run queries against up to 60 compute nodes. Of course, unless you’re made of money you aren’t actually going to execute against 60 separate compute nodes, but it all works the same way anyhow. By the way, a quick check for how many nodes you have is to take your DWU count and divide it by 500, with a minimum of one node. If you have DWU 100, you get part of a node, so performance will suck. For a full node, start at DWU 500 and accept nothing less. At DWU 1000, it’s two nodes. This goes on and on up to DWU 30,000, which gets you 60 compute nodes.
The next question becomes, how much data do we put on each of the servers and how do we do this? Well, there are three options. Let’s start with the easiest and move on from there.
The easiest answer is, spray all the data to all of the servers. Take that, meticulous planning! When you set
DISTRIBUTION = REPLICATE, you’re telling Azure Synapse Analytics to make a copy of your data available for each of the compute nodes you have. This means that if your table is 500 GB, you’re copying 500 GB to every compute node. This is…not efficient.
Replicated tables work really well in certain circumstances:
- Your data does not change frequently. If you have 20 nodes and replicated data is constantly changing, you’re going to have a lot of message-passing between these servers as they try to sync up. Also, updates are not synchronous, meaning that some nodes may have slightly stale data.
- Total data size is small, maxing out at about 2 GB before you start getting into “Should we really replicate this?” territory.
- Your data is dimensional in nature, not facts.
- Your data is used frequently and you want to limit data moving between nodes. Think about things like a date dimension or a customer dimension.
The hash distribution type is the only distribution which requires you include a column for distribution. For example, if you want to distribute on TransactionKey, you’d use
DISTRIBUTION = HASH(TransactionKey) in your table definition.
The idea of hashing is that you want to spread data across your compute nodes so that each node has a fairly similar amount of workload to cover. Here are some considerations for when you should use hashing:
- Hashing should be your default for fact tables and large dimensions. Think tables which are at least 2 GB in size and probably a lot more.
- Hashing should be your default choice when dealing with heavy-write tables. Hashing allows you to spread out the work between distributions (and therefore compute nodes).
And here are a few thoughts on what makes for a good hash key:
- Use a column with a large number of unique values. If you have 20 compute nodes but only 5 unique values of your hashed key, at least 15 of the nodes will be idle during queries, wasting you money in the process.
- Use a column with a fairly uniform distribution. Identity columns, GUIDs, and other unique columns work great for this because the hashing algorithm will spread them out across the compute nodes. If you pick something like Departure City as the hash key, that’s probably not good, as large cities like New York will all run from the same compute node, meaning that you’ll probably end up having some nodes doing much more work than others.
- Use a column that you frequently use in the
GROUP BYclause or as part of a join with another large table. This way, you can minimize data movement by ensuring that each distribution can join or group independently of the other distributions. This can save you a lot of time, especially when joining together really big tables.
- Use a column which does not have many NULLs. NULLs get hashed to one distribution.
- Don’t use a column which is part of your common
WHEREclauses. If you do distribute on Departure City and you filter
WHERE DepartureCity = 'NYC'then you’re limiting the effectiveness of your hash key.
- Similarly, do not use dates for the hash key! Using dates makes it much more likely that you will end up with one or two compute nodes doing all of the work and the rest of them sitting around, twiddling their thumbs.
The final distribution type is round robin, which you set with
DISTRIBUTION = ROUND_ROBIN. This is the catch-all distribution. It’s not particularly great for performance, but it’ll do and it won’t have the underlying problems of Replicate when you have lots of data.
One important thing to keep in mind with Round Robin is that, despite its name, it does not send one row at a time to each distribution, as that would be incredibly inefficient. It does, however, tend to spread out the load across the distributions, especially when inserting large batches.
Use Round Robin in the following cases:
- You are working with heap tables.
- You do not have an obvious distribution key.
- You do not need to join this table to any other tables—typically this will be something like a staging table.
- If you do join this table, you typically get columns from it but don’t usually filter based on this table. For example, you might have a fact like FactFlights and this could be a dimension like DimAmenities, which describes the amenities available on each flight. You won’t filter on amenities, so there’s no real need to try to find a hash key.
Now that we’ve covered distributions, let’s look at indexing strategy.
There are four options available to us in Azure Synapse Analytics dedicated SQL pools:
- Clustered index
- Non-clustered index
- Clustered columnstore index
Sorry, XML index fans, but no joy for you.
Indexing strategy is pretty similar between the box product and dedicated SQL pools, so I’ll blaze through this bit.
- Use heaps for staging tables, as they’ll allow for fast writes. Don’t use heaps for large tables you regularly query because performance won’t be great.
- Use clustered columnstore indexes for fact tables. Specifically, you want at least 1024 * 1024 = 1,048,576 rows in order to fill up a row group. But remember that data is broken out into 60 distributions, so you really want 1024 * 1024 * 60 = 62,914,560 rows at a minimum in your clustered columnstore indexed fact table. Fewer than 62.9 million rows and you may end up with all of your rows in the delta store, which is a glorified heap. If you have a small fact table, it may make sense just to keep it a heap, at least until it grows beyond that 60-million row mark.
- Use clustered indexes for point lookups, especially on dimensions. If you just need to get a few rows back from a table, or if you are joining to that table from some main fact table, a clustered index is a good choice here. The advice here is a bit different than SQL Server on-premises, where I’d say always have a clustered index. In the MPP warehousing world, clustered indexes can be a bad thing if you need to scan ranges of data, especially on smaller dimension tables.
- Non-clustered indexes also have their place, but in a Kimball-style warehouse, you generally don’t create too many of them, as you tend to drive from the fact tables (including most of your relevant filters) and join via surrogate keys to the dimension tables.
Last up is partitioning.
Here is a quick reminder that partitioning is not intended to be a performance improvement technique. It’s intended to make data loading and off-loading more efficient. It can make queries perform more quickly, but that’s not a guarantee.
Another thing to remember about partitioning is that if you’re partitioning a fact table with a clustered columnstore index, you’ll need to multiply by the number of partitions to hit the “makes sense” break-even point. In other words, you want a minimum of 62.9 million rows per partition. If you partition data monthly, that means you should anticipate at least 754,974,720 rows per year in your fact table. If you aren’t pushing that many rows per year, you might want to drop the partition ranges down to quarterly (188,743,680 rows per quarter) or annually (62.9 million rows per year). To give you a really concrete example, let’s say we store 5 years of data in a fact table and we want to partition the data monthly for a roll-off process. Based on this, we should have a minimum of 5 years * 12 months/year * 60 distributions * (1024 * 1024 rows per rowgroup) = 3,774,873,600 rows before the clustered columnstore index makes sense. By contrast, with on-premises SQL Server, your break-even point is 1/60th of that, or 62,914,560 rows. “Massive” is part of the title, after all.
When choosing a partitioning column, we’re typically going to choose something like a date, to make roll-off easier.
One last thing is that partitioning is easier in a dedicated SQL pool than on-premises, as you don’t have to create a partition schema, and your partition function is inlined. You’d write something like
PARTITION([Column] RANGE RIGHT FOR VALUES (v1,v2,v3,...,vn)). When choosing
RANGE RIGHT or
RANGE LEFT, standard practice applies as with SQL Server:
RANGE RIGHT tends to make more sense for dates, as the values represent everything greater than or equal to the last partition value and less than the current partition value, so if you partition monthly, a range right on
(20210101, 20210201, 20210301) means that you’ll have four partitions: everything before 2021-01-01, 2021-01-01 up to but not including 2021-02-01, 2021-02-01 up to but not including 2021-03-01, and everything after 2021-03-01.
Designing tables in Azure Synapse Analytics dedicated SQL pools is pretty similar to on-premises SQL Server, but as you can see, there are a few differences which matter. Distributions are unique to dedicated SQL pools (and the APS edition of SQL Server, but c’mon, how many of us could afford that thing?), but indexing and partitioning are both available on-premises. If you want to learn more about these strategies, the Azure Synapse Analytics product documentation gives you a lot of good information and clear guidance on what to do and why.