T-SQL Anti-Patterns: Clustered GUIDs

My final T-SQL anti-pattern that I’m going to discuss is using GUIDs as a clustered index.  This is a particularly pernicious problem because it’s so easy for developers to fall into it.

A GUID, or globally unique identifier, is a 16-byte field which is (supposed to be) truly unique.  You can run into duplicates when dealing with large enough data sets on enough servers, but let’s not worry about that here.  Instead, I’m going to focus on the performance implications of this.  Let’s say that we have a web application which hits three tables:  Contract, Contact, and ContractContact.  We have people (Contacts), we have agreements (Contracts), and we have a listing of which people are associated with which agreements (ContractContacts).  Our web developer came up with a great page in which you can enter the contract details and tie a person to a contract on the same page at the same time.  That web developer knows about identity integer columns, but because he wants all of the logic to happen in a C# business layer but take place in a single T-SQL transaction, he can’t get the identity integer value.  As such, he grabs the Guid.NewGuid() method in .NET and passes in a GUID as a unique identifier for Contract.  Because this GUID is unique, our developer creates a primary key on the GUID.  He tests the logic out and it works fine in development.

Then he moves it out to prod and failure occurs.  Paul Randal and Kim Tripp pointed out that this could have been a plausible explanation for why the Windows 7 RC download process was so problematic.  The reason is that, if you use a GUID as a primary key, by default, SQL Server also makes that the clustered key.  And having a GUID as a clustered key is a terrible idea.  Given how unique and random GUIDs are, they’re a perfect way of maximizing page splits and I/O operations whenever you insert new rows.  This introduces additional, unnecessary table fragmentation and hurts performance of SELECT queries.

So what alternatives do we have here?  Let’s list a few:

  1. Use identity integer columns.  in our Contract example, the web developer didn’t want to do that, mostly because he didn’t want to create a stored procedure with Contract inputs followed by ContractContact_ContactID1, ContractContact_ContactID2, etc.  Avoiding this is totally reasonable, and if you’re using SQL Server 2008 or later, writing a stored procedure like this is totally unnecessary.  Instead, use a table-valued parameter to store your list of contact IDs.  The great news is that you can build them in .NET quite easily.  Then, in your creation stored procedure, you can get the contract ID’s identity value and insert ContractContact records within the same transaction, so if there’s a problem during insertion, you don’t have to worry about backing part of the data out.
  2. Use a sequence column instead of an identity integer value.  Starting in SQL Server 2012, we can use sequences rather than identity integers.  The main advantage here is that, by following a standard method, we can get guaranteed unique values before inserting rows into the Contract table.  That way, we can assign ContractID on each ContractContact object and still perform all of our .NET operations in a single T-SQL transaction.
  3. Use a non-clustered GUID as your primary key.  This is a last alternative, when the other two just won’t work.  In this case, you can create a GUID in the business layer and use it for web operations, but have an identity column as your clustered key.  There are a couple of problems with this solution as opposed to #1 and #2.  First of all, you have an additional 16 bytes per row, so with a large enough table, that adds up:  1 billion rows means 16 GB of additional storage.  In addition to that base storage addition, you’re also responsible for a non-clustered index on the GUID, so there’s another 16 GB of data.  Finally, whenever you look up an object by its GUID value, you’re going to perform a bookmark lookup, meaning that the number of reads to grab a row will double.  This will make SELECT operations slower.

Getting off of GUIDs can be an arduous process, but the advantage is that your database will now scale a lot better.