Over on the right-hand bar, I have a page listing all of the T-SQL anti-patterns in this series. There are additional anti-patterns, code smells, and things to avoid, but I think 14 is enough for one presentation. Speaking of presentations, you can catch this one at SQL Saturday #277 in Richmond, Virginia, on Saturday, March 22, 2014.
March 4, 2014
March 3, 2014
My final T-SQL anti-pattern that I’m going to discuss is using GUIDs as a clustered index. This is a particularly pernicious problem because it’s so easy for developers to fall into it.
A GUID, or globally unique identifier, is a 16-byte value which is (supposed to be) truly unique. You can run into duplicates when dealing with large enough data sets on enough servers, but let’s not worry about that here. Instead, I’m going to focus on the performance implications. Let’s say that we have a web application which hits three tables: Contract, Contact, and ContractContact. We have people (Contacts), we have agreements (Contracts), and we have a listing of which people are associated with which agreements (ContractContacts). Our web developer came up with a great page in which you can enter the contract details and tie a person to a contract on the same page at the same time. That web developer knows about identity integer columns, but because he wants all of the logic to happen in a C# business layer while taking place in a single T-SQL transaction, he can’t get the identity integer value. As such, he calls the Guid.NewGuid() method in .NET and passes the resulting GUID in as the unique identifier for Contract. Because this GUID is unique, our developer creates a primary key on it. He tests the logic out and it works fine in development.
Then he moves it out to prod and failure occurs. Paul Randal and Kim Tripp pointed out that this could have been a plausible explanation for why the Windows 7 RC download process was so problematic. The reason is that, if you use a GUID as a primary key, by default, SQL Server also makes that the clustered key. And having a GUID as a clustered key is a terrible idea. Given how unique and random GUIDs are, they’re a perfect way of maximizing page splits and I/O operations whenever you insert new rows. This introduces additional, unnecessary table fragmentation and hurts performance of SELECT queries.
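To make the setup concrete, here is a minimal sketch of what such a table might look like (the column names are hypothetical). Note that, without an explicit NONCLUSTERED keyword, the primary key becomes the clustered index:

-- A hypothetical Contract table keyed on a GUID; the primary key is clustered by default.
CREATE TABLE dbo.Contract
(
    ContractID UNIQUEIDENTIFIER NOT NULL,   -- populated from Guid.NewGuid() in the application
    ContractName VARCHAR(100) NOT NULL,
    CONSTRAINT [PK_Contract] PRIMARY KEY (ContractID)   -- clustered unless you say otherwise
);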
So what alternatives do we have here? Let’s list a few:
- Use identity integer columns. In our Contract example, the web developer didn’t want to do that, mostly because he didn’t want to create a stored procedure with Contract inputs followed by ContractContact_ContactID1, ContractContact_ContactID2, etc. Avoiding this is totally reasonable, and if you’re using SQL Server 2008 or later, writing a stored procedure like this is totally unnecessary. Instead, use a table-valued parameter to store your list of contact IDs. The great news is that you can build them in .NET quite easily. Then, in your creation stored procedure, you can get the contract ID’s identity value and insert ContractContact records within the same transaction, so if there’s a problem during insertion, you don’t have to worry about backing part of the data out. (There’s a sketch of this after the list.)
- Use a sequence column instead of an identity integer value. Starting in SQL Server 2012, we can use sequences rather than identity integers. The main advantage here is that, by following a standard method, we can get guaranteed unique values before inserting rows into the Contract table. That way, we can assign ContractID on each ContractContact object and still perform all of our .NET operations in a single T-SQL transaction. (This is the second sketch after the list.)
- Use a non-clustered GUID as your primary key. This is a last-resort alternative, for when the other two just won’t work. In this case, you can create a GUID in the business layer and use it for web operations, but have an identity column as your clustered key. There are a couple of problems with this solution as opposed to #1 and #2. First of all, you have an additional 16 bytes per row, so with a large enough table, that adds up: 1 billion rows means 16 GB of additional storage. On top of that base storage cost, you also need a non-clustered index on the GUID, so there’s roughly another 16 GB of data. Finally, whenever you look up an object by its GUID value, you’re going to perform a bookmark lookup, meaning that the number of reads to grab a row will double. This will make SELECT operations slower.
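To illustrate the first option, here is a minimal sketch, with hypothetical type, procedure, and column names, of a table-valued parameter feeding a single-transaction insert (this assumes Contract.ContractID is an identity column):

-- Hypothetical table type holding the list of contact IDs coming from .NET.
CREATE TYPE dbo.ContactIDList AS TABLE
(
    ContactID INT NOT NULL PRIMARY KEY
);
GO
CREATE PROCEDURE dbo.Contract_Create
    @ContractName VARCHAR(100),
    @ContactIDs dbo.ContactIDList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    INSERT INTO dbo.Contract (ContractName)
    VALUES (@ContractName);

    -- Grab the identity value we just generated.
    DECLARE @ContractID INT = SCOPE_IDENTITY();

    -- Tie every contact to the new contract within the same transaction.
    INSERT INTO dbo.ContractContact (ContractID, ContactID)
    SELECT @ContractID, c.ContactID
    FROM @ContactIDs c;

    COMMIT TRANSACTION;

    SELECT @ContractID AS ContractID;
END;

And here is a sketch of the second option: with a sequence (SQL Server 2012 and later), the caller can reserve the ContractID up front and still do all of the work in one transaction. Again, the object names are hypothetical:

-- Hypothetical sequence backing Contract.ContractID.
CREATE SEQUENCE dbo.ContractIDSequence AS INT START WITH 1 INCREMENT BY 1;
GO
-- Reserve a guaranteed-unique ID before inserting any rows.
DECLARE @ContractID INT = NEXT VALUE FOR dbo.ContractIDSequence;

BEGIN TRANSACTION;

INSERT INTO dbo.Contract (ContractID, ContractName)
VALUES (@ContractID, 'New Contract');

INSERT INTO dbo.ContractContact (ContractID, ContactID)
VALUES (@ContractID, 1), (@ContractID, 2);

COMMIT TRANSACTION;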
Getting off of GUIDs can be an arduous process, but the advantage is that your database will now scale a lot better.
March 2, 2014
Yesterday’s anti-pattern, non-SARGable ORs in joins, was all about writing code which the SQL Server database engine can optimize. Today’s anti-pattern is in that vein, but this time we’re looking at cross-server queries.
Linked servers are a powerful feature within SQL Server, and I don’t want to throw them out entirely. What I do want to do, however, is limit how often they are used in regular, business-critical code. The reason is that, prior to SQL Server 2012 SP1, table statistics were not available to linked server users without sysadmin, db_owner, or db_ddladmin rights. The number of cases in which I would want a linked server login to have sysadmin, db_owner, or db_ddladmin rights is very small, and I would definitely not want to give those rights to a web application or anything else user-facing. Without those table statistics, SQL Server would make poor choices with linked server queries, and so joins against remote tables would get very expensive.
Even with those statistics, cross-server joins are going to be expensive. The reason is that in order to join row sets from two servers, you need to pass all of one server’s row set over to the other server, perform the join operation, and then (potentially) pass the full result set back to the first server. If you’re joining two tables with a billion rows, that could be extremely expensive, even if you have great indexes on both sides.
So how do we get around this? Here are a few potential methods:
- Get rid of the linked servers altogether and use another method, such as replication or AlwaysOn Availability Groups, to propagate the data you need from your remote server. This makes a lot of sense if you’re looking at relatively static lookup data.
- For truly static data, you might keep non-replicated copies on both servers. Replication can be a pain to troubleshoot and manage, and Availability Groups require SQL Server 2012 Enterprise Edition, so they can be very expensive. Thus, if the data never changes, or changes only once a year, it might make more sense to create the yearly change script and execute it on each server.
- If you absolutely need the remote data, try to re-write queries to avoid joining local tables to remote tables. Create a temporary table on your local server, pull in as few rows from the remote server as possible, and then join your local tables to the temp table you created. You still take the linked server hit, but there’s a much smaller chance of that query blowing up.
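Here is a minimal sketch of that temp-table approach, assuming a linked server named [RemoteServer] and hypothetical database, table, and column names:

-- Local temp table to hold just the remote rows we need.
CREATE TABLE #RemoteCustomer
(
    CustomerID INT NOT NULL PRIMARY KEY,
    CustomerName VARCHAR(100) NOT NULL
);

-- Pull only the necessary rows across the wire.
INSERT INTO #RemoteCustomer (CustomerID, CustomerName)
SELECT c.CustomerID, c.CustomerName
FROM [RemoteServer].RemoteDB.dbo.Customer c
WHERE c.IsActive = 1;

-- Join local tables to the local temp table instead of the remote table.
SELECT o.OrderID, o.OrderDate, rc.CustomerName
FROM dbo.[Order] o
    INNER JOIN #RemoteCustomer rc
        ON o.CustomerID = rc.CustomerID;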
March 1, 2014
Our last anti-pattern had us discussing unnecessary repetition of queries, but this time, I’m going to hit the opposite problem: trying to merge two queries which should remain separate.
Let’s start with an example of a query. This is a simplified version of a query which I ran into at a prior job:
SELECT
    ACCT.[Id],
    ACCT.[TypeId],
    ACCT.[EntityId],
    ACCT.[Name],
    ACCT.[Purpose],
    TF.Amount AS TransferAmount
FROM dbo.Account ACCT
    INNER JOIN dbo.Transfer TF
        ON ACCT.Id = TF.AccountToId
        OR ACCT.Id = TF.AccountFromId;
There’s nothing objectionable in this query, per se…at least until you try to run it. The problem here is that the join from Account to Transfer happens on one of two keys: ID to AccountToID, or ID to AccountFromID. If we have indexes on neither AccountToID nor AccountFromID, we’re going to scan the Transfer table once for each account. So, our DBA looks at the scan and says, “Hey, let’s put an index on this.” Unfortunately, the indexes do nothing:
- If you put an index on AccountToID, we still need to scan to get the rows with a matching AccountFromID.
- If you put an index on AccountFromID, we still need to scan to get rows with a matching AccountToID.
- If you put a composite index on both AccountToID and AccountFromID, that still doesn’t help: the index can only seek on its leading column, and rows matching on AccountToID typically won’t be the same as those matching on AccountFromID (you don’t transfer money from one account to the same account, after all).
But let’s say you put separate indexes on AccountToID and AccountFromID. In that case, one of the indexes might be useful, but SQL Server won’t be able to use both of them effectively in a single query.
The performance-improving correction I made was as follows:
SELECT
    ACCT.[Id],
    ACCT.[TypeId],
    ACCT.[EntityId],
    ACCT.[Name],
    ACCT.[Purpose],
    TF.Amount AS TransferAmount
FROM dbo.Account ACCT
    INNER JOIN dbo.Transfer TF
        ON ACCT.Id = TF.AccountFromId

UNION ALL

SELECT
    ACCT.[Id],
    ACCT.[TypeId],
    ACCT.[EntityId],
    ACCT.[Name],
    ACCT.[Purpose],
    TF.Amount AS TransferAmount
FROM dbo.Account ACCT
    INNER JOIN dbo.Transfer TF
        ON ACCT.Id = TF.AccountToId;
In our case, we had a check constraint making sure that AccountToID and AccountFromID were not the same on a single transfer, so I could use a UNION ALL. By separating this out into two queries, we can now take full advantage of the two separate indexes. This leads to two separate accesses of the Account table rather than one, but cuts down on I/O tremendously, especially as Account and Transfer get large.
So when is it OK to use OR in a join? In the work that I’ve done, I’ve seen three major cases:
- TableX x INNER JOIN TableY y ON (x.ColumnA = y.ColumnA OR x.ColumnB = y.ColumnB) — Two separate columns with two separate joins are OK.
- TableX x INNER JOIN TableY y ON (x.ColumnA = y.ColumnA OR x.ColumnA = 6) — Join on a match or when one table matches a constant. That second part could be ColumnB as well.
- TableX x INNER JOIN TableY y ON (x.ColumnA = y.ColumnA OR z.ColumnA = y.ColumnA) — Join on another table match, as long as z was brought in on a join earlier in the query. Watch out for accidental Cartesian products here.
In contrast, I would simplify our scenario as: TableX x INNER JOIN TableY y ON (x.ColumnA = y.ColumnA OR x.ColumnB = y.ColumnA). Whenever we have two separate columns matching to one column in the other table, we will want to look at an alternative form of joining these tables together. Otherwise, we run the risk of terrible performance.
February 28, 2014
Row By Agonizing Row (RBAR, pronounced “re-bar”) was our last anti-pattern. Today, I’m going to extend it to a term I’m coining, Query By Agonizing Query (QBAQ, pronounced “cue-back”). According to a quick Google search, nobody has used that term before, so that’s my claim to fame.
What I mean by QBAQ is the following anti-pattern:
UPDATE dbo.SomeTable SET ColumnA = @Value WHERE KeyColumn = @ID;
UPDATE dbo.SomeTable SET ColumnB = @SomeOtherValue WHERE KeyColumn = @ID;
UPDATE dbo.SomeTable SET ColumnC = ColumnA + ColumnB WHERE KeyColumn = @ID;
In this case, we see multiple queries where just one would do nicely:
UPDATE dbo.SomeTable
SET
    ColumnA = @Value,
    ColumnB = @SomeOtherValue,
    ColumnC = @Value + @SomeOtherValue   -- within a single UPDATE, ColumnA and ColumnB would still be the old values
WHERE KeyColumn = @ID;
The top query hits SomeTable three times. Even though each statement is a single seek (because we’re filtering on the primary key column), that’s still two more seeks than necessary. If you run this code on a regular basis, performance will be worse than it needs to be. Also, if these statements run in a single transaction, you’re going to lock pages in SomeTable for a longer period of time than necessary.
Often, QBAQ gets combined with RBAR, so that the full query might be something like:
WHILE EXISTS(SELECT * FROM #RelevantValues)
BEGIN
    SELECT TOP 1 @ID = ID FROM #RelevantValues;

    UPDATE dbo.SomeTable SET ColumnA = @Value WHERE KeyColumn = @ID;
    UPDATE dbo.SomeTable SET ColumnB = @SomeOtherValue WHERE KeyColumn = @ID;
    UPDATE dbo.SomeTable SET ColumnC = ColumnA + ColumnB WHERE KeyColumn = @ID;

    DELETE FROM #RelevantValues WHERE ID = @ID;
END
Here is a piece of code which you could optimize quite nicely, removing all kinds of table hits:
UPDATE st
SET
    ColumnA = @Value,
    ColumnB = @SomeOtherValue,
    ColumnC = @Value + @SomeOtherValue
FROM dbo.SomeTable st
    INNER JOIN #RelevantValues rv
        ON st.KeyColumn = rv.ID;
This T-SQL-specific form of the UPDATE statement eliminates the loop, the N scans (or seeks) of #RelevantValues, the N deletes from #RelevantValues, and two-thirds of our table update hits. It’s also shorter and easier to understand.
UPDATE statements aren’t the only time I see QBAQ, but they’re typically a good start. I also sometimes see it with UNION abuse:
SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue = 1
UNION ALL
SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue = 2
UNION ALL
SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue = 3
UNION ALL
SELECT A, B, C FROM dbo.SomeTable WHERE SomeOtherValue = 'A';
In this case, we have four seeks or scans against SomeTable, and we can get that down to one:
SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue IN (1, 2, 3) OR SomeOtherValue = 'A';
If you’re concerned about SARGability, you can also try that as two separate queries; in any event, we cut the number of table seeks/scans at least in half, and that adds up.
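One sketch of that two-query form, assuming indexes on SomeValue and SomeOtherValue and assuming SomeValue is not nullable (so the NOT IN guard doesn’t drop rows unexpectedly), would be:

SELECT A, B, C
FROM dbo.SomeTable
WHERE SomeValue IN (1, 2, 3)

UNION ALL

SELECT A, B, C
FROM dbo.SomeTable
WHERE SomeOtherValue = 'A'
    AND SomeValue NOT IN (1, 2, 3);   -- avoid returning rows which match both predicates twice

Each branch can then seek its own index, at the cost of a second pass against the table.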
Most of the time, QBAQ comes from being stuck in a procedural mindset: “I need to perform operation X, then operation Y, then operation Z.” Once you think in those terms, you’re biasing your answers toward performing discrete operations, rather than seeing if there is a good way of combining X, Y, and Z together in one query. Being able to do this is a great indicator that you truly understand and can apply set-based concepts.
February 26, 2014
The last couple of days, I poked a bit of fun at object-oriented developers who move to a relational platform and try to do things like nesting views or using functions inappropriately (although, to be fair, it’s not so much that developers are trying to use functions as that the way Microsoft implemented functions generally leads to horribly inefficient outcomes). Today’s anti-pattern will be quick but important. Today, I’m going to talk about Row-By-Agonizing-Row (RBAR) queries, something I tend to rant about (such as in my tally tables presentation).
Most students learning to program tend to get introduced first to structural or object-oriented languages like C and C#/Java, respectively. With these languages, we see certain structural patterns emerge: conditional statements, method redirection, and loops. These work wonders in C-based languages, as well as other structural/OO languages, and so it makes sense for a developer whose primary focus is C# to import that paradigm into writing SQL. Importing that paradigm, however, leads to devastating performance problems.
Here’s a concrete example: suppose that you have to loop through a set of users and set their IsActive flag to false if their last recorded login was more than three months ago. Armed solely with structural programming tools, there’s an obvious solution here: use two separate loops.
- Loop through each user in the User table, or at least those users who are currently active.
- For each user, we create another loop which goes through the UserLogin table, storing in a local variable the latest login date we have found for that user. Once the inner loop finishes, if that latest login is more than three months old (or the user never logged in), we set the user’s IsActive flag to false.
As far as the logic goes, this is sound reasoning. Unfortunately, it’s also a performance hog within SQL Server if you actually try to implement this as two separate cursors (sketched below). The reason is that we’ll be fetching from the User table N times (once for each user) and reading UserLogin M*N times (every user login, for each of the N users). We can improve this a bit by putting filters on, like getting only active users and turning M*N into m*N, where m is the number of UserLogin records associated with that particular user. Nevertheless, you’re hitting the database N + M*N (or N + m*N) times.
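For the curious, here is roughly what that nested-cursor approach looks like in T-SQL. This is a sketch with hypothetical column names, and the inner cursor is already filtered down to the m*N form; the point is how much machinery it takes, not that you should ever write this:

-- RBAR sketch: don't do this.
DECLARE @UserID INT, @LoginDate DATETIME, @LatestLogin DATETIME;

DECLARE UserCursor CURSOR FOR
    SELECT UserID FROM dbo.[User] WHERE IsActive = 1;
OPEN UserCursor;
FETCH NEXT FROM UserCursor INTO @UserID;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @LatestLogin = NULL;

    -- Inner loop: walk this user's logins one row at a time.
    DECLARE LoginCursor CURSOR FOR
        SELECT LoginDate FROM dbo.UserLogin WHERE UserID = @UserID;
    OPEN LoginCursor;
    FETCH NEXT FROM LoginCursor INTO @LoginDate;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        IF @LatestLogin IS NULL OR @LoginDate > @LatestLogin
            SET @LatestLogin = @LoginDate;
        FETCH NEXT FROM LoginCursor INTO @LoginDate;
    END
    CLOSE LoginCursor;
    DEALLOCATE LoginCursor;

    IF @LatestLogin IS NULL OR @LatestLogin < DATEADD(MONTH, -3, CURRENT_TIMESTAMP)
        UPDATE dbo.[User] SET IsActive = 0 WHERE UserID = @UserID;

    FETCH NEXT FROM UserCursor INTO @UserID;
END
CLOSE UserCursor;
DEALLOCATE UserCursor;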
Some people see cursors and say, “Hey, I’ll improve performance here: I’m going to get rid of this cursor and replace it with a WHILE loop.” Those people are wrong. At best, WHILE loops are the same thing as cursors; at worst, you might actually write a cursor which performs better than the equivalent WHILE loop. In either event, it’s not the appropriate solution. The appropriate solution is to start thinking in terms of sets.
For our previous example, we can re-write the problem to be handled in one query, whose pseudocode would look a bit like:
UPDATE u
SET IsActive = 0
FROM dbo.[User] u
WHERE NOT EXISTS
(
    SELECT *
    FROM dbo.UserLogin ul
    WHERE ul.UserID = u.UserID
        AND ul.LoginDate >= DATEADD(MONTH, -3, CURRENT_TIMESTAMP)
);
In this case, we’re hitting the database one time. With an appropriate index on dbo.UserLogin, we can make the non-existence query fast, so we’re scanning the User table and seeking against UserLogin. Performance will be tremendously better than the RBAR solution. And fortunately for us, almost every cursor-based solution can be re-written using set-based logic. There are a few maintenance functions that you’d probably want to use a cursor for (and that’s how things like sp_msforeachdb work), but those should be few and far between.
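A hypothetical supporting index might be as simple as:

-- Keyed on UserID, then LoginDate, so the NOT EXISTS check becomes a narrow seek per user.
CREATE NONCLUSTERED INDEX [IX_UserLogin_UserID_LoginDate]
    ON dbo.UserLogin (UserID, LoginDate);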
Also, it’s important to note that in the Oracle world, cursors typically perform much better than in SQL Server. This means that even a skilled Oracle developer, upon entering the world of SQL Server, could mess this up pretty badly. But if you’re going to spend a lot of time writing T-SQL, you really need to be able to think in terms of sets.
February 25, 2014
Yesterday’s anti-pattern (nested views) was part one of the “Do Repeat Yourself [Sometimes]” mini-series. Today, I’d like to get into part two of the series: user-defined functions. When Microsoft introduced user-defined functions in SQL Server 2000, it was an object-oriented developer’s dream come true. It showed that Microsoft wanted to make true encapsulation—and not just views—a reality. You would be able to define your business logic in functions (scalar functions to perform calculations, or table-valued functions to return result sets) and have everything in just one location. The idea was fantastic, but the implementation was…problematic.
The biggest problem with user-defined functions is that they simply don’t perform well. With a scalar function, you’re performing the operation for every row in a result set. If this is a quick calculation that doesn’t require anything more than a bit of CPU and memory (i.e., all of the fields are already defined as part of the result set and you’re just performing mathematical or string operations on them) and you aren’t filtering by the function, a scalar function won’t kill you. But if you need to hit a table inside your scalar function, you’re going to be accessing that table once for every row in your result set. That will kill you. Even worse is if you make the scalar function part of your WHERE clause. In that case, you need to perform the operation for every single row, regardless of whether that row makes it into your final result set. In other words, scalar user-defined functions are not SARGable.
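As a quick illustration (the function and the Sale table here are hypothetical), compare wrapping a column in a scalar function against inlining the same expression:

-- A trivial scalar function: just math, no table access.
CREATE FUNCTION dbo.GetTotalWithTax (@Amount DECIMAL(10,2))
RETURNS DECIMAL(10,2)
AS
BEGIN
    RETURN @Amount * 1.08;
END;
GO
-- The function runs once per row and hides the column from the optimizer,
-- so this predicate cannot seek an index on Amount:
SELECT SaleID, Amount
FROM dbo.Sale
WHERE dbo.GetTotalWithTax(Amount) >= 100;
GO
-- Inlining the expression keeps the predicate SARGable:
SELECT SaleID, Amount
FROM dbo.Sale
WHERE Amount >= 100 / 1.08;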
Table-valued functions can be better (as the Dieter post above notes), but even they come with a performance cost over inlining the function’s code in your queries. Here is an older article (from the SQL Server 2000 days) which shows what has been my typical experience: inline TVFs are much faster than scalar or multi-statement functions, but tend to have higher I/O requirements than the raw queries. This makes inline TVFs potentially a reasonable solution, but they’re something you need to test in a production-sized environment before using.
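For reference, an inline TVF is just a single RETURN of a SELECT statement, which the optimizer can expand into the calling query much like a view. Here is a minimal sketch with hypothetical objects:

-- Inline table-valued function: no BEGIN/END, just one RETURN of a query.
CREATE FUNCTION dbo.GetActiveOrders (@CustomerID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT o.OrderID, o.OrderDate, o.TotalDue
    FROM dbo.[Order] o
    WHERE o.CustomerID = @CustomerID
        AND o.Status = 'Active'
);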
So now that we’re throwing out scalar and multi-statement functions, what’s left? If you need a result set back, an inline TVF can work, as we mentioned. But if you need to perform some complex calculations against a result set, I’d recommend using the CLR. The CLR is faster for a number of scenarios, such as string splitting or calculating a median. It’s still a little scary to DBAs who are insulated from the general .NET world, but it would be a shame if people missed out on such an important tool, nearly a decade later, just because they don’t know its value.
February 24, 2014
Our next T-SQL anti-pattern is a performance killer. Similar to EAV, it happens when developers get a little too smart for their own good.
Most good developers know about DRY: Don’t Repeat Yourself. It’s an admirable principle and one to which I subscribe in many languages. The whole idea is, once you see yourself repeating code sections, turn that code section into its own method. Once you see yourself repeating the same patterns, look for a layer of abstraction which gets rid of the repetition and allows you to centralize all of that code in one place. That way, you don’t have to worry about finding and fixing all of the places if you ever need to change that logic, and it’s a lot easier to test one single module than dozens of modules interspersed throughout a system.
In T-SQL, modularization comes in a couple of flavors. Today’s topic is views. A view is nothing more than a stored SELECT query. In this example, I’m going to use the AdventureWorks2012 database.
First, let’s say that we have a query which we run regularly. It focuses on sales order headers and details, pulling back information that we require. This gets called from a few locations, so we turn it into a view, like this one:
CREATE VIEW vSalesOrderDetail_1 AS
SELECT
    soh.SalesOrderID,
    soh.OrderDate,
    soh.DueDate,
    soh.Status,
    st.Name AS SalesTerritory,
    cc.CardType,
    cc.CardNumber,
    soh.SubTotal,
    soh.TotalDue,
    sod.SalesOrderDetailID,
    sod.OrderQty,
    sod.UnitPrice,
    sod.LineTotal
FROM Sales.SalesOrderHeader soh
    INNER JOIN Sales.SalesOrderDetail sod
        ON soh.SalesOrderID = sod.SalesOrderID
    INNER JOIN Sales.SalesTerritory st
        ON soh.TerritoryID = st.TerritoryID
    INNER JOIN Sales.CreditCard cc
        ON soh.CreditCardID = cc.CreditCardID;
The nice part here is that when we execute the view, the query plan is exactly the same as if we had run the query itself:
Once we have that view, we see how much easier life is with the code in one central location, so we start creating other views. Eventually, the object-oriented programming portion of our brain kicks in and says that, hey, we could use the views as an encapsulation layer, letting us repeat less and less code. Why remember all of these joins when we can just do it in a couple of views? Furthermore, we could have one main view and just get the parts that we want. For example, let’s imagine that we start with our base view, but we only need a couple of items: SalesOrderID, SalesOrderDetailID, and LineTotal. Here’s what our execution plan looks like afterward:
Well, that’s a little more complex than we expected. I mean, we’re really only getting data from two tables, so why are we still getting credit card info? Here’s what the raw query looks like:
That’s more like it. As far as it goes, SQL Server estimates that the raw query would be about 1/5 of the cost of getting the same records back from our view.
The reason for this is that our view uses several INNER JOINs. An INNER JOIN acts as an implicit filter (rows without a match drop out), so SQL Server has to perform those joins even though we never asked for any of their columns. Let’s change all of the INNER JOINs to LEFT OUTER JOINs instead. Here’s the new query plan:
Now that’s more like it. So, problem solved: we’ll just use LEFT OUTER JOINs for everything. This individual performance gain comes at a potential overall performance cost, however: when you use INNER JOINs, you allow the SQL Server optimizer to start from any table and drive through the query as it sees fit. When you change a join to a LEFT OUTER JOIN instead of an INNER JOIN, you force the left table to be accessed first. If you have an excellent filter on the right table, it won’t be used as early as if the tables were INNER JOINed together.
Furthermore, once we get more complicated queries, the optimizer sort of gives up on us. Let’s see an example. First, I’m going to create a couple more views. The first builds off our example, adding in the salesperson’s e-mail address:
CREATE VIEW vEmail AS
SELECT
    v.SalesOrderID,
    v.SalesOrderDetailID,
    v.LineTotal,
    be.BusinessEntityID,
    p.FirstName,
    p.LastName,
    e.EmailAddress
FROM vSalesOrderDetail_1 v
    LEFT OUTER JOIN Sales.SalesOrderHeader soh
        ON v.SalesOrderID = soh.SalesOrderID
    LEFT OUTER JOIN Sales.SalesPerson sp
        ON soh.SalesPersonID = sp.BusinessEntityID
    LEFT OUTER JOIN Person.BusinessEntity be
        ON sp.BusinessEntityID = be.BusinessEntityID
    LEFT OUTER JOIN Person.Person p
        ON p.BusinessEntityID = be.BusinessEntityID
    LEFT OUTER JOIN Person.EmailAddress e
        ON e.BusinessEntityID = p.BusinessEntityID;
Then, we’ll create one more query which reads from vEmail and joins back to sales order header a third time to get one more field. Here’s what our final query looks like:
SELECT
    v.SalesOrderID,
    soh.AccountNumber,
    v.SalesOrderDetailID,
    v.LineTotal,
    sp.CommissionPct,
    v.BusinessEntityID
FROM vEmail v
    INNER JOIN Sales.SalesPerson sp
        ON v.BusinessEntityID = sp.BusinessEntityID
    INNER JOIN Sales.SalesOrderHeader soh
        ON soh.SalesOrderID = v.SalesOrderID
WHERE v.SalesOrderID IN (43659, 43660, 43661);
And here is our execution plan:
You’ll notice that our execution plan includes joins to tables like EmailAddress, even though we never referenced them in our query. It also joins SalesOrderHeader three times and SalesPerson twice. The reason is that our query, joining to a view inside a view, was complex enough that SQL Server basically gave up on trying to come up with the optimal re-write. For those interested, the optimal form of this query is:
SELECT
    soh.SalesOrderID,
    soh.AccountNumber,
    sod.SalesOrderDetailID,
    sod.LineTotal,
    sp.CommissionPct,
    sp.BusinessEntityID
FROM Sales.SalesOrderHeader soh
    LEFT OUTER JOIN Sales.SalesOrderDetail sod
        ON soh.SalesOrderID = sod.SalesOrderID
    LEFT OUTER JOIN Sales.SalesPerson sp
        ON soh.SalesPersonID = sp.BusinessEntityID
WHERE soh.SalesOrderID IN (43659, 43660, 43661);
When running these two queries side-by-side, the nested view form of the query had a cost estimate of 77% of the batch, whereas the optimal query was only 23%. In terms of I/O, the nested view query had a total of 199 logical reads versus 24 for the optimal form. If we run this query a huge number of times, that starts to add up fast.
The general rule to follow is that the optimizer is smart, but it can get tricked pretty easily. In this case, it was pretty easy to tell what a superior query would look like, but the optimizer didn’t have enough time to whittle down the sets of tables, figure out which ones were actually necessary, and pull the correct data from those. Instead, it saw views inside of views and pretty much gave up, deconstructing the views as-is rather than merging their component tables together like a human would. Because of this, nesting views is a dangerous anti-pattern. A junior developer may not understand what’s going on and won’t be able to figure out why a simple query with just two joins performs so poorly. The answer is, because there are more than just two joins; the two joins are just the surface and it’s views all the way down.
February 23, 2014
Yesterday’s anti-pattern ended with me talking about performance. The rest of my anti-patterns have to do primarily with performance. The first entry in this category is the king of poorly-performing solutions: the Entity-Attribute-Value model. The single best presentation of the EAV model and its constituent failings is this one by Phil Factor. What follows is my quick presentation of the problem.
Developers love to have clockwork systems in which they perform the work once and don’t have to deal with their systems afterward. It’s a laudable goal—after all, who wants to deal constantly with the same problems over and over? Unfortunately, the most difficult part of systems design from this perspective is dealing with databases. You specify your entities and attributes, but when the business changes requirements, you have to modify tables. Then there’s the problem that sometimes, people don’t know beforehand what attributes they want to collect and analyze. As a developer, you go to your thinking place and try to come up with a workable system which doesn’t require that you constantly add and move columns around on tables. After a spark of genius, it hits you: the One True Lookup Table:
CREATE TABLE dbo.ItemAttribute
(
    ItemAttributeID int identity(1,1) NOT NULL,
    ItemID int NOT NULL,
    Name varchar(100) NOT NULL,
    Value varchar(max) NOT NULL
);
ALTER TABLE dbo.ItemAttribute ADD CONSTRAINT [PK_ItemAttribute] PRIMARY KEY CLUSTERED (ItemAttributeID);
ALTER TABLE dbo.ItemAttribute ADD CONSTRAINT [UKC_ItemAttribute] UNIQUE (ItemID, Name);
ALTER TABLE dbo.ItemAttribute ADD CONSTRAINT [FK_ItemAttribute_Item] FOREIGN KEY (ItemID) REFERENCES dbo.Item(ItemID);
Your annoying DBA can’t complain about this; it’s normalized, you have good constraints, and your fields are exactly what they say. In addition, you only need this one table for all item attributes. Best of all, your job is done: if they want to collect color, they can collect color; if they want to see size, they can see size; if they want number of units for one item and washing instructions for another, you just need this one table for everything. After declaring yourself Super-Genius of the Century and implementing your system, you start daydreaming about the Ferrari you’re going to buy with the hefty bonus management will surely give you, when the phone interrupts you. It is management, all right, but not calling to tell you to come collect your giant check; rather, they’re calling to tell you that your system is dog-slow.
That’s the fundamental problem with EAV: inserting data is easy, but retrieving the data can be a mess. Let’s take a fairly basic business request: for all clothing-related items, get the manufacturer’s name, the product color, the product size, and the country of origin. But we only want to get the records which cost at least $14 per unit. All of these are item attributes; the Item entity is basically a stub with a couple of common properties, but because the company handles clothes, electronics, jewelry, condos, and landscaping services, you decided that all of the non-common attributes go into ItemAttribute.
Here’s what the above query looks like:
SELECT
    i.ItemID,
    i.Name AS ItemName,
    COALESCE(iam.Value, 'UNKNOWN') AS ManufacturerName,
    COALESCE(iac.Value, 'UNKNOWN') AS ItemColor,
    COALESCE(ias.Value, 'UNKNOWN') AS ProductSize,
    COALESCE(iao.Value, 'UNKNOWN') AS CountryOfOrigin
FROM dbo.Item i
    LEFT OUTER JOIN dbo.ItemAttribute iam
        ON i.ItemID = iam.ItemID
        AND iam.Name = 'Manufacturer'
    LEFT OUTER JOIN dbo.ItemAttribute iac
        ON i.ItemID = iac.ItemID
        AND iac.Name = 'Color'
    LEFT OUTER JOIN dbo.ItemAttribute ias
        ON i.ItemID = ias.ItemID
        AND ias.Name = 'Size'
    LEFT OUTER JOIN dbo.ItemAttribute iao
        ON i.ItemID = iao.ItemID
        AND iao.Name = 'Country Of Origin'
    INNER JOIN dbo.ItemAttribute iapt
        ON i.ItemID = iapt.ItemID
        AND iapt.Name = 'Product Type'
    INNER JOIN dbo.ItemAttribute iamc
        ON i.ItemID = iamc.ItemID
        AND iamc.Name = 'Unit Cost'
WHERE iapt.Value = 'Clothing'
    AND CAST(iamc.Value AS DECIMAL(8,2)) >= 14;
This is a monster of a query. We have to hit the ItemAttribute table six times just for a simple request! To make matters worse, indexing this query is next to impossible because our Value attribute is a VARCHAR(MAX) type, which can’t be an index key column. This means that your annoying DBA can’t do anything easy to make your query go faster.
Aside from terrible performance, there are several other problems with the Entity-Attribute-Value model. First of all, there is no real data integrity. You created meaningful constraints, but you’re not constraining the one part which needs it: Value. In one case, we’re casting Value as a decimal type to perform a comparison. What happens if somebody accidentally puts in a non-decimal value? “Okay,” the developer may think, “I can just create a few columns: IntValue, ShortVarcharValue, VarcharValue, DecimalValue, etc.” If this sounds absurd, it should. You’re mixing metadata (attribute type) with data (the attribute itself).
Aside from that data integrity problem, we also have a referential integrity problem. With the Name column, one person could use “Color” and another “Colour.” Different spellings, misspellings, and rephrasings of product attributes happen. You could create a lookup table called ItemAttributeName which lists valid values, but now you have a half-dozen more joins.
The EAV model is great for mockups and systems with a few rows of data. When it comes to handling a serious system, however, you need a good data model. There’s just no easy way around it.
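As a sketch of what “a good data model” might mean here (table and column names are hypothetical), the clothing attributes could live in a typed subtype table, which collapses the six-join monster above into a single join with a typed, SARGable filter:

-- Hypothetical subtype table for clothing items, with real data types and constraints.
CREATE TABLE dbo.ClothingItem
(
    ItemID INT NOT NULL,
    ManufacturerName VARCHAR(100) NOT NULL,
    Color VARCHAR(30) NOT NULL,
    ProductSize VARCHAR(20) NOT NULL,
    CountryOfOrigin VARCHAR(60) NOT NULL,
    UnitCost DECIMAL(8,2) NOT NULL,
    CONSTRAINT [PK_ClothingItem] PRIMARY KEY CLUSTERED (ItemID),
    CONSTRAINT [FK_ClothingItem_Item] FOREIGN KEY (ItemID) REFERENCES dbo.Item(ItemID)
);

-- The earlier business request becomes one indexed join with a typed filter.
SELECT
    i.ItemID,
    i.Name AS ItemName,
    ci.ManufacturerName,
    ci.Color,
    ci.ProductSize,
    ci.CountryOfOrigin
FROM dbo.Item i
    INNER JOIN dbo.ClothingItem ci
        ON i.ItemID = ci.ItemID
WHERE ci.UnitCost >= 14.00;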
February 22, 2014
Like the previous T-SQL anti-pattern, this is really all about taking advantage of what a relational database management system offers you. The benefits of an RDBMS come from the relational algebra upon which it is based, and in order to get the best outcomes, you need to follow the rules of the game. When it comes to relational database systems, normalization is the primary rule.
I’m not going to go into detail on normalization—that’s a talk of its own (and one which you can read on my corporate site; right now, that’s a sneak preview, as I haven’t given that talk anywhere yet). Basically, there are a whole slew of anti-patterns around not normalizing your data. They include (but are not limited to):
- Non-atomic attributes. Think of a comma-separated list of phone numbers.
- Multi-valued attributes. In this case, you might have a table with attributes ID, Name, PhoneNumber1, PhoneNumber2, and PhoneNumber3.
- Repeated columns. This is very similar to the multi-valued attribute example above; imagine a table with ID, Name, ChemistryScore, EnglishScore, MathScore, ReadingScore, etc.
In each of these examples, you throw away some of the value of a relational database model. By doing that, you make it easier to introduce bad data and can degrade performance—which is ironic, considering that the primary reason people tend not to normalize their data is to try to improve performance.
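To make the phone number examples concrete, here is a minimal sketch (with hypothetical names) of the normalized alternative: phone numbers move into their own child table, one row per number, rather than living in a delimited string or in repeated columns:

-- One row per person per phone number, with real constraints instead of parsing strings.
CREATE TABLE dbo.PersonPhoneNumber
(
    PersonID INT NOT NULL,
    PhoneNumberType VARCHAR(20) NOT NULL,   -- e.g., 'Home', 'Mobile', 'Work'
    PhoneNumber VARCHAR(20) NOT NULL,
    CONSTRAINT [PK_PersonPhoneNumber] PRIMARY KEY CLUSTERED (PersonID, PhoneNumberType),
    CONSTRAINT [FK_PersonPhoneNumber_Person] FOREIGN KEY (PersonID) REFERENCES dbo.Person(PersonID)
);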