Our next T-SQL anti-pattern is a performance killer. Similar to EAV, it happens when developers get a little too smart for their own good.
Most good developers know about DRY: Don’t Repeat Yourself. It’s an admirable principal and one to which I subscribe in many languages. The whole idea is, once you see yourself repeating code sections, turn that code section into its own method. Once you see yourself repeating the same patterns, look for a layer of abstraction which gets rid of the repetition and allows you to centralize all of that code in one place. That way, you don’t have to worry about finding and fixing all of the places if you ever need to change that logic, and it’s a lot easier to test one single module than dozens of modules interspersed throughout a system.
In T-SQL, modularization comes in a couple of flavors. Today’s topic is views. A view is nothing more than a stored SELECT query. In this example, I’m going to use the AdventureWorks2012 database.
First, let’s say that we have a query which we run regularly. It focuses on sales order headers and details, pulling back information that we require. This gets called from a few locations, so we turn it into a view, like this one:
CREATE VIEW vSalesOrderDetail_1 AS SELECT soh.SalesOrderID, soh.OrderDate, soh.DueDate, soh.Status, st.Name as SalesTerritory, cc.CardType, cc.CardNumber, soh.SubTotal, soh.TotalDue, sod.SalesOrderDetailID, sod.OrderQty, sod.UnitPrice, sod.LineTotal FROM Sales.SalesOrderHeader soh INNER JOIN Sales.SalesOrderDetail sod ON soh.SalesOrderID = sod.SalesOrderID INNER JOIN Sales.SalesTerritory st ON soh.TerritoryID = st.TerritoryID INNER JOIN Sales.CreditCard cc ON soh.CreditCardID = cc.CreditCardID;
The nice part here is that when we execute the view, the query plan is exactly the same as if we had run the query itself:
Once we have that view, we see how much easier life is with the code in one central location, so we start creating other views. Eventually, the object-oriented programming portion of our brain kicks in and says that, hey, we could use the views as an encapsulation layer, letting us repeat less and less code. Why remember all of these joins when we can just do it in a couple of views? Furthermore, we could have one main view and just get the parts that we want. For example, let’s imagine that we start with our base view, but we only need a couple of items: SalesOrderID, SalesOrderDetailID, and LineTotal. Here’s what our execution plan looks like afterward:
Well, that’s a little more complex than we expected. I mean, we’re really only getting data from two tables, so why are we still getting credit card info? Here’s what the raw query looks like:
That’s more like it. As far as it goes, SQL Server estimates that raw query would be about 1/5 of the cost of getting the same records back from our view.
The reason for this is that our view had several INNER JOINs, meaning that we put explicit filters on the query, and so SQL Server needed to make those joins. Let’s change all of the INNER JOINs to LEFT OUTER JOINs instead. Here’s the new query plan:
Now that’s more like it. So, problem solved: we’ll just use LEFT OUTER JOINs for everything. This individual performance gain comes at a potential overall performance cost, however: when you use INNER JOINs, you allow the SQL Server optimizer to start from any table and drive through the query as it sees fit. When you change a join to a LEFT OUTER JOIN instead of an INNER JOIN, you force the left table to be accessed first. If you have an excellent filter on the right table, it won’t be used as early as if the tables were INNER JOINed together.
Furthermore, once we get more complicated queries, the optimizer sort of gives up on us. Let’s see an example. First, I’m going to create a couple more views. The first builds off our example, adding in the salesperson’s e-mail address:
CREATE VIEW vEmail AS SELECT v.SalesOrderID, v.SalesOrderDetailID, v.LineTotal, be.BusinessEntityID, p.FirstName, p.LastName, e.EmailAddress FROM vSalesOrderDetail_1 v LEFT OUTER JOIN Sales.SalesOrderHeader soh ON v.SalesOrderID = soh.SalesOrderID LEFT OUTER JOIN Sales.SalesPerson sp ON soh.SalesPersonID = sp.BusinessEntityID LEFT OUTER JOIN Person.BusinessEntity be ON sp.BusinessEntityID = be.BusinessEntityID LEFT OUTER JOIN Person.Person p ON p.BusinessEntityID = be.BusinessEntityID LEFT OUTER JOIN Person.EmailAddress e ON e.BusinessEntityID = p.BusinessEntityID;
Then, we’ll create one more query which reads from vEmail and joins back to sales order header a third time to get one more field. Here’s what our final query looks like:
SELECT v.SalesOrderID, soh.AccountNumber, v.SalesOrderDetailID, v.LineTotal, sp.CommissionPct, v.BusinessEntityID FROM vEmail v INNER JOIN Sales.SalesPerson sp ON v.BusinessEntityID = sp.BusinessEntityID INNER JOIN Sales.SalesOrderHeader soh ON soh.SalesOrderID = v.SalesOrderID WHERE v.SalesOrderID IN (43659, 43660, 43661);
And here is our execution plan:
You’ll notice that our execution plan includes table joins like EmailAddress, even though we never specified that in our query. It also joins SalesOrderHeader three times and SalesPerson twice. The reason is that our query joining to a view inside a view was complex enough that SQL Server basically gave up on trying to come up with the optimal re-write of our query. For those interested, the optimal form of this query is:
SELECT soh.SalesOrderID, soh.AccountNumber, sod.SalesOrderDetailID, sod.LineTotal, sp.CommissionPct, sp.BusinessEntityID FROM Sales.SalesOrderHeader soh LEFT OUTER JOIN Sales.SalesOrderDetail sod ON soh.SalesOrderID = sod.SalesOrderID LEFT OUTER JOIN Sales.SalesPerson sp ON soh.SalesPersonID = sp.BusinessEntityID WHERE soh.SalesOrderID IN (43659, 43660, 43661);
When running these two queries side-by-side, the nested view form of the query had a cost estimate of 77% of the batch, whereas the optimal query was only 23%. In terms of I/O, the nested view query had a total of 199 logical reads versus 24 for the optimal form. If we run this query a huge number of itmes, that starts to add up fast.
The general rule to follow is that the optimizer is smart, but it can get tricked pretty easily. In this case, it was pretty easy to tell what a superior query would look like, but the optimizer didn’t have enough time to whittle down the sets of tables, figure out which ones were actually necessary, and pull the correct data from those. Instead, it saw views inside of views and pretty much gave up, deconstructing the views as-is rather than merging their component tables together like a human would. Because of this, nesting views is a dangerous anti-pattern. A junior developer may not understand what’s going on and won’t be able to figure out why a simple query with just two joins performs so poorly. The answer is, because there are more than just two joins; the two joins are just the surface and it’s views all the way down.
” It’s an admirable principal and one to which I subscribe in many languages.”
Not English, apparently.