T-SQL Anti-Patterns: Query By Agonizing Query (QBAQ)

Row By Agonizing Row (RBAR, pronounced “re-bar”) was our last anti-pattern.  Today, I’m going to extend it to a term I’m coining, Query By Agonizing Query (QBAQ, pronounced “cue-back”).  According to a quick Google search, nobody has used that term before, so that’s my claim to fame.

What I mean by QBAQ is the following anti-pattern:

UPDATE dbo.SomeTable
SET ColumnA = @Value
WHERE KeyColumn = @ID;

UPDATE dbo.SomeTable
SET ColumnB = @SomeOtherValue
WHERE KeyColumn = @ID;

UPDATE dbo.SomeTable
SET ColumnC = ColumnA + ColumnB
WHERE KeyColumn = @ID;

In this case, we see multiple queries where just one would do nicely:

UPDATE dbo.SomeTable
SET
	ColumnA = @Value,
	ColumnB = @SomeOtherValue,
	ColumnC = @Value + @SomeOtherValue
WHERE KeyColumn = @ID;

One caveat: within a single UPDATE statement, column references on the right-hand side of SET see the pre-update values, so writing ColumnC = ColumnA + ColumnB here would use the old values of those columns. Using the parameters directly keeps the result identical to the three-statement version.

The top version hits SomeTable three times. Even though each statement performs a seek (because we’re filtering on the primary key column), that’s still two more seeks than necessary. If you run this code on a regular basis, performance will be worse than it needs to be. And if these statements run within a single transaction, you’re going to hold locks on pages in SomeTable longer than necessary.

Often, QBAQ gets combined with RBAR, so that the full query might be something like:

WHILE EXISTS(SELECT * FROM #RelevantValues)
BEGIN
	SELECT TOP 1 @ID = ID
	FROM #RelevantValues;

	UPDATE dbo.SomeTable
	SET ColumnA = @Value
	WHERE KeyColumn = @ID;

	UPDATE dbo.SomeTable
	SET ColumnB = @SomeOtherValue
	WHERE KeyColumn = @ID;

	UPDATE dbo.SomeTable
	SET ColumnC = ColumnA + ColumnB
	WHERE KeyColumn = @ID;

	DELETE FROM #RelevantValues
	WHERE ID = @ID;
END

This is a piece of code we can optimize quite nicely, removing all kinds of table hits:

UPDATE st
SET
	ColumnA = @Value,
	ColumnB = @SomeOtherValue,
	ColumnC = @Value + @SomeOtherValue
FROM dbo.SomeTable st
	INNER JOIN #RelevantValues rv ON st.KeyColumn = rv.ID;

This T-SQL-specific UPDATE…FROM syntax eliminates the loop, N scans (or seeks) of #RelevantValues, N deletes from #RelevantValues, and two-thirds of our update hits against SomeTable. It’s also more concise and easier to understand.

UPDATE statements aren’t the only place I see QBAQ, but they’re the most common. I also sometimes see it with UNION abuse:

SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue = 1
UNION ALL
SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue = 2
UNION ALL
SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue = 3
UNION ALL
SELECT A, B, C FROM dbo.SomeTable WHERE SomeOtherValue = 'A'

In this case, we have four seeks or scans against SomeTable, and we can get that down to one:

SELECT 
	A, 
	B, 
	C 
FROM dbo.SomeTable 
WHERE
	SomeValue IN (1, 2, 3) 
	OR SomeOtherValue = 'A';

If you’re concerned about SARGability, you can also try that as two separate queries. One caveat: UNION ALL keeps duplicates, so if a row matches more than one branch (say, SomeValue = 1 and SomeOtherValue = 'A'), the original returns it twice while the combined query returns it once; make sure you know which behavior you actually want. In any event, we cut the number of table seeks/scans at least in half, and that adds up.
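As a sketch, that two-query form could look like the following; the NOT IN guard keeps a row matching both predicates from coming back twice (and assumes SomeValue is non-nullable):

SELECT A, B, C
FROM dbo.SomeTable
WHERE SomeValue IN (1, 2, 3)

UNION ALL

SELECT A, B, C
FROM dbo.SomeTable
WHERE
	SomeOtherValue = 'A'
	-- Guard against double-counting rows which match both branches.
	AND SomeValue NOT IN (1, 2, 3);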

Most of the time, QBAQ comes from being stuck in a procedural mindset: “I need to perform operation X, then operation Y, then operation Z.” Once you think in those terms, you’re biasing your answers toward performing discrete operations, rather than seeing if there is a good way of combining X, Y, and Z together in one query. Being able to do this is a great indicator that you truly understand and can apply set-based concepts.

Haircut Accomplished

I’m interrupting my T-SQL anti-patterns series for an important announcement:  I got a haircut.  This is my first haircut in four years, so it’s a pretty big deal for me.  Somewhere around 16″ of hair will be going to Locks of Love; when the stylist cut them off, they looked like four bushy fox tails.

Photos will probably be forthcoming.

T-SQL Anti-Patterns: Row By Agonizing Row (RBAR) Queries

The last couple of days, I poked a bit of fun at object-oriented developers who move to a relational platform and try to do things like nesting views or using functions inappropriately (although, to be fair, it’s not so much that developers are trying to use functions as that the way Microsoft implemented functions generally leads to horribly inefficient outcomes).  Today’s anti-pattern will be quick but important:  Row-By-Agonizing-Row (RBAR) queries, something I tend to rant about (such as in my tally tables presentation).

Most students learning to program are introduced first to procedural or object-oriented languages like C and C#/Java, respectively.  With these languages, certain structural patterns emerge:  conditional statements, method calls, and loops.  These work wonders in C-based languages and other procedural/OO languages, so it makes sense for a developer whose primary focus is C# to import that paradigm into SQL.  It is also a devastating performance problem.

Here’s a concrete example:  suppose that you have to loop through a set of users and set their IsActive flag to false if their last recorded login is more than three months old.  Armed solely with procedural programming tools, there’s an obvious solution here:  use two nested loops.

  1. Loop through each user in the User table, or at least those users who are currently active.
  2. For each user, run an inner loop through that user’s rows in the UserLogin table, storing the latest login date found in a local variable.
  3. Once the inner loop finishes, compare that latest login date against the three-month cutoff and, if it’s older (or the user never logged in), set the user’s IsActive flag to 0, as in the sketch below.
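Here’s a rough sketch of that double loop as nested cursors; I’m assuming hypothetical dbo.[User] and dbo.UserLogin tables with UserID, IsActive, and LoginDate columns:

DECLARE @UserID INT;
DECLARE @LoginDate DATETIME;
DECLARE @LatestLogin DATETIME;

-- Outer loop: one pass per active user.
DECLARE UserCursor CURSOR LOCAL FAST_FORWARD FOR
	SELECT UserID FROM dbo.[User] WHERE IsActive = 1;
OPEN UserCursor;
FETCH NEXT FROM UserCursor INTO @UserID;
WHILE @@FETCH_STATUS = 0
BEGIN
	SET @LatestLogin = NULL;

	-- Inner loop: walk this user's logins, tracking the latest date seen.
	DECLARE LoginCursor CURSOR LOCAL FAST_FORWARD FOR
		SELECT LoginDate FROM dbo.UserLogin WHERE UserID = @UserID;
	OPEN LoginCursor;
	FETCH NEXT FROM LoginCursor INTO @LoginDate;
	WHILE @@FETCH_STATUS = 0
	BEGIN
		IF (@LatestLogin IS NULL OR @LoginDate > @LatestLogin)
			SET @LatestLogin = @LoginDate;
		FETCH NEXT FROM LoginCursor INTO @LoginDate;
	END
	CLOSE LoginCursor;
	DEALLOCATE LoginCursor;

	-- Step 3: deactivate the user if the latest login is too old (or absent).
	IF (@LatestLogin IS NULL OR @LatestLogin < DATEADD(MONTH, -3, CURRENT_TIMESTAMP))
		UPDATE dbo.[User] SET IsActive = 0 WHERE UserID = @UserID;

	FETCH NEXT FROM UserCursor INTO @UserID;
END
CLOSE UserCursor;
DEALLOCATE UserCursor;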

As far as the logic goes, this is sound reasoning.  Unfortunately, it’s also a performance hog within SQL Server if you actually implement it as two separate cursors.  The reason is that we hit the User table N times (once for each user) and scan the UserLogin table once per user as well; if UserLogin holds M rows, that’s M*N rows read.  We can improve this a bit by putting filters on, like getting only active users and seeking on UserID so that each pass reads only m rows (the logins for that one user), turning M*N into m*N.  Nevertheless, you’re hitting the database N + M*N (or N + m*N) times.

Some people see cursors and say, “Hey, I’ll improve performance here:  I’m going to get rid of this cursor and replace it with a WHILE loop.”  Those people are wrong.  At best, WHILE loops are the same thing as cursors; at worst, you might actually write a cursor which performs better than the equivalent WHILE loop.  In either event, it’s not the appropriate solution.  The appropriate solution is to start thinking in terms of sets.

For our previous example, we can re-write the problem to be handled in one query, which would look a bit like this:

UPDATE u
SET IsActive = 0
FROM dbo.[User] u
WHERE
	NOT EXISTS
	(
		SELECT *
		FROM dbo.UserLogin ul
		WHERE
			ul.UserID = u.UserID
			AND ul.LoginDate >= DATEADD(MONTH, -3, CURRENT_TIMESTAMP)
	);

In this case, we’re hitting each table once.  With an appropriate index on dbo.UserLogin, we can make the non-existence check fast:  we scan the User table once and seek against UserLogin.  Performance will be tremendously better than the RBAR solution.  And fortunately for us, almost every cursor-based solution can be re-written using set-based logic.  There are a few maintenance functions for which you’d probably want a cursor (that’s how things like sp_msforeachdb work), but those should be few and far between.
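For that index, something like the following would do; the name is hypothetical, but keying on UserID and LoginDate lets the NOT EXISTS check seek directly to a given user’s recent logins:

CREATE NONCLUSTERED INDEX IX_UserLogin_UserID_LoginDate
	ON dbo.UserLogin (UserID, LoginDate);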

Also, it’s important to note that in the Oracle world, cursors typically perform much better than in SQL Server.  This means that even a skilled Oracle developer, upon entering the world of SQL Server, could mess this up pretty badly.  But if you’re going to spend a lot of time writing T-SQL, you really need to be able to think in terms of sets.

We’ll always have Madden, D’Qwell

Sorry to hear it’s official, but D’Qwell Jackson got released as a cap casualty. You may be saying, “But the Browns have a shit ton of cap space!” That’s true, but he’s also got a bit of an injury history and is on the wrong side of 30. More importantly, he racks up tons of tackles and that’s about it; he can’t really rush the passer and his ability to play pass defense is meh. He’s just not a good fit in a 3-4.

I would have re-signed him if I were running the Browns, but only for a year or two. This is unfortunate, but not the end of the world, especially if it means keeping T. J. Ward and/or Alex Mack.

Bill Livingston: Man of strong (and wrong) opinions

Bill Livingston, coot, continues his ramblings about the Browns and his love affair with Johnny Manziel.

What began as a “boo hoo, I used to like the Cowboys and the Browns beat us” piece eventually became “DRAFT JOHNNY MANZIEL!” I’d stop reading his stories, if not for the fact that they serve as a barometer of my own opinion: if he likes something, I should dislike it.

What really bothers me is that I’m actually not opposed to Johnny Manziel — I am opposed to mortgaging a fantastic draft for one player, especially one with a history like Manziel. Yet the more I hear people sing his praises, the more I remember Tim Tebow and Colt McCoy — “they’re leaders of men!” and so forth.


T-SQL Anti-Patterns: User-Defined Functions

Yesterday’s anti-pattern (nested views) was part one of the “Do Repeat Yourself [Sometimes]” mini-series.  Today, I’d like to get into part two of the series:  user-defined functions.  When Microsoft introduced user-defined functions in SQL Server 2000, it was an object-oriented developer’s dream come true.  It showed that Microsoft wanted to make true encapsulation—and not just views—a reality.  You would be able to define your business logic in functions (scalar functions to perform calculations, or table-valued functions to return result sets) and have everything in just one location.  The idea was fantastic, but the implementation was…problematic.

The biggest problem with user-defined functions is that they simply don’t perform well.  With a scalar function, you’re performing the operation for every row in a result set.  If this is a quick calculation that doesn’t require anything more than a bit of CPU and memory (i.e., all of the fields are already defined as part of the result set and you’re just performing mathematical or string operations on them) and you aren’t filtering by the function, a scalar function won’t kill you.  But if you need to hit a table inside your scalar function, you’re going to be accessing that table once for every row in your result set.  That will kill you.  Even worse is making the scalar function part of your WHERE clause.  In that case, SQL Server has to evaluate the function for every single row, regardless of whether that row makes it into your final result set.  In other words, scalar user-defined functions are not SARGable.
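To make that concrete, here’s a minimal sketch; dbo.Orders and dbo.fn_OrderYear are hypothetical stand-ins:

-- Anti-pattern: SQL Server must run the function against every row,
-- so an index on OrderDate cannot be used for a seek.
SELECT OrderID
FROM dbo.Orders
WHERE dbo.fn_OrderYear(OrderDate) = 2013;

-- SARGable rewrite: leave the column bare and express the range directly,
-- letting an index on OrderDate support a seek.
SELECT OrderID
FROM dbo.Orders
WHERE
	OrderDate >= '20130101'
	AND OrderDate < '20140101';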

Table-valued functions can be better (as the Dieter post above notes), but even they come with a performance cost over inlining the function’s code in your queries.  Here is an older article (from the SQL Server 2000 days) which matches my typical experience:  inline TVFs are much faster than scalar or multi-statement functions, but tend to have higher I/O requirements than the raw queries.  This makes inline TVFs a potentially reasonable solution, but one you need to test in a production-sized environment before using.
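For reference, an inline TVF is one whose entire body is a single RETURN of a SELECT statement, which lets the optimizer expand it into the calling query much like a parameterized view.  A minimal sketch, again with a hypothetical dbo.Orders table:

CREATE FUNCTION dbo.OrdersByCustomer (@CustomerID INT)
RETURNS TABLE
AS
RETURN
(
	-- Single-statement body: the optimizer can inline this into callers.
	SELECT OrderID, OrderDate, TotalDue
	FROM dbo.Orders
	WHERE CustomerID = @CustomerID
);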

So now that we’re throwing out scalar and multi-statement functions, what’s left?  If you need a result set back, an inline TVF can work, as we mentioned.  But if you need to perform complex calculations against a result set, I’d recommend using the CLR.  The CLR is faster for a number of scenarios, such as string splitting or calculating a median.  It’s still a little scary to DBAs who are insulated from the general .NET world, but it’d be a shame to miss out on such an important tool nearly a decade later just because people don’t know its value.

T-SQL Anti-Patterns: Nested Views

Our next T-SQL anti-pattern is a performance killer.  Similar to EAV, it happens when developers get a little too smart for their own good.

Most good developers know about DRY:  Don’t Repeat Yourself.  It’s an admirable principle and one to which I subscribe in many languages.  The whole idea is that once you see yourself repeating code sections, you turn that section into its own method.  Once you see yourself repeating the same patterns, look for a layer of abstraction which gets rid of the repetition and allows you to centralize all of that code in one place.  That way, you don’t have to find and fix every call site if you ever need to change the logic, and it’s a lot easier to test one single module than dozens of modules interspersed throughout a system.

In T-SQL, modularization comes in a couple of flavors.  Today’s topic is views.  A view is nothing more than a stored SELECT query.  In this example, I’m going to use the AdventureWorks2012 database.

First, let’s say that we have a query which we run regularly.  It focuses on sales order headers and details, pulling back information that we require.  This gets called from a few locations, so we turn it into a view, like this one:

CREATE VIEW vSalesOrderDetail_1 AS
SELECT
	soh.SalesOrderID,
	soh.OrderDate,
	soh.DueDate,
	soh.Status,
	st.Name as SalesTerritory,
	cc.CardType,
	cc.CardNumber,
	soh.SubTotal,
	soh.TotalDue,
	sod.SalesOrderDetailID,
	sod.OrderQty,
	sod.UnitPrice,
	sod.LineTotal
FROM
	Sales.SalesOrderHeader soh
	INNER JOIN Sales.SalesOrderDetail sod ON soh.SalesOrderID = sod.SalesOrderID
	INNER JOIN Sales.SalesTerritory st ON soh.TerritoryID = st.TerritoryID
	INNER JOIN Sales.CreditCard cc ON soh.CreditCardID = cc.CreditCardID;

The nice part here is that when we execute the view, the query plan is exactly the same as if we had run the query itself:

[Image 1: View plan]

Once we have that view, we see how much easier life is with the code in one central location, so we start creating other views.  Eventually, the object-oriented programming portion of our brain kicks in and says that, hey, we could use views as an encapsulation layer, letting us repeat less and less code.  Why remember all of these joins when we can just do it in a couple of views?  Furthermore, we could have one main view and just get the parts that we want.  For example, let’s imagine that we start with our base view but only need three items:  SalesOrderID, SalesOrderDetailID, and LineTotal.
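The query would look something like this:

SELECT
	v.SalesOrderID,
	v.SalesOrderDetailID,
	v.LineTotal
FROM vSalesOrderDetail_1 v;

Here’s what our execution plan looks like afterward: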

[Image 2: View reduction plan]

Well, that’s a little more complex than we expected.  I mean, we’re really only getting data from two tables, so why are we still getting credit card info?  Here’s what the raw query looks like:

[Image 3: Raw query plan]

That’s more like it.  SQL Server estimates that the raw query costs about 1/5 as much as getting the same records back through our view.

The reason for this is that our view uses INNER JOINs, and an inner join acts as a filter:  a sales order without a matching credit card record would drop out of the results.  Because those joins can change which rows come back, SQL Server cannot eliminate them, even though we never select any of their columns.  Let’s change all of the INNER JOINs to LEFT OUTER JOINs instead.  Here’s the new query plan:

[Image 4: LEFT OUTER JOIN query plan]

Now that’s more like it.  So, problem solved:  we’ll just use LEFT OUTER JOINs for everything.  This individual performance gain comes at a potential overall cost, however:  INNER JOINs allow the SQL Server optimizer to start from any table and drive through the query as it sees fit.  When you change an INNER JOIN to a LEFT OUTER JOIN, you constrain that reordering, and an excellent filter on the right-side table won’t be applied as early as it could have been with an INNER JOIN.

Furthermore, once we get more complicated queries, the optimizer sort of gives up on us.  Let’s see an example.  First, I’m going to create another view, building off our earlier one and adding in the salesperson’s e-mail address:

CREATE VIEW vEmail AS
SELECT
	v.SalesOrderID,
	v.SalesOrderDetailID,
	v.LineTotal,
	be.BusinessEntityID,
	p.FirstName,
	p.LastName,
	e.EmailAddress
FROM
	vSalesOrderDetail_1 v
	LEFT OUTER JOIN Sales.SalesOrderHeader soh ON v.SalesOrderID = soh.SalesOrderID
	LEFT OUTER JOIN Sales.SalesPerson sp ON soh.SalesPersonID = sp.BusinessEntityID
	LEFT OUTER JOIN Person.BusinessEntity be ON sp.BusinessEntityID = be.BusinessEntityID
	LEFT OUTER JOIN Person.Person p ON p.BusinessEntityID = be.BusinessEntityID
	LEFT OUTER JOIN Person.EmailAddress e ON e.BusinessEntityID = p.BusinessEntityID;

Then, we’ll create one more query which reads from vEmail and joins back to sales order header a third time to get one more field.  Here’s what our final query looks like:

SELECT
	v.SalesOrderID,
	soh.AccountNumber,
	v.SalesOrderDetailID,
	v.LineTotal,
	sp.CommissionPct,
	v.BusinessEntityID
FROM
	vEmail v
	INNER JOIN Sales.SalesPerson sp ON v.BusinessEntityID = sp.BusinessEntityID
	INNER JOIN Sales.SalesOrderHeader soh ON soh.SalesOrderID = v.SalesOrderID
WHERE
	v.SalesOrderID IN (43659, 43660, 43661);

And here is our execution plan:

[Image 5: Nested views query plan]

You’ll notice that our execution plan includes joins to tables like EmailAddress, even though we never referenced them in our query.  It also joins SalesOrderHeader three times and SalesPerson twice.  The reason is that our query, joining to a view inside a view, was complex enough that SQL Server basically gave up on trying to find the optimal re-write.  For those interested, the optimal form of this query is:

SELECT
	soh.SalesOrderID,
	soh.AccountNumber,
	sod.SalesOrderDetailID,
	sod.LineTotal,
	sp.CommissionPct,
	sp.BusinessEntityID
FROM
	Sales.SalesOrderHeader soh
	LEFT OUTER JOIN Sales.SalesOrderDetail sod ON soh.SalesOrderID = sod.SalesOrderID
	LEFT OUTER JOIN Sales.SalesPerson sp ON soh.SalesPersonID = sp.BusinessEntityID
WHERE
	soh.SalesOrderID IN (43659, 43660, 43661);

When running these two queries side-by-side, the nested view form of the query had an estimated cost of 77% of the batch, whereas the optimal query was only 23%.  In terms of I/O, the nested view query performed 199 logical reads versus 24 for the optimal form.  If we run this query a huge number of times, that adds up fast.

The general rule to follow is that the optimizer is smart, but it can get tricked pretty easily.  In this case, it was pretty easy for a human to tell what a superior query would look like, but the optimizer didn’t have enough time to whittle down the set of tables, figure out which ones were actually necessary, and pull the correct data from those.  Instead, it saw views inside of views and pretty much gave up, deconstructing the views as-is rather than merging their component tables together like a human would.  Because of this, nesting views is a dangerous anti-pattern.  A junior developer may not understand what’s going on and won’t be able to figure out why a simple query with just two joins performs so poorly.  The answer is that there are more than just two joins; the two joins are just the surface, and it’s views all the way down.

Where art thou, Ray Farmer?

Ray Farmer is now the sole GM of the Cleveland Browns. This is cause for much rejoicing. Yet he has two incredibly easy decisions to make and he isn’t making them: re-signing Alex Mack (C) and T. J. Ward (SS).

Mack is on a humanitarian trip to Brazil at the moment and won’t return until the end of the month. Okay, fine, but then why aren’t you re-signing Boss Ward (that’s actually his nickname, and I’ve grown to like it)?

I won’t use the “they were Pro Bowlers” argument here. I’ll simply state that T. J. Ward is a fantastic old-school safety (in that he hits the bejeesus out of people) and Alex Mack is one of the best centers in the NFL. The Browns have a shit ton of cap space. I mean, there’s literally no reason why they couldn’t re-sign both, use all of their draft picks, and enjoy the results. Maybe even add a free agent or two.

Here’s Football Outsiders’ take on the situation, by the way. FO also says we ought to trade up with the Rams, though. If the Rams would take the #4 pick, a second-rounder, and something else, I’d do it in a heartbeat, but I’d be shocked if they didn’t demand at the very least our second first-rounder, and I’d be hard pressed to give that up, were I Ray Farmer.

The theory of gaming: Alhambra

Note: all times assume a two-player game; please adjust accordingly for more players

Publisher: Queen Games/Dirk Henn

Type of game: Tile game

Number of players: 2-6

Learning curve: Easy

Estimated time to play (first time/subsequent times): 45 minutes/30 minutes

Estimated setup time: 5 minutes

This game is quite a contrast to Call of Cthulhu: The Card Game. There isn’t one cosmic horror from beyond. Not one! It is, however, a surprisingly fun little game that my wife picked out.

The concept is pretty simple: you are trying to build an Alhambra that’s more awesome than the other players’. You do so by collecting currency and spending it on tiles to expand your Alhambra. In the basic version, that’s all there is to it, although there are many expansions. You score points in several scoring rounds until the game reaches its end.

There is a certain amount of luck, but a surprising amount of skill as well. It’s probably even better with more than two players!