The last couple of days, I poked a bit of fun at object-oriented developers who move to a relational platform and try to do things like nesting views or using functions inappropriately (although, to be fair, it’s not so much that developers are trying to use functions as that the way Microsoft implemented functions leads generally to horribly inefficient outcomes). Today’s anti-pattern will be quick but important. Today, I’m going to talk about Row-By-Agonizing-Row (RBAR) queries, something I tend to rant about (such as in my tally tables presentation).
Most students learning to program tend to get introduced first to structural or object-oriented languages like C and C#/Java, respectively. With these languages, we see certain structural patterns emerge: conditional statements, method redirection, and loops. These work wonders in C-based languages, as well as other structural/OO languages, and so it makes sense for a developer whose primary focus is C# to import that paradigm into writing SQL. It is also a devastating performance problem.
Here’s a concrete example: suppose that you have to loop through a set of users and set their IsActive flag to be false if their last recorded login is earlier than three months ago. Armed solely with structural programming tools, there’s an obvious solution here. The obvious solution is to use two separate loops:
- Loop through each user in the User table, or at least those users who are currently active.
- For each user, we create another loop which goes through the UserLogin table. We store as a local variable the latest login date for that user which we have found.
As far as the logic goes, this is sound reasoning. Unfortunately, it’s also a performance hog within SQL Server if you actually try to implement this as two separate cursors. The reason is that we’ll be scanning the User table N times (once for each user) and the UserLogin M*N times (once for each user login—for each user). We can improve this a bit by putting filters on, like getting only active users and making M*N into m*N, where m is the number of UserLogin records associated with that user. Nevertheless, you’re hitting the database N + M*N (or N + m*N) times.
Some people see cursors and say, “Hey, I’ll improve performance here: I’m going to get rid of this cursor and replace it with a WHILE loop.” Those people are wrong. At best, WHILE loops are the same thing as cursors; at worst, you might actually write a cursor which performs better than the equivalent WHILE loop. In either event, it’s not the appropriate solution. The appropriate solution is to start thinking in terms of sets.
For our previous example, we can re-write the problem to be handled in one query, whose pseudocode would look a bit like:
UPDATE dbo.User SET IsActive = 0 WHERE NOT EXISTS ( SELECT * FROM dbo.UserLogin ul WHERE ul.UserID = UserID AND ul.LoginDate >= DATEADD(MONTH, -3, CURRENT_TIMESTAMP) );
In this case, we’re hitting the database one time. With an appropriate index on dbo.UserLogin, we can make the non-existence query fast, so we’re scanning the User table and seeking against UserLogin. Performance will be tremendously better than the RBAR solution. And fortunately for us, almost every cursor-based solution can be re-written using set-based logic. There are a few maintenance functions that you’d probably want to use a cursor for (and that’s how things like sp_msforeachdb work), but those should be few and far between.
Also, it’s important to note that in the Oracle world, cursors typically perform much better than in SQL Server. This means that even a skilled Oracle developer, upon entering the world of SQL Server, could mess this up pretty badly. But if you’re going to spend a lot of time writing T-SQL, you really need to be able to think in terms of sets.