Aren’t Functions A Good Thing?
As developers, we want to modularize code in order to prevent duplication. There are several laudable reasons to prevent code duplication. The simplest reason is that we all have better things to do than type the same lines of code over and over. Putting code into a common module or function makes calling that code simpler and means we type less.
Another reason to prevent code duplication is to prevent errors. If we have five copies of a calculation and that calculation later changes, it’s easy to accidentally miss updating one of those copies and end up with two separate versions of the calculation. The best-case scenario is that this will cause users to doubt the veracity of your data; the worst-case scenario is that you will permanently damage your application by unknowingly storing incorrect data that you cannot later fix.
Yet another reason developers want to use functions is to make testing easier. When a piece of logic exists in one place, we can write a single set of tests to cover that logic. If that logic is strewn about in many different places, testing becomes harder for the same reason as above: you might miss one version of the calculation or operation.
All of this shows that there are definite advantages to writing modular code using functions or other language constructs.
So Why Are They A Bad Thing?
In SQL Server, modularization usually comes in three forms: stored procedures, functions, and views. I am a huge fan of stored procedures, am somewhat wary of views, and am strongly anti-function. The reason comes down to performance. Stored procedures have no performance overhead compared to ad hoc SQL, so there is no performance reason to avoid them. Views theoretically should not have any performance impact either, but when you nest views deeply enough, or hit complex enough views, the query optimizer can get confused and generate an unnecessarily complex execution plan which performs poorly. Oftentimes, an easy fix to a poorly-performing query involving a view is to bring the relevant view code in-line.
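As a quick illustration of the nesting problem, here is a purely hypothetical pair of views; the tables and columns are invented for the example, not taken from any real system.

-- Hypothetical example of view nesting: vCustomerMonthlySales is built on top of
-- vOrderTotals, which is itself built on the base tables.
CREATE VIEW dbo.vOrderTotals AS
SELECT o.OrderID, o.CustomerID, o.OrderDate,
       SUM(ol.Quantity * ol.UnitPrice) AS OrderTotal
FROM dbo.Orders o
    INNER JOIN dbo.OrderLines ol ON ol.OrderID = o.OrderID
GROUP BY o.OrderID, o.CustomerID, o.OrderDate;
GO
CREATE VIEW dbo.vCustomerMonthlySales AS
SELECT ot.CustomerID, YEAR(ot.OrderDate) AS SalesYear, MONTH(ot.OrderDate) AS SalesMonth,
       SUM(ot.OrderTotal) AS MonthlySales
FROM dbo.vOrderTotals ot
GROUP BY ot.CustomerID, YEAR(ot.OrderDate), MONTH(ot.OrderDate);
GO
-- If a query against vCustomerMonthlySales performs poorly, rewriting it directly
-- against dbo.Orders and dbo.OrderLines often gives the optimizer a simpler plan.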
Functions are like views, except much, much worse. The problem with functions tends to be the same regardless of whether you use scalar, multi-statement, or table-valued functions: they perform poorly, much worse than their alternatives. If you create a user-defined scalar function and use it in your WHERE clause or in a JOIN condition, you can pretty much guarantee that SQL Server will go through your table row by row, running that function every time. This is bad enough if your function is a fairly simple calculation, but if you need to look up data from another table, you’re performing a separate lookup against that table every single time.

Suppose you have 10 million rows in your main table and a function which does a quick lookup against a tiny table with just 30 rows. To make the example starker, let’s say that the 10 million rows are ordered in such a way that we could profit most from a merge join against the 30-row table. It would make sense to grab those 30 rows and stream the 10 million rows through, matching each one up against the relevant lookup row along the way. But if we put this lookup operation into a function, it gets called once for each of the 10 million rows. This will most likely not be a merge join and will lead to a huge amount of I/O. The worst part is that your execution plan will show what one lookup looks like but fail to mention that it’s going to happen 10 million times, meaning that if you don’t understand the performance problems of functions, it’s easy to look at the plan and conclude that the function will actually perform better…until you see the total time elapsed!
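To make the shape of the problem concrete, here is a hypothetical scalar function along those lines; the dbo.Shift and dbo.ShiftType tables are invented for the example.

-- Hypothetical scalar function which looks up a value from a tiny lookup table.
-- Imagine dbo.Shift has 10 million rows and dbo.ShiftType has 30.
CREATE FUNCTION dbo.GetPayMultiplier (@ShiftTypeID INT)
RETURNS DECIMAL(5, 2)
AS
BEGIN
    DECLARE @Multiplier DECIMAL(5, 2);

    SELECT @Multiplier = st.PayMultiplier
    FROM dbo.ShiftType st
    WHERE st.ShiftTypeID = @ShiftTypeID;

    RETURN @Multiplier;
END
GO
-- Calling the function per row forces SQL Server to run the lookup once for each
-- of the 10 million rows in dbo.Shift instead of joining to the 30-row table once.
SELECT s.ShiftID, s.HoursWorked * dbo.GetPayMultiplier(s.ShiftTypeID) AS AdjustedHours
FROM dbo.Shift s
WHERE dbo.GetPayMultiplier(s.ShiftTypeID) > 1.0;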
A Case Study: User Total Time
A customer of mine produces software which optimizes employee schedules. They track shift times, figure out when an employee hits the overtime cutoff, and can track shift data from the past, present, and future. The way they did this was with a function called UserTotalTime(). This function called a few scalar functions to check parameters and then performed a somewhat expensive set of table joins, aggregating the data into a result set with the user ID, time period, and number of hours worked plus hours scheduled, leading to a determination of whether that employee has overtime hours or not.
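A heavily simplified sketch of that pattern might look something like the following; the table names, helper functions, and columns are illustrative stand-ins rather than the customer’s real schema.

-- Illustrative sketch only; the real UserTotalTime() was considerably more involved.
CREATE FUNCTION dbo.UserTotalTime (@UserID INT, @PeriodStart DATE, @PeriodEnd DATE)
RETURNS @Results TABLE
(
    UserID INT,
    PeriodStart DATE,
    PeriodEnd DATE,
    TotalHours DECIMAL(9, 2),
    IsOvertime BIT
)
AS
BEGIN
    -- Hypothetical scalar helper functions validating inputs, each a hidden per-call cost.
    IF dbo.IsValidUser(@UserID) = 0 OR dbo.IsValidPeriod(@PeriodStart, @PeriodEnd) = 0
        RETURN;

    INSERT INTO @Results (UserID, PeriodStart, PeriodEnd, TotalHours, IsOvertime)
    SELECT s.UserID, @PeriodStart, @PeriodEnd,
           SUM(s.HoursWorked + s.HoursScheduled),
           CASE WHEN SUM(s.HoursWorked + s.HoursScheduled) > op.OvertimeCutoff THEN 1 ELSE 0 END
    FROM dbo.Shift s
        INNER JOIN dbo.OvertimePolicy op ON op.PolicyID = s.PolicyID
    WHERE s.UserID = @UserID
        AND s.ShiftDate BETWEEN @PeriodStart AND @PeriodEnd
    GROUP BY s.UserID, op.OvertimeCutoff;

    RETURN;
END
GO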
When looking at one user, or even all users over a single time period, the process didn’t perform all that badly, and so this function was integrated into more and more code and more and more complex procedures. At that point, they started to notice the solution beginning to burst at the seams. Queries for larger customers were taking an unacceptably long time and they weren’t sure what they could do to improve the situation. It got to the point where they started deleting some data to limit how many rows were in the tables; that provided a temporary boost to performance, but it wasn’t a good long-term strategy because it limited future analytics.
What To Do?
Taking a look at the function, the first thing I realized was that this generally did not need to be a function, nor did it need to be as complex as it was. The scalar functions which checked parameters could be simplified or eliminated outright, as could part of the main calculation. At that point, I could inline the remaining code and modify the procedures to build the result set directly instead of calling a function.
In this case, however, I went a step further. This customer’s database is really closer to a warehouse than a true transactional system: the read-to-write ratio is in the thousands or tens of thousands; people rarely update data, but they definitely read it a lot. As a result, I ended up creating a brand new reporting table which holds the aggregated results of this query, updating it as part of an ETL process which runs whenever rows get updated, as well as periodically (just in case somebody modifies data in a non-standard manner). This works well for their environment, although it certainly isn’t a solution to every function-based performance problem.
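As a rough sketch of that approach (again with illustrative names rather than the customer’s actual schema):

-- Illustrative reporting table holding pre-aggregated user time per pay period.
CREATE TABLE dbo.UserTotalTimeReport
(
    UserID INT NOT NULL,
    PeriodStart DATE NOT NULL,
    PeriodEnd DATE NOT NULL,
    TotalHours DECIMAL(9, 2) NOT NULL,
    IsOvertime BIT NOT NULL,
    CONSTRAINT PK_UserTotalTimeReport PRIMARY KEY (UserID, PeriodStart)
);
GO
-- The ETL step rebuilds the table; a simple version truncates and reloads.
TRUNCATE TABLE dbo.UserTotalTimeReport;

INSERT INTO dbo.UserTotalTimeReport (UserID, PeriodStart, PeriodEnd, TotalHours, IsOvertime)
SELECT s.UserID, p.PeriodStart, p.PeriodEnd,
       SUM(s.HoursWorked + s.HoursScheduled),
       CASE WHEN SUM(s.HoursWorked + s.HoursScheduled) > op.OvertimeCutoff THEN 1 ELSE 0 END
FROM dbo.Shift s
    INNER JOIN dbo.PayPeriod p ON s.ShiftDate BETWEEN p.PeriodStart AND p.PeriodEnd
    INNER JOIN dbo.OvertimePolicy op ON op.PolicyID = s.PolicyID
GROUP BY s.UserID, p.PeriodStart, p.PeriodEnd, op.OvertimeCutoff;

Procedures then read from dbo.UserTotalTimeReport directly, so the expensive aggregation runs once per ETL cycle instead of once per query.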
What Can You Do?
The easiest way to fix performance problems with functions is to remove the function and bring the function’s code inline. This doesn’t feel very nice for a developer, as it means copying code and violates the Don’t Repeat Yourself mantra, but if you’re running into performance problems, a little repetition is a small cost to pay for code which performs up to business needs.
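Going back to the hypothetical lookup function from earlier, the inlined version is just a join:

-- Inlined version of the earlier (hypothetical) lookup: one join instead of one
-- function call per row, which lets the optimizer pick a sensible join strategy.
SELECT s.ShiftID, s.HoursWorked * st.PayMultiplier AS AdjustedHours
FROM dbo.Shift s
    INNER JOIN dbo.ShiftType st ON st.ShiftTypeID = s.ShiftTypeID
WHERE st.PayMultiplier > 1.0;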
Bonus Answer: APPLY Yourself
Another tool to help get rid of pesky functions is the APPLY operator. I just happen to have a presentation on the topic. The great thing about the APPLY operator is that it operates like a table-valued function, but because the code is inline, you don’t get the performance problems that you would with a function.
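As a sketch, here is a correlated lookup written with CROSS APPLY rather than a per-user table-valued function call; the tables are again illustrative.

-- Per-user aggregation kept inline with CROSS APPLY instead of a table-valued function.
DECLARE @PeriodStart DATE = '2019-01-01', @PeriodEnd DATE = '2019-01-14';

SELECT u.UserID, t.TotalHours
FROM dbo.Users u
    CROSS APPLY
    (
        SELECT SUM(s.HoursWorked + s.HoursScheduled) AS TotalHours
        FROM dbo.Shift s
        WHERE s.UserID = u.UserID
            AND s.ShiftDate BETWEEN @PeriodStart AND @PeriodEnd
    ) t;

Because the optimizer sees the whole query at once, it can choose a set-based plan rather than executing a function once per row.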
Stay Tuned
In tomorrow’s blog post, I’m going to circle back to UserTotalTime, showing why a function appears useful, how it can perform poorly, and how we can deconstruct a function and get back to an acceptable level of performance.