Modularize(?)

Let’s hit the final topic of my Working Effectively with Legacy SQL series.  Today, we discuss the topic of modularization.  In object-oriented programming, modularization is a great idea.  One of the “clean code” tenets is Don’t Repeat Yourself:  if you have two pieces of code doing the same work, try to unify that code, isolating that particular functionality in its own part of the code base.

This sounds like a fantastic idea, but it turns out not to work quite as well in SQL Server.  There are two reasons for this:  first, T-SQL is not an object-oriented language; second, trying to modularize complex queries can confuse the query optimizer and give you terrible execution plans and performance.  I wrote about this topic just about a year ago, so I’ll link to my explanations of why nested views and user-defined functions can be anti-patterns in T-SQL.

The gist of these two blog posts is that the most common forms of encapsulating data (views and user-defined functions) can come with a major performance loss, and so you should use them either sparingly or not at all.  In the spirit of improving the lives of our maintenance developers, however, let’s think about some ways in which modularization can make sense.

The first way in which modularization can make sense is separating commonly-used, independent queries into their own procedures.  Last time around, I talked about error and performance logging.  I can move those INSERT statements into their own separate utility procedures and simply call them from my main procedures.  The reason I would want to do this is that it makes future changes easier.  Suppose that I need to insert a record into PerformanceLog today, but a year from now, I’m inserting it into a different table with a different column structure, or I’m firing off a Service Broker message, or maybe I’ve got dynamic logic to determine whether this record should even be inserted into the table.  These are changes I don’t want to make in every single stored procedure, so it makes sense to wrap them up in a utility procedure.
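
Here is a minimal sketch of the idea.  The dbo.LogProcedurePerformance name is made up, and the column list matches the PerformanceLog table from that post; the point is that callers only need one EXEC and never see the underlying table.

CREATE PROCEDURE dbo.LogProcedurePerformance
	@DatabaseName NVARCHAR(128),
	@SchemaName NVARCHAR(128),
	@ProcedureName NVARCHAR(128),
	@StartDateGMT DATETIME2(7),
	@EndDateGMT DATETIME2(7),
	@ParameterList XML
AS

--If the logging target ever changes (a different table, a Service Broker message, etc.),
--only this procedure needs to change.
INSERT INTO dbo.PerformanceLog
(
	DatabaseName,
	SchemaName,
	ProcedureName,
	StartDateGMT,
	EndDateGMT,
	ParameterList
)
VALUES
(
	@DatabaseName,
	@SchemaName,
	@ProcedureName,
	@StartDateGMT,
	@EndDateGMT,
	@ParameterList
);
GO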

Another example could be the SELECT statements you use to load certain tables.  Suppose you have a batch insertion procedure which selects data from some table with certain filters and then inserts records into a separate table.  Splitting that portion out into its own procedure can make sense on a couple of levels.  First, it lets you troubleshoot the SELECT portion of the query independently of the INSERT portion.  Second, if you also need to call the SELECT statement separately for reporting purposes, you can do so.  Just remember that if you do decide to create a sub-procedure, you incur a little bit of overhead cost and make your overall design a little more complex, so save this for cases in which it really makes sense.
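
As a hedged sketch with made-up table and procedure names, that split might look like this, with the insertion procedure consuming the SELECT procedure via INSERT...EXEC:

CREATE PROCEDURE dbo.GetOrdersToArchive
	@CutoffDate DATE
AS

SELECT
	o.OrderID,
	o.CustomerID,
	o.OrderDate
FROM dbo.Orders o
WHERE
	o.OrderDate < @CutoffDate;
GO

CREATE PROCEDURE dbo.ArchiveOrders
	@CutoffDate DATE
AS

INSERT INTO dbo.OrderArchive
(
	OrderID,
	CustomerID,
	OrderDate
)
EXEC dbo.GetOrdersToArchive
	@CutoffDate = @CutoffDate;
GO

Now you can run dbo.GetOrdersToArchive on its own to troubleshoot or report on the data, while dbo.ArchiveOrders stays a thin wrapper around the insert.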

A third way in which you can think of implementing DRY within the SQL Server world is getting rid of redundant procedures.  If you have one procedure which returns { A, B } and you have a separate procedure which returns { A, B, C }, re-work the code to call the latter procedure and drop the former.  I can see exceptions when you have a fast { A, B } (which uses a covering, non-clustered index) versus another procedure which has { A, B, …, Y, Z }, but if the difference is minor and the relevant index covers both queries, stick with one procedure.  Also, get rid of any true duplicates.  You might not expect to see any duplicate procedures, but they can pop up:  two procedures with different names which return exactly the same result set.  Maybe they returned different results at one point in time, or maybe the person who wrote the second procedure didn’t know about the first; whatever the reason, there is zero value in maintaining two versions of the same procedure.  Sometimes you have to do some mental parsing to determine if two queries are exactly the same (and after you do that, reflect upon just how great the optimizer is, as it tries to do the same thing in a fraction of a second).  If you do have duplicates, here is a good place to practice DRY.  As a quick side note, this applies just the same to redundant indexes, so check those out as well.
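
Since I mentioned redundant indexes, here is a rough, hedged sketch of one way to hunt for them:  find pairs of indexes on the same table where one index’s key list is a left-hand prefix of another’s.  Treat the output as candidates to review, not as an automatic drop list.

WITH IndexKeys AS
(
	SELECT
		i.object_id,
		i.index_id,
		i.name AS IndexName,
		STUFF
		(
			(
				SELECT ', ' + c.name
				FROM sys.index_columns ic
					INNER JOIN sys.columns c
						ON ic.object_id = c.object_id
						AND ic.column_id = c.column_id
				WHERE
					ic.object_id = i.object_id
					AND ic.index_id = i.index_id
					AND ic.is_included_column = 0
				ORDER BY ic.key_ordinal
				FOR XML PATH('')
			), 1, 2, ''
		) AS KeyColumns
	FROM sys.indexes i
		INNER JOIN sys.tables t
			ON i.object_id = t.object_id
	WHERE
		i.index_id > 0
)
SELECT
	OBJECT_SCHEMA_NAME(narrower.object_id) AS SchemaName,
	OBJECT_NAME(narrower.object_id) AS TableName,
	narrower.IndexName AS PossiblyRedundantIndex,
	narrower.KeyColumns AS NarrowerKeyColumns,
	wider.IndexName AS CoveringIndex,
	wider.KeyColumns AS WiderKeyColumns
FROM IndexKeys narrower
	INNER JOIN IndexKeys wider
		ON narrower.object_id = wider.object_id
		AND narrower.index_id <> wider.index_id
		--Crude prefix match; column names containing wildcard characters could
		--produce false positives, so review the results before dropping anything.
		AND wider.KeyColumns + ', ' LIKE narrower.KeyColumns + ', %';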

Wrapping this section up, there are relatively few ways in which we as T-SQL developers can practice the DRY principle.  In many cases, it makes more sense to de-modularize code, moving out of nested views and away from table-valued functions to standard queries against tables.  This is because we want to give the query optimizer a fighting chance at coming up with an execution plan which performs well and, unlike object-oriented languages like C# and Java (where the marginal cost of an extra object is negligible), the performance penalty for encapsulating code in T-SQL is fairly high.

Improve Code Durability And Visibility

We’re almost at the end of my Working Effectively with Legacy SQL series.  My penultimate topic will be all about durability and visibility of your code.  Looking at web and application development, we see certain patterns about logging and error handling which make sense in the T-SQL world, and I will cover those.  I also intend to cover a couple of SQL-specific tips.

Error Handling

Let’s start with the point of common ground:  error handling.  Your applications (should) have error handling logic; in the world of .NET, that is typically exceptions which you handle with TRY-CATCH blocks.  Microsoft implemented TRY-CATCH logic in SQL Server 2005, where it replaced the venerable practice of checking @@ERROR after every statement.  I highly recommend that you switch to using TRY-CATCH whenever possible in your SQL code, as it makes error handling more reliable and more readable.  If you are familiar with TRY-CATCH from C# or Java, the T-SQL version is very similar, except that you don’t get multiple CATCH blocks for handling different classes of exceptions.

So here’s a simple starting procedure, with no error handling yet.  We’re going to use this procedure throughout the post as we build more and more logic into it.

CREATE PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

SELECT
	1.0 / @Divisor AS Quotient;
GO

This is very simple code. If you run it like so, you’ll get the expected result:

EXEC dbo.GetFraction
	@Divisor = 5;

But what happens if somebody tries to divide by 0?

EXEC dbo.GetFraction
	@Divisor = 0;

DivideByZero

We get an error, just like we would expect.  The problem here is that nobody whose job it is to maintain the system has knowledge of this error.  We don’t handle it anywhere and are relying upon the calling application or end user to have enough information to troubleshoot this issue.  That might be okay in a simplistic example like this, but in a real development scenario, the error might be due to interdependencies between several parameters and a specific data set; re-generating that scenario could waste a lot of the troubleshooter’s time.

With that in mind, let’s introduce TRY-CATCH logic to the procedure.

IF (OBJECT_ID('dbo.GetFraction') IS NULL)
BEGIN
	EXEC ('CREATE PROCEDURE dbo.GetFraction AS SELECT 1 AS Stub;');
END
GO

ALTER PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

BEGIN TRY
	SELECT
		1.0 / @Divisor AS Quotient;
END TRY
BEGIN CATCH
	DECLARE
		@ErrorMessage NVARCHAR(4000);

	SELECT
		@ErrorMessage = ERROR_MESSAGE();

	THROW 50000, @ErrorMessage, 1;
END CATCH
GO

With that in place, we call the procedure with a divisor of 0 and…well, the same thing happens.  To this point, we haven’t really done anything special.  What we have done, however, is build the foundation for the next step. As a quick note before I move on, I want to point out that I re-throw the error in the catch block. This is important because we still want the application to fail like before; our goal here is to gain insight into failures more than trying to handle all errors. You can, of course, handle your errors and take care of them if you know the appropriate response, but for this scenario, we just want to log them.

Logging Errors

Until you catch your errors, you can’t log them.  Now that we’re catching errors in this procedure, we can log relevant information each time the procedure throws an exception.

So why don’t we just log whatever gets sent back to the application using a framework like ELMAH?  I hope you are doing this!  But remember that your data environment is more than just procedures the application calls.  You probably have SQL Agent jobs, SSIS packages, SSRS reports, other applications, etc. all hitting that database.  Yes, there are ways of logging calls for each of those, but now you need to look at a number of different sources to determine if you have any unbeknownst-to-you errors popping up.  Also, these tools have wildly different levels of granularity and value in the information they log.  SQL Agent job history keeps only a fixed amount of output per step, meaning that it truncates error data.  This makes it much harder to troubleshoot a long-running job.  Also, relatively few options log the full call with all parameters and values, meaning that you may know what the error is, but not necessarily the set of steps required to replicate this error.  With complex stored procedures, simply knowing that there was an error isn’t always enough to replicate it; you need parameters.  One final problem is that even if these loggers do collect information, they don’t really know the SQL “call stack.”  The application code doesn’t know that procedure1 called procedure2 which called procedure3 which called … procedureN.  We know that procedure1 was the initial call and the error message typically tells us that procedureN failed, but if you have that level of complexity in stored procedure calls, troubleshooting becomes tedious.

With all of those problems in mind, let’s try to come up with a simplistic solution.  This is a scaled-down version of something we use at the office:

CREATE TABLE [dbo].[ErrorLog]
(
	[DatabaseName] [NVARCHAR](128) NULL,
	[SchemaName] [NVARCHAR](128) NULL,
	[ProcedureName] [NVARCHAR](128) NULL,
	[ErrorLine] [INT] NULL,
	[ErrorMessage] [NVARCHAR](4000) NULL,
	[LogDateGMT] [DATETIME] NOT NULL,
	[ParameterList] [XML] NULL
);

With this table in mind, let’s fix up the CATCH block so that we can insert a row into the error log. Now our procedure looks a bit like this:

IF (OBJECT_ID('dbo.GetFraction') IS NULL)
BEGIN
	EXEC ('CREATE PROCEDURE dbo.GetFraction AS SELECT 1 AS Stub;');
END
GO

ALTER PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

BEGIN TRY
	SELECT
		1.0 / @Divisor AS Quotient;
END TRY
BEGIN CATCH
	DECLARE
		@DatabaseName NVARCHAR(128),
		@CallingProcedureName NVARCHAR(128),
		@CallingSchemaName NVARCHAR(128),
		@ErrorLine INT,
		@ErrorMessage NVARCHAR(4000),
		@ParameterList XML,
		@ErrorNumber INT;

	SELECT
		@DatabaseName = DB_NAME(),
		@CallingProcedureName = OBJECT_NAME(@@PROCID),
		@CallingSchemaName = OBJECT_SCHEMA_NAME(@@PROCID),
		@ErrorLine = ERROR_LINE(),
		@ErrorMessage = ERROR_MESSAGE(),
		@ErrorNumber = ERROR_NUMBER();

	SET @ParameterList =
	(
		SELECT
			@Divisor AS [@Divisor]
		FOR XML PATH ('ParameterList'), ROOT ('Root'), ELEMENTS XSINIL
	);

	INSERT INTO dbo.ErrorLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		ErrorLine,
		ErrorMessage,
		LogDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@ErrorLine,
		@ErrorMessage,
		GETUTCDATE(),
		@ParameterList
	);

	THROW 50000, @ErrorMessage, 1;
END CATCH
GO

That is a pretty lengthy CATCH block, and I would recommend creating a utility stored procedure to simplify some of this logic (I sketch one out at the end of this section). In any event, when we call GetFraction with a valid divisor, we still get the same results back. When we call it with a divisor of 0, we get the following error message:

DivideByZeroLogged

Note that this error message looks the same as our old one, except that we have a line which reads (1 row(s) affected) right above it. That shows us that we inserted a record into the ErrorLog table:

ErrorLogInsert

This tells us when we had an error, what the error was, and if you click on ParameterList, you’ll even get a listing of which parameters were passed in:

<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <ParameterList Divisor="0" />
</Root>

There’s a little work translating parameters back to the relevant T-SQL, but what’s nice is that you have everything you need to re-create this error message.

Before I move on, I want to note one more thing.  I collect the error number in @ErrorNumber but I don’t use it in the THROW.  The reason I don’t re-throw using that error number is that if I do, I will get the following error:

ThrowOutOfRange

If you want to re-raise an error using THROW, you can either use a parameterless “THROW;” (which re-raises the original error with its original number) or re-raise with a custom error number of 50000 or higher.  If your calling code expects to handle certain types of errors by error number, then you should use the parameterless form so the original number comes through.  Otherwise, it doesn’t make that much of a difference.
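
As suggested above, a utility procedure can soak up most of that CATCH block.  Here is a minimal sketch; the dbo.LogError name is made up, but it writes to the same dbo.ErrorLog table:

CREATE PROCEDURE dbo.LogError
	@DatabaseName NVARCHAR(128),
	@SchemaName NVARCHAR(128),
	@ProcedureName NVARCHAR(128),
	@ErrorLine INT,
	@ErrorMessage NVARCHAR(4000),
	@ParameterList XML
AS

INSERT INTO dbo.ErrorLog
(
	DatabaseName,
	SchemaName,
	ProcedureName,
	ErrorLine,
	ErrorMessage,
	LogDateGMT,
	ParameterList
)
VALUES
(
	@DatabaseName,
	@SchemaName,
	@ProcedureName,
	@ErrorLine,
	@ErrorMessage,
	GETUTCDATE(),
	@ParameterList
);
GO

With that in place, each CATCH block shrinks down to gathering the ERROR_* values, one EXEC dbo.LogError call, and the THROW.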

Capture Performance Data

You probably have some sort of administrative tool which tracks procedure performance.  If so and if that works well enough for you, then you don’t need to do much here.  There may be circumstances in which you want to get finer-grained performance data, even if you have a top-level tool which tracks procedure times.  For example, suppose you have a procedure which generates a number of different dynamic SQL statements.  Suppose that we can generate a query for the sales team, a separate query for the marketing team, and a third query for the accounts payable team based on our data.  Each of those teams has sorting and filtering functionality, so the Sales query has the same core components but might not always look the same from run to run, especially the WHERE and ORDER BY clauses.  Our generator knows which query it will generate, so we can build performance collection metrics into our system by query, letting us figure out which of the three queries is actually a problem.

For this example, I’m going to stick with my simplistic division case.  We now want to introduce performance logging.  To do that, I’m going to track when the procedure begins and when it ends.  If the procedure fails, I don’t want to log the amount of time it took. If you do want to log performance of errors, you’d need to modify your CATCH block somewhat and it would make the procedure a little more complex.

To track this new performance data, we’ll create a new table called PerformanceLog:

CREATE TABLE [dbo].[PerformanceLog]
(
	[DatabaseName] [NVARCHAR](128) NULL,
	[SchemaName] [NVARCHAR](128) NULL,
	[ProcedureName] [NVARCHAR](128) NULL,
	[StartDateGMT] DATETIME2(7) NOT NULL,
	[EndDateGMT] DATETIME2(7) NOT NULL,
	[ParameterList] [XML] NULL,
	RunTimeInMilliseconds AS DATEDIFF(MILLISECOND, StartDateGMT, EndDateGMT)
);

Note that several of the columns in PerformanceLog are the same as in ErrorLog. Thus, when I refactor my procedure to include performance information, I will move the matching variables out to a common area:

IF (OBJECT_ID('dbo.GetFraction') IS NULL)
BEGIN
	EXEC ('CREATE PROCEDURE dbo.GetFraction AS SELECT 1 AS Stub;');
END
GO

ALTER PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

DECLARE
	@StartDateGMT DATETIME2(7) = SYSUTCDATETIME(),
	@DatabaseName NVARCHAR(128) = DB_NAME(),
	@CallingProcedureName NVARCHAR(128) = OBJECT_NAME(@@PROCID),
	@CallingSchemaName NVARCHAR(128) = OBJECT_SCHEMA_NAME(@@PROCID),
	@ParameterList XML;

BEGIN TRY
	SET @ParameterList =
	(
		SELECT
			@Divisor AS [@Divisor]
		FOR XML PATH ('ParameterList'), ROOT ('Root'), ELEMENTS XSINIL
	);

	SELECT
		1.0 / @Divisor AS Quotient;

	INSERT INTO dbo.PerformanceLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		StartDateGMT,
		EndDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@StartDateGMT,
		SYSUTCDATETIME(),
		@ParameterList
	);
END TRY
BEGIN CATCH
	DECLARE
		@ErrorLine INT,
		@ErrorMessage NVARCHAR(4000),
		@ErrorNumber INT;

	SELECT
		@ErrorLine = ERROR_LINE(),
		@ErrorMessage = ERROR_MESSAGE(),
		@ErrorNumber = ERROR_NUMBER();

	INSERT INTO dbo.ErrorLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		ErrorLine,
		ErrorMessage,
		LogDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@ErrorLine,
		@ErrorMessage,
		GETUTCDATE(),
		@ParameterList
	);

	THROW 50000, @ErrorMessage, 1;
END CATCH
GO

Let’s walk through the code changes here.  First of all, I moved some of the variable declarations outside of the CATCH block.  I also added in a new variable called @StartDateGMT, which I declared as DATETIME2(7) in order to get maximum precision.  If you want to use a DATETIME2(7), you shouldn’t use GETUTCDATE() anymore because it only returns DATETIME precision.  Here’s a sample of the difference, where the first two records used GETUTCDATE() and the third used SYSUTCDATETIME():

GetUTCDate

Remember that DATETIME is only accurate to roughly 3 milliseconds (values round to increments of .000, .003, or .007 seconds), so if your procedure takes less than that to run, you can’t store the difference.
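
If you want to see the difference on your own instance, a quick side-by-side comparison shows how much detail GETUTCDATE() drops:

SELECT
	GETUTCDATE() AS FromGetUtcDate,			--DATETIME, rounded to roughly 1/300 of a second
	SYSUTCDATETIME() AS FromSysUtcDateTime;	--DATETIME2(7), 100-nanosecond precision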

Setting the choice of time measure aside, the other big change was my insertion into dbo.PerformanceLog.  This is another case where creating a helper stored procedure might be worth it, but to keep the demo simpler, I decided to leave that insert statement inline.

Now that we have error and performance tracking, let’s move on to the last section.

Handling Transactions

The above sections on using TRY-CATCH, logging errors, and logging performance are all right out of the handbook and apply to languages like C# or Java as well.  But here’s something which is more SQL-specific:  be sure to handle your transactions.

In this scenario, we aren’t opening any explicit transactions.  Our statement is a simple SELECT statement which does not modify data.  Nevertheless, let’s say that we really do want transactions.  Here’s the final version of the procedure:

IF (OBJECT_ID('dbo.GetFraction') IS NULL)
BEGIN
	EXEC ('CREATE PROCEDURE dbo.GetFraction AS SELECT 1 AS Stub;');
END
GO

ALTER PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

DECLARE
	@InNestedTransaction BIT;

DECLARE
	@StartDateGMT DATETIME2(7) = SYSUTCDATETIME(),
	@DatabaseName NVARCHAR(128) = DB_NAME(),
	@CallingProcedureName NVARCHAR(128) = OBJECT_NAME(@@PROCID),
	@CallingSchemaName NVARCHAR(128) = OBJECT_SCHEMA_NAME(@@PROCID),
	@ParameterList XML;

BEGIN TRY
	SET @ParameterList =
	(
		SELECT
			@Divisor AS [@Divisor]
		FOR XML PATH ('ParameterList'), ROOT ('Root'), ELEMENTS XSINIL
	);

	IF ( @@TRANCOUNT > 0 )
	BEGIN
		SET @InNestedTransaction = 1;
	END
	ELSE
	BEGIN
		SET @InNestedTransaction = 0;
		BEGIN TRANSACTION;
	END;

	SELECT
		1.0 / @Divisor AS Quotient;

	INSERT INTO dbo.PerformanceLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		StartDateGMT,
		EndDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@StartDateGMT,
		SYSUTCDATETIME(),
		@ParameterList
	);

	IF ( @InNestedTransaction = 0 AND @@TRANCOUNT > 0 )
	BEGIN
		COMMIT TRANSACTION;
	END;
END TRY
BEGIN CATCH
	IF ( @InNestedTransaction = 0 AND @@TRANCOUNT > 0 )
	BEGIN
		ROLLBACK TRANSACTION;
	END

	DECLARE
		@ErrorLine INT,
		@ErrorMessage NVARCHAR(4000),
		@ErrorNumber INT;

	SELECT
		@ErrorLine = ERROR_LINE(),
		@ErrorMessage = ERROR_MESSAGE(),
		@ErrorNumber = ERROR_NUMBER();

	INSERT INTO dbo.ErrorLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		ErrorLine,
		ErrorMessage,
		LogDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@ErrorLine,
		@ErrorMessage,
		GETUTCDATE(),
		@ParameterList
	);

	THROW 50000, @ErrorMessage, 1;
END CATCH
GO

The big change was the addition of the @InNestedTransaction variable at the top of the procedure and its use throughout.  If you are already inside a transaction, you don’t want to go any deeper; nested transactions in SQL Server are broken (an inner COMMIT only decrements @@TRANCOUNT, while any ROLLBACK rolls back everything).

But why would we already be in a transaction?  The answer is:  parent procedures and processes.  GetFraction might be the third procedure in a chain, and if I have a chain of procedures, I don’t want to commit my transaction until everything has succeeded.  Putting it another way, suppose Procedure1 calls Procedure2 and Procedure3.  If Procedure2 succeeds and Procedure3 fails, I want to roll back everything—I don’t want to leave data in a state in which some of the information is new (because Procedure2 ran) but some of it is old (because Procedure3 was rolled back).
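
Here is a rough sketch of what such a parent procedure might look like; Procedure1 through Procedure3 are placeholders rather than anything from this post:

CREATE PROCEDURE dbo.Procedure1
AS

BEGIN TRY
	BEGIN TRANSACTION;

	EXEC dbo.Procedure2;
	EXEC dbo.Procedure3;  --If this fails, the work Procedure2 did rolls back too.

	COMMIT TRANSACTION;
END TRY
BEGIN CATCH
	IF ( @@TRANCOUNT > 0 )
	BEGIN
		ROLLBACK TRANSACTION;
	END;

	THROW;
END CATCH
GO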

Note that my PerformanceLog insert statement is inside the transaction.  You can argue either way about this one, but my general tendency is to make sure all INSERT statements are wrapped inside a transaction.  The other side of this argument is that you wouldn’t want your procedure to fail simply because logging failed.  If you are insistent upon this point, you could take that INSERT statement out of the transaction and wrap the insertion statement in a TRY-CATCH block which swallows errors.  Just be sure to leave a note explaining why you’re swallowing errors so that whoever needs to maintain this procedure doesn’t curse your name later when they see no data for performance stats.

Wrapping Up

Our seven-line procedure (including some generous spacing) has ballooned up to 105 lines as a result of all of these changes.  Yeah, you can probably shave a couple dozen lines off with utility procedures, but we added a lot of boilerplate.  On the other hand, we now have error logging, performance logging, and transaction safety.  We definitely want error logging and (when modifying data) transaction safety, so those are certainly worthy endeavors.  Performance logging is a little more iffy, particularly if you have a solution in place which returns these numbers, but it can be helpful as well.

Modernize Your Code

We’ve covered working with legacy SQL in some depth thus far.  Today’s topic is a reminder that you should modernize your code.  There are some things which we needed to do when working with SQL Server 2000 that no longer make sense when working with SQL Server 2014.  For example, learn how to use window functions (introduced in 2005 but really improved in 2012), the APPLY operator (introduced in 2005), the CONCAT function (introduced in 2012), TRY_CONVERT (introduced in 2012), OFFSET/FETCH (introduced in 2012), and so on.  Each of these solves long-standing problems (OFFSET/FETCH perhaps less than I’d like…), so as you maintain your old code, bring it up to date with modern practices.
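
As a couple of quick illustrations (both assume SQL Server 2012 or later), TRY_CONVERT returns NULL instead of raising an error on a failed conversion, and OFFSET/FETCH pages a result set without the old ROW_NUMBER() wrapper:

SELECT
	TRY_CONVERT(INT, '123') AS ValidConversion,		--123
	TRY_CONVERT(INT, 'abc') AS InvalidConversion;	--NULL

SELECT
	o.name
FROM sys.objects o
ORDER BY
	o.name
OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;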

JDate:  A Case Study

Let me walk through a case study of where modernization really helps.  An attribute named JDate has been my bête noire for the past year.  JDate is a modified Julian date which is stored in our system as:

SELECT
	CONVERT(INTEGER, DATEADD(HH, -12, GETUTCDATE()));

The impetus for this was that, back when this code was created, there was no DATE type.  Therefore, using this modified Julian date would save four bytes and allow for quick conversions like “JDate – 5” to indicate 5 days ago.  I would argue that this was not a good design to begin with—making assumptions about how exactly a DATETIME value is stored is sketchy—but we’ll say that it was a reasonable decision at the time.
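
For reference, going the other direction (from a JDate back to a real date) looks something like the following, assuming (as the CONVERT above implies) that the integer counts days from 1900-01-01:

DECLARE
	@JDate INT = 42021;	--January 19, 2015 in this scheme

SELECT
	DATEADD(DAY, @JDate, CAST('19000101' AS DATE)) AS RealDate;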

Today, however, this is not a reasonable design.  There are several problems with using JDate:

  1. The value is incomprehensible.  The JDate value for January 19, 2015 is 42021.  Unlike a well-specified date type, this JDate is essentially a surrogate key, which means that people need at least one additional step to understand the actual date in question.
  2. This design leads to extra steps which have a performance cost.  Our code base is riddled with conversions from real dates to JDates and from JDates back to real dates.  This causes a minor database performance hit and a larger developer performance hit as it takes longer to read and understand the code.
  3. Since SQL Server 2008, we’ve had the DATE type.  DATE types are 3 bytes, meaning that they take even less space than an INT (4 bytes) or a classic DATETIME (8 bytes).  They also support DATEADD(), which means we can easily get DATEADD(DAY, -5, @SomeDate).  We can even save DATEADD(DAY, -5, GETUTCDATE()) as a DATE type.
  4. “JDate” is used as an attribute on several tables, but it is as semantically meaningful as naming a date column “Date” or an integer field “Int.”  We don’t really know which date this represents, and that makes understanding attributes on a table a little bit harder.

The solution here is to fix the design.  This is a pretty large change, but it boils down to simplifying individual procedures and smaller processes.  In my case, I’ve come up with a design document to populate various DATE columns (named appropriately) with the current JDate information, modify procedures to support the date columns instead of JDate, and update calling C# code to support the date columns.  This isn’t an overnight change, but the net result is that I clean up a long-standing piece of technical debt and get an opportunity to refactor a substantial portion of the code base.

Modernization On The Small

Modernization doesn’t have to be a series of gigantic changes, however; sometimes, small changes can make a big difference in readability.  Think about how we got the date without time in SQL Server prior to 2008:

SELECT
	DATEADD(DAY, 0, DATEDIFF(DAY, 0, GETUTCDATE()));

This code works, but it isn’t quite intuitive. Instead, with SQL Server 2008, we can do the following:

SELECT
	CAST(GETUTCDATE() AS DATE);

This code is easier to read and more intuitive.  It’s a simple piece of code, but makes your life as a maintenance developer easier.  Similarly, the CONCAT function simplifies building strings:

SELECT
	CONCAT
	(
		'This is a string.  ',
		NULL,
		12,
		'   .....   ',
		31.884,
		'Some other string.  '
	);

Old-style code would have looked like:

SELECT
	'This is a string.  ' +
	ISNULL(NULL, '') +
	CAST(12 AS VARCHAR(50)) +
	'   .....   ' +
	CAST(31.884 AS VARCHAR(50)) +
	'Some other string.  ';

And that’s the simpler version. Change all of these bits to variables and you’ll need to ISNULL() and CAST() even more.

To wrap up this section, I want to reiterate that you should keep your code base modern.  Understand new features as they come out and see if they can make your life easier.  This also means being willing to question your code’s assumptions and see if a particular design still makes sense.  In the case of JDate, we have a design which no longer makes sense and which is worth changing.  In other cases, it might be worth a comment explaining why you pick (or stick with) one particular design over something which may appear to be a better design at first glance.


Finding And Removing Obsolete SQL

We’re getting close to the home stretch in my Working Effectively with Legacy SQL series.  Today’s post is all about obsolete SQL.  Nothing kills a project more than oodles of unused code filling up the code base.  When I started working with my current team, I was put in charge of approximately 800 stored procedures.  Within two months, that number dropped to 650 simply because I removed obsolete stored procedures.  I also eliminated approximately 25 functions and 30 views.  Today’s code base has 19 functions (of which at least three are on my hit list) and 20 views, so we’re talking about a 50% decrease in number of objects with no loss in functionality.

This is one form of finding and removing obsolete SQL.  Those 205 objects I deprecated mean 205 objects I no longer need to maintain and which my successor(s) will never need to learn.  We’ll talk about how to find obsolete objects in just a little bit.  First, let’s talk about the other type of obsolete code:  obsolete code within objects.

There are a few ways in which obsolete code manifests itself within an object.  For the purposes of demonstration, let’s focus on stored procedures here, even though we could find the same within table-valued functions, views, and even tables.  To enumerate various ways in which we can find obsolete SQL within a single stored procedure, I’m going to use the following example:

CREATE PROCEDURE dbo.TestProcedure
@SomeValue INT = 25, --Code always passes in 25
@Person VARCHAR(30) = 'Bob'
AS

IF (@SomeValue <> 25)
BEGIN
	THROW 50000, 'Value must be 25!', 1;
END

IF (@SomeValue = 20)
BEGIN
	SELECT 'OK' AS Result;
END

IF (@SomeValue = 21)
BEGIN
	SELECT 'OK' AS Result;
END

SELECT
	@Person AS Person,
	--'Something' AS Result
	@SomeValue AS Value;

--IF (@Person = 'Melinda')
--BEGIN
--	DECLARE
--		@var1 INT,
--		@var2 INT,
--		@var3 INT;

---- This code needs to run every time!!!
--	INSERT INTO dbo.SomeTable
--	(
--		val
--	)
--	VALUES
--	(
--		@SomeValue
--	);

--	SELECT
--		@var1 = v1,
--		@var2 = v2,
--		@var3 = v3
--	FROM Someplace
--	WHERE
--		Something = 29;
--END

IF (1 = 0)
BEGIN
	THROW 50000, 'Boolean logic has been invalidated!', 1;
END

SELECT
	1
FROM Something
WHERE
	1 = 0;

In this simple example, we have four separate sections of obsolete code.  The first section is the pair of IF blocks checking whether @SomeValue is 20 or 21.  @SomeValue will never be 20 or 21 because the very first check throws an exception and leaves the procedure whenever @SomeValue is anything other than 25.  You generally see this logical impossibility scenario in older procedures, when the first developer had special cases for values 20 and 21.  Eventually, a requirement change made it so that the value must be equal to 25.  Instead of cleaning up the procedure, the developer making this change (often a different developer than the one who initially wrote the procedure) is “being careful” and simply adds a check on top.  This leads to unnecessary checks and a minor performance degradation, but more importantly, it makes the code harder to read.  When refactoring a procedure, be on the lookout for logical impossibilities and remove them.

The next bit of obsolete code is the commented-out Result column in the SELECT statement.  I see this in a lot of code, where developers are afraid to remove a column because it might be useful in the future.  The big problem here is, if there’s a bug in the procedure, I don’t know if your commented-out line is the bug or is orthogonal to the actual bug.  For all I know, you were doing some testing and accidentally forgot to uncomment that line, and it takes me extra time to try to unravel the problem.  Delete commented-out lines.

The commented-out block checking @Person is an extended example of the previous problem.  In this case, we have an entire segment of code commented out.  For bonus effect, it contains a comment saying that this code needs to run every time…and yet it won’t.  So which is true, the comment or the commented-out code?  Again, if you’re dealing with a bug in this procedure, commented-out code leads to more confusion than necessary.

The last two statements are special cases of the logical impossibility.  The first has an impossible IF condition, and the second has an impossible condition in the WHERE clause.  Most of the time, you’ll see these when requirements change.  Maybe “IF (1 = 0)” used to read “IF (@SomeVal > 23)” but that requirement changed over time.  Again, this hurts a lot more than it helps; delete it.

So here’s the final version of the query, removing all obsolete code:

CREATE PROCEDURE dbo.TestProcedure
@SomeValue INT = 25, --Code always passes in 25
@Person VARCHAR(30) = 'Bob'
AS

IF (@SomeValue <> 25)
BEGIN
	THROW 50000, 'Value must be 25!', 1;
END

SELECT
	@Person AS Person,
	@SomeValue AS Value;

This version of the procedure is a lot more maintainable than the prior version.  Yes, this is a toy example, but when you’re dealing with a 4000-line monstrosity, eliminating a few hundred lines of obsolete SQL still makes a pretty big difference.

Given that, why do people hoard code?  It all comes down to risk aversion and fear of (or lack of) source control.  Hypothetically, that commented-out code could become useful again someday, so if you need it again, it’s better to keep it around.  Well, hypothetically, that newspaper from 1978 could come in handy again someday, but the likelihood of this being true is slim; “could come in handy” isn’t a good enough reason to hoard every copy of the newspaper going back 40 years, and it isn’t a good enough reason to keep commented code around.

But here’s the better point:  even if you know you need it again someday, you still don’t need to comment out your code.  Use source control!  I understand that there are plenty of organizations which don’t use source control for their SQL code—I worked at one where the DBA team pushed for this but the web dev team (which wrote a lot of these procedures) didn’t want to use source control because reasons.  These organizations are doing it wrong.  There are zero reasons why you should not use source control for database code.  If you believe there is a reason, either you are wrong or you are doing it wrong.  You don’t need to use Git or Mercurial (although they’re nice); even if you’re using SVN or TFVC, that does the job.  Every single source control tool has the ability to look back in history, so if you ever need that code again, you can get it.  If your company doesn’t use source control, you can always get Git or Mercurial and use that locally—that’s one benefit to a distributed version control system.

So, moving on from obsolete code within a stored procedure, let’s look at finding obsolete stored procedures.  This is a little bit harder of a problem, but we have some tools available to us to help out.  Within SQL Server, we can look at the plan and procedure caches as indicators of a procedure being in use.  These aren’t perfect, but if a procedure shows up on the list, at least we know it has been used recently enough to be in the procedure or plan cache.
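
For example, a quick check against the procedure stats DMV (available since SQL Server 2008) lists procedures which have executed since their plans entered the cache; absence from this list is a hint, not proof, that a procedure is unused:

SELECT
	OBJECT_SCHEMA_NAME(ps.object_id, ps.database_id) AS SchemaName,
	OBJECT_NAME(ps.object_id, ps.database_id) AS ProcedureName,
	ps.execution_count,
	ps.cached_time,
	ps.last_execution_time
FROM sys.dm_exec_procedure_stats ps
WHERE
	ps.database_id = DB_ID()
ORDER BY
	ps.last_execution_time DESC;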

Aside from that, check out source control.  We use a tool called OpenGrok to search.  It provides a URL-based interface for searching within a code base, and you can look for specific stored procedure calls in your data layer.  Alternatively, you could try using a grep-like tool (e.g., egrep or PowerShell’s Select-String) to search within a directory of code files for that procedure call.  The same goes for Integration Services packages and Reporting Services reports—those are XML files and you can search inside them for procedure calls.

If you’re looking at scheduled tasks, you can check SQL Agent jobs pretty easily:  msdb.dbo.sysjobs and msdb.dbo.sysjobsteps will contain information, so a quick search against those tables would return your answer.  If you have a third-party scheduler, it probably has an interface for searches.
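
A search like the following covers SQL Agent job steps; GetFraction here is just a stand-in for whatever procedure you are researching:

SELECT
	j.name AS JobName,
	js.step_id,
	js.step_name
FROM msdb.dbo.sysjobs j
	INNER JOIN msdb.dbo.sysjobsteps js
		ON j.job_id = js.job_id
WHERE
	js.command LIKE N'%GetFraction%';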

Naturally, objects can call other objects.  In SQL Server, sys.sql_modules has a definition attribute which contains the definition of an object.  You can search through those definitions to see if your stored procedure call exists.  You may also have metadata tables which contain stored procedures to call, so querying those will help.
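
The module search follows the same pattern, again with GetFraction as a placeholder:

SELECT
	OBJECT_SCHEMA_NAME(sm.object_id) AS SchemaName,
	OBJECT_NAME(sm.object_id) AS ObjectName
FROM sys.sql_modules sm
WHERE
	sm.definition LIKE N'%GetFraction%';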

I’m being purposefully vague on this topic because I’m working on a tool to help people search their code bases for potentially obsolete code.  The problem is that you can never absolutely prove that a procedure is not in use, but given enough checks, you can have a higher level of confidence and can get rid of a lot of unused procedures.  In my above example, we dropped over 150 procedures, and I only had to bring back one procedure which people occasionally run (manually) to fix data but which hadn’t been run in a few months.  Don’t be afraid to research and drop unused procedures; it makes your life easier, and if you ever make a mistake, pull that procedure back out from source control.

Simplify Your Design

Yesterday’s post was all about simplifying individual procedures.  With today’s post, I’m going to bring it up one level and talk about inter-procedural design.  For this topic, think about how you might re-design a class in an object-oriented language.  In a class, you have private, protected, and public methods and members.  The public methods and members form a contract and it’s harder to refactor that contract because you then need to check every bit of code which calls that class.  For private and protected methods and members (and the internal how-does-it-work bits of public methods), however, you have free rein to refactor code.  In the database world, our tables are akin to private members, stored procedures which are only called internally (i.e., by other stored procedures) are akin to private methods, and stored procedures which are called by external code are sort of like public methods.  With this in mind, I’m going to dive into the topic of refactoring an entire design.

To understand this topic better, I’m going to use as a case study a recent change I made to one table and its adjoining stored procedures.  I think the easiest way to describe the initial design starts with a picture, so here goes.  Note that I’ve removed some attributes and anonymized/modified it a little bit to protect the innocent:

CostLog

As a quick note, dotted lines represent stored procedures which update the underlying table.  Using my original metaphor, here’s the breakdown of elements in this original design:

  • CostLog — Table (private member)
  • Insert — Stored procedure (private method)
  • CreateWork — Stored procedure (public method)
  • GetForService — Stored procedure (public method) — this procedure gets called from within CreateWork, but there are also schedules which call this procedure directly.
  • Update — Stored procedure (public method)
  • GetRecentActivity — Stored procedure (public method)
  • Get — Stored procedure (private method)
  • GetByDate — Stored procedure (public method)
  • GetLastUpdate — Stored procedure (public method)

The reason I wanted to elaborate on this metaphor is to note that it’s relatively easy to change the “private method” style procedures, but a lot harder to change the “public” ones, as this would involve changing C# or website code somewhere.

Let me walk through the pain points in this design.  You can probably see some of them just from the diagram, but I want to enumerate them all.

  1. The table is named CostLog, but is actually used for two purposes.  In some sense, it actually is a log:  it acts as a storage table of record for cost processing information over time.  On the other hand, it is also a queue table, as new cost log records go in and get updated as the records go through different steps in the process.  This leads to naming confusion:  it is a log and it isn’t a log.  The inability to come up with a good name for an entity is a code smell that should tell us that the entity may be trying to do too much and we should look at splitting out individual concerns.
  2. All “Get” procedures should be idempotent, but GetForService is not.  An idempotent procedure is one whose outputs will always be the same so long as no external force changes the inputs.  The GetForService procedure updates rows in the CostLog table and if you called GetForService twice, the update statement would actually preclude results from showing up in the second call.
  3. The Get procedure is never used.  There is obsolete code we can eliminate.
  4. The Insert procedure gets called from the CreateWork procedure, but as a weird edge case when an external process is making the call and needs to re-queue work.  It would be nice to simplify the design and eliminate this procedure, or at least make it an independent procedure call.
  5. We have a “garbage collection” procedure which doesn’t really garbage collect.  It cancels specific queued records after a certain number of days.  This kind of design feels a bit hacky; instead of having an external service tell us when to stop attempting to process, the service code which has intelligence on the matter should know when to stop processing.
  6. There is a column called “GUID” on the table.  This is a very unhelpful name, as it describes the data type rather than the purpose of the attribute.
  7. There is another column called JDate, which is an integer value and represents a Julian date (of sorts).  I’m going to get into this in more detail in a later part, but I want to replace this with a DATE-typed column called ReportDate.  That makes the data easier to understand and replaces a four-byte integer which acts essentially as a surrogate key for dates with a three-byte, easily-readable date type.
  8. The SendAlerts attribute goes entirely unused, so it’s gone.

So let’s take a moment to talk about how I got to these conclusions.  The misnamed table and attribute names are basic code smells.  If you can’t define the purpose of an entity or attribute based on its name, there is likely a problem with that entity or attribute.  In the case of my entity, the problem is that we’re trying to do too much:  we have a queue table and a log table jammed into one.  A simpler design would split these two tables out and provide interfaces (stored procedures) to interact with these tables.  Remember that “simpler” doesn’t always mean “fewer.”  In this case, a simpler design will end up with more database objects, but each database object will be clearer to understand.  As for the misnamed attribute, I learned after discussing with the software engineers that the GUID attribute actually gets used as a unique filename for the cost report we create.  For this reason, I decided to rename GUID to FileName.

The SendAlerts and JDate attributes are also problematic on CostLog.  As I did research on the procedures calling this table, I noticed that none of them ever mention SendAlerts, meaning that we are setting a value on a table but never using it.  Attributes which serve no purpose should generally be eliminated, and this is no exception.  As for JDate, this is a case in which prolonged exposure breeds contempt.  This attribute exists in a number of tables I maintain and every time I need to know an actual date, I need to convert that value from an integer to a real date.  This is a minor but persistent pain point, and incremental re-design helps solve exactly those types of pains.

With these thoughts in mind, here is my new design:

New CostLog

Before I start, a quick color note:  yellow Get procedures read from the CostQueue table, blue Get procedures read from the CostLog table, and green Get procedures read from both tables.

This new model doesn’t look radically different at first glance, but there are several improvements that I was able to sneak in.  First of all, I split out the tables into CostQueue and CostLog.  The queue table tells us our current step in the process; the log table, meanwhile, tells us the status of this combination of profile, provider, and report date.  Speaking of that combination, I’ve marked those attributes in bold to note that they will form a unique constraint on both tables.  That may seem a little weird for a log table, but we aren’t concerned with storing data on how many times a process was run or the individual steps along the way; we only care about the final status of a particular cost run.

I made the other changes that I mentioned above in the design problems:  JDate and GUID are now ReportDate and FileName, respectively.  This makes records and result sets easier to understand and also cut down a little bit on processing to translate JDates back into real dates.  I also eliminated the SendAlerts attribute.

Now let’s look at the procedure changes.  The first change is that all Get procedures are idempotent.  Instead of having two procedures (including a Get procedure) to create and update log records, I now have four procedures to handle the queue.  GetNewRecords reads from CostLog and CostQueue to figure out which records should go on the queue, but it does not insert any rows.  Instead, the CreateWork procedure inserts rows onto CostQueue.  Then, the scheduler code will call the GetWork procedure to retrieve work from the queue.  Finally, there is an Insert procedure in the event that some process needs to re-queue work automatically—which does happen from time to time, and which was handled as an edge case in the old CreateWork procedure.

From there, we move on to the C# service code.  That service code will get a set of work items and start working on them.  At each step in the process, the service code will call UpdateStep so the next bit of service code knows what is happening.  Each step in the service code also has the opportunity to merge results into CostLog, updating the Status and RecordCount attributes as necessary.  Once the final step in the process is complete, the service code will call Delete to delete the cost row from the queue and we’re done.  From there, website and monitor code can query the CostLog table like before, thereby limiting the scope of changes somewhat.

With these changes in place, cost processing becomes a lot simpler.  We can troubleshoot queue problems, easily re-queue data as necessary, and test the process much more easily.

Let’s wrap this section up with a higher-level discussion of how we get from point A to point B in terms of research.  For a project of any magnitude—even one which has one table and nine stored procedures—it helps to be organized.  If I had to do this over again, here’s what I would do to get organized:

  1. Get the list of affected tables, procedures, and processes.  At this point, I would just sketch notes in a text editor.
  2. Use draw.io or some other tool to generate a high-level diagram like the ones I have above.  Drawing these diagrams came late in the process and helped show me some flaws in my design.  Start with the current design.
  3. Create a design document for the existing design.  In this case, my design document started with a picture of the design process generated in step 2, followed by a breakdown of each table attribute.  From there, I moved on to each affected stored procedure.  For each procedure, I listed the input parameters, high-level actions, outputs (either descriptively or as an ordered list of attributes, depending upon the scenario), and processes which call that procedure.
  4. Start applying incremental design changes to the paper.  I knew that I wanted to take care of some of these problems, and after each problem, I would go back to the document and think some more about it.  When I came into this design, I knew I wanted to change the JDate, GUID, and SendAlerts attributes on the table, as well as the GetForService stored procedure.  As I started working through those, I found a few other problems, so I wrapped those changes into my design.  Once I was done with that, I went back over the design again and found a couple more issues.
  5. Collaborate.  Even if you are the domain expert, it helps to get opinions from others related to the project.  In this case, most of my discussion happened with software engineers whose code consumed the information and who would assist in troubleshooting incoming tickets related to this process.  They helped me realize that “CostLog” was really two separate tables, which simplified operations significantly.
  6. Iterate, iterate, iterate.  Even when you think you’re done, you probably aren’t.  Keep iterating over that design.  One of the late changes I made to the design was to eliminate the “GC” procedure altogether, instead moving that requirement to the service code.  My design would have been fine either way, but I think this improves the design, making one process responsible for dequeueing records and determining when something is done, rather than having several overlapping processes.
  7. Simplify the code.  Note that in this design, I didn’t spend much time talking about actual code.  That doesn’t mean that you should ignore the current code as you change the structure.  Several of the procedures had parts in which I could refactor logic and make processing easier for everybody to understand.  Once you have a good idea of what the top-level design will look like, work out ways to simplify individual procedures within that design.  You can sculpt your higher-level design with these ideas in mind, but I recommend not getting too deep into individual procedures’ problems until you are at a point at which you know that this code will be relevant.  After all, why waste a few hours fixing up code that you’re going to deprecate or whose underlying table structure will be completely different?

This wraps up my section on simplifying code.  Next up, I’m going to talk about obsolete code.

Simplify!

When working with legacy code, the first and safest step (after testing and automating tests) is to format your code to make it readable.  With some procedures—especially the basic CRUD procedures—that will be enough right there and you can pat yourself on the back for a job well done.  But let’s suppose that we have some more complex procedures.  Our next step will be to simplify code.  I’m going to steal a bit from the Zen of Python:

Simple is better than complex.
Complex is better than complicated.

I love this couplet because it is both true and meaningful.  Most of the code I deal with on a daily basis is complicated:  cursors and while loops generate and execute dynamic SQL; chains of stored procedures use temp tables to hold state; unrepeatable (and thus untestable) processes kick off other processes which kick off other processes which set up other, independent processes.  Some of this code was intentionally designed, but a lot of it evolved as business requirements changed over time.  Some developer made a particular decision based on certain business requirements and conditions at one point in time.  A later developer made a decision based on different business requirements and conditions at a later time.  A third developer created some plumbing code to connect the first and second developers’ work together, and a fourth developer changed bits and pieces of it.  Add all of this up and you have a system which nobody understands in its entirety, especially if developers 1-4 are gone and you’re stuck maintaining their work.

But we have an ace in the hole here:  tests and re-formatting.  Building out tests gives us a chance to make implementation changes without breaking our contract:  at the beginning of this process, we accept inputs { X, Y, Z }.  At the end of the process, we generate outputs { A, B, C }.  The important part about these types of tests is that we do not make any assertions about the internal details, about how we get from { X, Y, Z } to { A, B, C }.  This gives us flexibility to re-design.  Similarly, re-formatting procedures helps give us a better understanding of what exactly each procedure in the chain is doing, why it exists, and how it works.  During this time, you probably will have built some unit and integration tests around the individual procedures in the chain—the implementation details—but that’s okay.  In the end, we might get rid of some of these tests, but they help us understand what the code is doing in the meantime.
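
Here is a minimal, framework-free sketch of what such a contract-style test can look like, using the GetFraction procedure from earlier as a stand-in; it asserts only on inputs and outputs, never on how the procedure does its work:

CREATE TABLE #Expected (Quotient DECIMAL(13, 6));
CREATE TABLE #Actual (Quotient DECIMAL(13, 6));

--Known input { @Divisor = 5 } should produce known output { 0.2 }.
INSERT INTO #Expected (Quotient) VALUES (0.2);

INSERT INTO #Actual (Quotient)
EXEC dbo.GetFraction @Divisor = 5;

IF EXISTS (SELECT Quotient FROM #Expected EXCEPT SELECT Quotient FROM #Actual)
	OR EXISTS (SELECT Quotient FROM #Actual EXCEPT SELECT Quotient FROM #Expected)
BEGIN
	THROW 50000, 'dbo.GetFraction output does not match the expected result set.', 1;
END

DROP TABLE #Expected;
DROP TABLE #Actual;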

If you have tests in place and well-formatted code, you can start working your way from complicated to complex to simple.  The best way to do this is design.  I spent some time thinking about how to do design in an agile world, so that series might be of interest at this point.  Regardless, your prior work has already helped you in the quest for simplification, as you should have pretty solid requirements and you can spend time thinking about how best to refactor your code.

Because simplification is a complex topic, I’m going to break this discussion out into two parts.  The first part of this discussion will be simplifying code within a procedure, and the second part will be simplifying an overall design which covers a number of procedures.

Looking at a single procedure, there are a few things we can do to simplify code.  Here are the things I will cover in this section:

  • Using the APPLY operator to simplify calculations
  • Using static SQL unless dynamic SQL is necessary
  • Turning cursors and WHILE loops into set-based statements
  • Simplifying filters
  • Simplifying overall procedure design

Simplify Calculations Using The APPLY Operator

One of my favorite uses of the APPLY operator is simplifying calculations.  Let’s walk through a simplistic but complicated business example and try to turn it into something simple.

Let’s say that we have a fact table with a few measures and we need to calculate a few other measures from our base results.  Our Fact table has the following measures:  ReportDate, Clicks, OrderCount, OrderRevenue, ProductCost, and Cost.  Over a specific ReportDate range, we want to calculate ConversionRate (OrderCount / Clicks), NetMargin (OrderRevenue – [ProductCost + Cost]), and ReturnOnAdSpend (OrderRevenue / Cost).

This sounds pretty simple, but I’m going to throw a wrench into the plans:  first of all, Cost and Clicks can sometimes be 0, meaning that our ConversionRate and ReturnOnAdSpend calculations need safeguards against dividing by 0.  Second, ConversionRate can never be above 100%, so we need a hard limit there.  Third, let’s say that our fact table can have NULL values—it’s a terrible thing, but not as uncommon as I’d like it to be.  Fourth, Cost and Clicks don’t really exist—we have to use ClickSource1 and ClickSource2 as well as CostSource1 and CostSource2 based on whether a bit flag UseSource1 is true or false.  Again, this is bad fact design, but it’s something we often need to deal with.

So let’s gin up some sample data and get cracking:

CREATE TABLE #Fact
(
	ReportDate INT NOT NULL,
	UseSource1 BIT NOT NULL,
	ClickSource1 INT NULL,
	ClickSource2 INT NULL,
	CostSource1 DECIMAL(23, 4) NULL,
	CostSource2 DECIMAL(23, 4) NULL,
	OrderCount INT NULL,
	OrderRevenue DECIMAL(23, 4) NULL,
	ProductCost DECIMAL(23, 4) NULL
);

INSERT INTO dbo.#Fact
(
	ReportDate,
	UseSource1,
	ClickSource1,
	ClickSource2,
	CostSource1,
	CostSource2,
	OrderCount,
	OrderRevenue,
	ProductCost
)
VALUES
(20150101, 1, 25, NULL, 285.86, NULL, 18, 1349.56, 187.39),
(20150102, 0, 25, 6, 285.86, 8.36, 3, 98.72, 75.14),
(20150103, 1, 16, NULL, 28.38, NULL, 1, 9.99, 5.42),
(20150104, 1, NULL, NULL, NULL, NULL, NULL, NULL, NULL),
(20150105, 0, NULL, NULL, NULL, NULL, NULL, NULL, NULL),
(20150106, 1, 108, NULL, 39.80, NULL, 12, 2475.02, 918.60),
(20150107, 0, NULL, 85, NULL, 85.00, 67, 428.77, 206.13);

Note that I have bad data of different sorts in this example, including irrelevant data (like in 20150102 for ClickSource1 and CostSource1) and missing data (20150104 and 20150105).

So now let’s get our query written.  Note that I want to roll up data across all report dates in the range 2015-01-01 through 2015-01-07, so I need aggregations here.  Here is the first go of this result set, which is well-formatted but complicated.

SELECT
	SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.ClickSource1, 0) ELSE ISNULL(f.ClickSource2, 0) END) AS Clicks,
	SUM(ISNULL(f.OrderCount, 0)) AS OrderCount,
	SUM(ISNULL(f.OrderRevenue, 0)) AS OrderRevenue,
	SUM(ISNULL(f.ProductCost, 0)) AS ProductCost,
	SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.CostSource1, 0) ELSE ISNULL(f.CostSource2, 0) END) AS Cost,
	CAST
	(
		CASE
			WHEN SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.ClickSource1, 0) ELSE ISNULL(f.ClickSource2, 0) END) = 0 THEN 0.0
			WHEN SUM(ISNULL(f.OrderCount, 0)) = 0 THEN 0.0
			WHEN SUM(ISNULL(f.OrderCount, 0)) > SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.ClickSource1, 0) ELSE ISNULL(f.ClickSource2, 0) END) THEN 1.0
			ELSE 1.0 * SUM(ISNULL(f.OrderCount, 0)) / SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.ClickSource1, 0) ELSE ISNULL(f.ClickSource2, 0) END)
		END AS DECIMAL(19, 4)
	) AS ConversionRate,
	SUM(ISNULL(f.OrderRevenue, 0) - (ISNULL(f.ProductCost, 0) + CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.CostSource1, 0) ELSE ISNULL(f.CostSource2, 0) END)) AS NetMargin,
	CASE
		WHEN SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.CostSource1, 0) ELSE ISNULL(f.CostSource2, 0) END) = 0 THEN
			CASE
				WHEN SUM(ISNULL(f.OrderRevenue, 0)) = 0 THEN 0.0
				ELSE 1.0
			END
		ELSE SUM(ISNULL(f.OrderRevenue, 0)) / SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.CostSource1, 0) ELSE ISNULL(f.CostSource2, 0) END)
	END AS ReturnOnAdSpend
FROM dbo.#Fact f
WHERE
	f.ReportDate BETWEEN 20150101 AND 20150107;

This is not terrible, but I would consider it to be complicated. Let’s look at how the APPLY operator can simplify things. First of all, I want to migrate all of those ISNULL operations into a separate section and use aliases for ISNULL. It’s a small change but I think it really helps readability. Note that for this particular example, we don’t really benefit from having ISNULL around (because our aggregations already eliminate NULL values), but shifting ISNULL statements is something which I really want to show, given how often it appears in the real world.  Next up, I want to move the Clicks and Cost calculations out into their own section. That way, we can use aliased versions of Clicks and Cost to make our code a bit easier.

Here’s what we end up with as the simplified version:

SELECT
	SUM(us1.Clicks) AS Clicks,
	SUM(nonull.OrderCount) AS OrderCount,
	SUM(nonull.OrderRevenue) AS OrderRevenue,
	SUM(nonull.ProductCost) AS ProductCost,
	SUM(us1.Cost) AS Cost,
	CAST
	(
		CASE
			WHEN SUM(us1.Clicks) = 0 THEN 0.0
			WHEN SUM(nonull.OrderCount) = 0 THEN 0.0
			WHEN SUM(nonull.OrderCount) > SUM(us1.Clicks) THEN 1.0
			ELSE 1.0 * SUM(nonull.OrderCount) / SUM(us1.Clicks)
		END AS DECIMAL(19, 4)
	) AS ConversionRate,
	SUM(nonull.OrderRevenue - (nonull.ProductCost + us1.Cost)) AS NetMargin,
	CASE
		WHEN SUM(us1.Cost) = 0 THEN
			CASE
				WHEN SUM(nonull.OrderRevenue) = 0 THEN 0.0
				ELSE 1.0
			END
		ELSE SUM(nonull.OrderRevenue) / SUM(us1.Cost)
	END AS ReturnOnAdSpend
FROM dbo.#Fact f
	CROSS APPLY
	(
		SELECT
			ISNULL(ClickSource1, 0) AS ClickSource1,
			ISNULL(ClickSource2, 0) AS ClickSource2,
			ISNULL(CostSource1, 0) AS CostSource1,
			ISNULL(CostSource2, 0) AS CostSource2,
			ISNULL(OrderCount, 0) AS OrderCount,
			ISNULL(OrderRevenue, 0) AS OrderRevenue,
			ISNULL(ProductCost, 0) AS ProductCost
	) nonull
	OUTER APPLY
	(
		SELECT
			CASE WHEN f.UseSource1 = 1 THEN nonull.ClickSource1 ELSE nonull.ClickSource2 END AS Clicks,
			CASE WHEN f.UseSource1 = 1 THEN nonull.CostSource1 ELSE nonull.CostSource2 END AS Cost
	) us1
WHERE
	f.ReportDate BETWEEN 20150101 AND 20150107;

This new version has a few more lines of code, but we moved a lot of the “pink and blue” (functions and CASE statement cruft) to two APPLY operators, letting us focus more on the actual attributes in our SELECT statement.  The ConversionRate and ReturnOnAdSpend calculations are still a few lines, but most of that complexity is covering edge cases and is necessary, and with the simplified versions of Clicks and Cost in place, they’re a lot easier to see and understand immediately.

Note that the execution plans for these two queries are almost exactly the same. We haven’t hurt performance or changed the results; we just made the code a lot more understandable.

[Image: APPLYCalculation, showing near-identical execution plans for the original and APPLY-based queries]

This type of operation is easy, gains a lot, and has extremely low risk of failure, especially if you have good tests around the procedure.

Static SQL Over Dynamic SQL

There are times when dynamic SQL is necessary.  Most commonly, these are scenarios in which you do not know in advance what SQL statement should be generated, or in which you want to support dynamic WHERE clauses, column selection, or sorting.  My recommendation here is to make sure that you actually need the statement to be dynamic.  Too many times, I’ve seen a simple procedure wrapped in dynamic SQL for no apparent reason.  There are a few problems with this.  First, dynamic SQL is harder to read due to the lack of syntax highlighting.  Second, dynamic SQL does not get syntax checked the way static SQL does, so it’s easier for odd bugs and typos to sneak in.  Third, formatting dynamic SQL forces you to mess up your code in one of two ways:  either you mess up the static SQL formatting to get the dynamic SQL to look okay, or you mess up the dynamic SQL formatting to get the static SQL to look okay.  Here is a simple example of the former:

DECLARE
	@SomeVar INT = 6; -- input parameter

BEGIN TRY

	DECLARE
		@sql NVARCHAR(MAX) = N'
SELECT
	SomeColumn,
	SomeOtherColumn
FROM SomeTable
WHERE
	1 = 1' + CASE WHEN @SomeVar = 8 THEN N'
	AND Something = 38' ELSE N'' END + N';';

	EXEC (@sql);
END TRY
BEGIN CATCH
	PRINT 'Error!';
END CATCH

In this case, my dynamic SQL is flush left even though the static SQL is indented one to two tabstops. This makes the code harder to read, especially when I could re-write the procedure using static SQL as:

DECLARE
	@SomeVar INT = 6; -- input parameter

BEGIN TRY
	SELECT
		SomeColumn,
		SomeOtherColumn
	FROM SomeTable
	WHERE
		Something = CASE WHEN @SomeVar = 8 THEN 38 ELSE Something END;
END TRY
BEGIN CATCH
	PRINT 'Error!';
END CATCH

This latter code is easier to read. As I mentioned, there are good cases for using dynamic SQL, but if this isn’t a good case, simplify the procedure and remove the dynamic SQL.
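
When dynamic SQL genuinely is warranted, such as letting the caller choose a sort column, parameterized execution through sp_executesql keeps it relatively safe and readable. Here is a minimal sketch under that assumption; the @SortColumn parameter and its whitelist are hypothetical additions for illustration, not part of the original procedure:

DECLARE
	@SomeVar INT = 6,                          -- input parameter
	@SortColumn SYSNAME = N'SomeOtherColumn';  -- hypothetical input parameter

-- Whitelist the identifier so we only ever concatenate known-safe column names.
IF @SortColumn NOT IN (N'SomeColumn', N'SomeOtherColumn')
	SET @SortColumn = N'SomeColumn';

DECLARE
	@sql NVARCHAR(MAX) = N'
SELECT
	SomeColumn,
	SomeOtherColumn
FROM SomeTable
WHERE
	Something = @Something
ORDER BY ' + QUOTENAME(@SortColumn) + N';';

-- Values travel as parameters rather than being concatenated into the string.
EXEC sp_executesql
	@sql,
	N'@Something INT',
	@Something = @SomeVar;

The identifier still has to be concatenated, but QUOTENAME plus a whitelist keeps that part safe, and everything else stays parameterized.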

Turn Cursors And WHILE Loops Into Set-Based Operations

I’m going to keep this short because I think my target audience already gets this idea.  Don’t use cursors or WHILE loops unless you need them.  They’re terrible for performance and hard to comprehend quickly.  I’d rather have a nice set-based statement with a tally table than one or two WHILE loops or cursors.  These constructs also require you to think about a lot of temporary variables and how they interact.  Get rid of those loops and you can simplify your thought processes.
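
As a quick, hedged illustration of the difference, here is a sketch that produces one row per day in a date range, first with a loop and then with an ad hoc tally built from ROW_NUMBER; the table variable and date range are invented for the example:

DECLARE
	@StartDate DATE = '20150101',
	@EndDate DATE = '20150107';

-- Loop-based approach: one INSERT per iteration, plus bookkeeping variables.
DECLARE @DatesLoop TABLE (ReportDate DATE NOT NULL);
DECLARE @CurrentDate DATE = @StartDate;

WHILE @CurrentDate <= @EndDate
BEGIN
	INSERT INTO @DatesLoop (ReportDate) VALUES (@CurrentDate);
	SET @CurrentDate = DATEADD(DAY, 1, @CurrentDate);
END;

-- Set-based approach: generate the same rows in a single statement.
SELECT
	DATEADD(DAY, t.N, @StartDate) AS ReportDate
FROM
(
	SELECT TOP (DATEDIFF(DAY, @StartDate, @EndDate) + 1)
		ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS N
	FROM sys.all_objects
) t
ORDER BY
	ReportDate;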

Simplify Filters

This one may seem a bit silly, but don’t write it off.  Code bases are full of “whoopsies” like the following.  In this case, it’s ripped straight from the headlines of my code base.  This table has a create date, and we want to get the number of records per profile, per hour.  This is a slightly anonymized version of the original code:

SELECT
	SomeTable.ProfileID,
	CASE
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 0 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 00:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 1 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 01:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 2 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 02:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 3 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 03:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 4 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 04:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 5 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 05:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 6 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 06:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 7 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 07:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 8 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 08:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 9 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 09:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 10 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 10:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 11 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 11:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 12 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 12:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 13 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 13:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 14 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 14:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 15 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 15:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 16 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 16:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 17 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 17:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 18 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 18:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 19 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 19:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 20 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 20:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 21 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 21:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 22 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 22:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 23 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 23:00:00' AS DATETIME)
	END AS CreateDateLTz,
	COUNT(1) AS RecordCount
FROM dbo.SomeTable
GROUP BY
	SomeTable.ProfileID,
	CASE
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 0 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 00:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 1 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 01:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 2 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 02:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 3 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 03:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 4 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 04:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 5 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 05:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 6 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 06:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 7 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 07:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 8 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 08:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 9 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 09:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 10 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 10:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 11 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 11:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 12 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 12:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 13 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 13:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 14 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 14:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 15 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 15:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 16 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 16:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 17 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 17:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 18 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 18:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 19 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 19:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 20 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 20:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 21 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 21:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 22 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 22:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 23 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 23:00:00' AS DATETIME)
	END;

This is long and replicates code.  It is complicated when it should be simple, so let’s make it simple:

SELECT
	SomeTable.ProfileID,
	DATEADD(HOUR, DATEDIFF(HOUR, 0, SomeTable.CreateDateLTz), 0) AS CreateDateLTzRoundedToHour,
	COUNT(1) AS RecordCount
FROM dbo.SomeTable
GROUP BY
	SomeTable.ProfileID,
	DATEADD(HOUR, DATEDIFF(HOUR, 0, SomeTable.CreateDateLTz), 0);

The combination of DATEADD and DATEDIFF lets me turn 52 lines of code into 2.  It’s arguably slightly less readable for a complete neophyte, but a quick one-line comment or an appropriate name (like my CreateDateLTzRoundedToHour) makes the result understandable.  It also makes the query more maintainable, as I’m far less likely to make a mistake modifying this eight-line query than a sixty-line query with 48 nearly identical CASE branches.  Finally, it has a nice aesthetic benefit:  we get rid of a wall of code.
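
The same DATEADD/DATEDIFF pattern works at other granularities as well; here is a small sketch (the sample value is made up) showing the idea for hour, day, and month:

DECLARE @SampleDate DATETIME = '2015-01-06 14:37:52';

SELECT
	DATEADD(HOUR, DATEDIFF(HOUR, 0, @SampleDate), 0) AS RoundedToHour,     -- 2015-01-06 14:00:00
	DATEADD(DAY, DATEDIFF(DAY, 0, @SampleDate), 0) AS RoundedToDay,        -- 2015-01-06 00:00:00
	DATEADD(MONTH, DATEDIFF(MONTH, 0, @SampleDate), 0) AS RoundedToMonth;  -- 2015-01-01 00:00:00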

Look for these types of things in your SELECT, WHERE, HAVING, GROUP BY, and ORDER BY clauses.  Also look for duplicated predicates and logically equivalent conditions in your WHERE clauses.
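
To make that last point concrete, here is a hypothetical filter (the ProfileID value and dates are invented) with one duplicated predicate and one logically equivalent predicate, followed by the simplified version:

-- Redundant filter: two of the four predicates add nothing.
SELECT
	COUNT(1) AS RecordCount
FROM dbo.SomeTable
WHERE
	SomeTable.CreateDateLTz >= '20150101'
	AND SomeTable.CreateDateLTz > '20141231'   -- implied by the predicate above
	AND SomeTable.ProfileID = 27
	AND SomeTable.ProfileID IN (27);           -- duplicates the predicate above

-- The same filter, simplified:
SELECT
	COUNT(1) AS RecordCount
FROM dbo.SomeTable
WHERE
	SomeTable.CreateDateLTz >= '20150101'
	AND SomeTable.ProfileID = 27;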

Simplify Overall Procedure Design

This last section will be a quick grab bag of some procedure design tips that I’ve come up with over the years.  I recommend taking a considered approach towards each stored procedure you work on and try to find simpler ways of getting the same outputs given a set of inputs.

First, try to eliminate unnecessary temp tables.  Sometimes temp tables are introduced for performance reasons:  to avoid lazy spools, sort spills caused by bad estimates, and so on.  But if your procedure has a dozen temp tables, with temp tables loading temp tables which load other temp tables, I would suggest that these probably aren’t the solution to your performance problems; they may instead be the cause.  Sometimes people create temp tables because they’re thinking procedurally:  first the data need to be shaped this way, then that way, then another way, and finally the last.  They create a temp table for each step along the way and end up with a final result.  See if you can write a single query to get from beginning to end, and make sure it performs well.  You may still need some temp tables in certain cases, but make those decisions based on performance rather than mental workflow.
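
Here is a hedged sketch of what that collapse can look like.  The dbo.Orders table and its columns are invented for illustration; the point is the shape of the refactoring, not the specific query:

-- Procedural version: shape the data one temp table at a time.
SELECT
	o.CustomerID,
	o.OrderTotal
INTO #RecentOrders
FROM dbo.Orders o
WHERE
	o.OrderDate >= '20150101';

SELECT
	ro.CustomerID,
	SUM(ro.OrderTotal) AS TotalSpend
INTO #CustomerSpend
FROM #RecentOrders ro
GROUP BY
	ro.CustomerID;

SELECT
	cs.CustomerID,
	cs.TotalSpend
FROM #CustomerSpend cs
WHERE
	cs.TotalSpend > 1000.0;

-- Single-statement version: the same shaping expressed as CTEs.
WITH RecentOrders AS
(
	SELECT
		o.CustomerID,
		o.OrderTotal
	FROM dbo.Orders o
	WHERE
		o.OrderDate >= '20150101'
),
CustomerSpend AS
(
	SELECT
		ro.CustomerID,
		SUM(ro.OrderTotal) AS TotalSpend
	FROM RecentOrders ro
	GROUP BY
		ro.CustomerID
)
SELECT
	cs.CustomerID,
	cs.TotalSpend
FROM CustomerSpend cs
WHERE
	cs.TotalSpend > 1000.0;

If the single-statement version turns out to perform worse, that is the time to reintroduce a temp table, and you will know exactly why it is there.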

Second, eliminate unnecessary code.  You have a source control system (right?), so don’t leave that commented-out T-SQL from 8 years ago in your code.  Commented-out code does absolutely nothing for SQL Server and nothing for the programmer.

Third, make sure your variables and aliases are clear.  @Median is much clearer in intent than @m.  If your table is named StockOption, “so” makes for a better alias than “t8.”  Similarly, make sure your attribute names are clear and accurate.  One example which has bitten me in the past was an attribute named “Conversions” which actually held conversion rates.  The number of conversions and the rate of conversion are two totally different things, and mis-naming leads to confusion.  Finally, make sure your ORDER BY clause uses actual attribute or alias names; don’t use ordinals.
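
A small before-and-after sketch for the alias and ORDER BY points; the StockOption columns here are invented for illustration:

-- Harder to follow: opaque alias, ordinal sort.
SELECT
	t8.EmployeeID,
	t8.GrantDate,
	t8.StrikePrice
FROM dbo.StockOption t8
ORDER BY
	2,
	3;

-- Clearer: the alias matches the table and the sort uses column names.
SELECT
	so.EmployeeID,
	so.GrantDate,
	so.StrikePrice
FROM dbo.StockOption so
ORDER BY
	so.GrantDate,
	so.StrikePrice;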

Fourth, take advantage of the language.  Let’s take an example in which we need to get a list of records fitting a specific condition, update those records, and then pass that list of records off to the next step in the procedure.  Most of the time, people will create a temp table like so:

BEGIN TRANSACTION

CREATE TABLE #Fact
(
	Id INT,
	SomeVal CHAR(1)
);

INSERT INTO dbo.#Fact
(
	Id,
	SomeVal
)
VALUES
(0,''),
(1, 'A'),
(2, 'B'),
(3, 'A'),
(4, '');

CREATE TABLE #tmp
(
	Id INT
);

INSERT INTO dbo.#tmp
(
	Id
)
SELECT
	Id
FROM #Fact
WHERE
	SomeVal = '';

UPDATE f
SET
	SomeVal = 'C'
FROM #Fact f
	INNER JOIN #tmp t
		ON f.Id = t.Id;

SELECT
	Id
FROM #tmp;

ROLLBACK TRANSACTION

Instead of using a temp table to update and then select values, I can use the OUTPUT clause to simplify this process.

BEGIN TRANSACTION

CREATE TABLE #Fact
(
	Id INT,
	SomeVal CHAR(1)
);

INSERT INTO dbo.#Fact
(
	Id,
	SomeVal
)
VALUES
(0,''),
(1, 'A'),
(2, 'B'),
(3, 'A'),
(4, '');

UPDATE f
SET
	SomeVal = 'C'
OUTPUT
	DELETED.Id
FROM #Fact f
WHERE
	f.SomeVal = '';

ROLLBACK TRANSACTION

Instead of creating a temp table, populating that temp table, updating the base table, and selecting from the temp table, I have a single update statement. What’s nice is that in addition to being more concise and elegant, the code is also faster:

[Image: SimplifyTempTable, comparing the estimated costs of the two approaches]

I ran both versions of the code against one another, first the old version and then the new.  The old version is responsible for 65% of the estimated cost of the batch, and looking only at the parts I changed, the new version is more than twice as efficient in terms of I/O and much better in terms of CPU.  This is a case in which simpler code is also more efficient, making it a win-win.
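
One related note:  if a later step in the procedure really does need the list of updated IDs, you can still avoid the extra temp table pass by sending the OUTPUT rows into a table variable.  This is a sketch under that assumption, reusing the #Fact table from the example above:

DECLARE @UpdatedIds TABLE (Id INT NOT NULL);

UPDATE f
SET
	SomeVal = 'C'
OUTPUT
	DELETED.Id
INTO @UpdatedIds (Id)
FROM #Fact f
WHERE
	f.SomeVal = '';

-- @UpdatedIds now feeds the next step in the procedure.
SELECT
	Id
FROM @UpdatedIds;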

Summary

In this blog post, I covered a number of techniques we can use to simplify code, looking at several use cases based on actual code I’ve dealt with.  There are plenty of other techniques out there, but hopefully this gives you some ideas on how to improve your own code base.

Stay tuned for tomorrow’s post, in which I look into how we can simplify entire processes and not just a single procedure.