Modularize(?)

Let’s hit the final topic of my Working Effectively with Legacy SQL series.  Today, we discuss the topic of modularization.  In object-oriented programming, modularization is a great idea.  One of the “clean code” tenets is Don’t Repeat Yourself:  if you have two pieces of code doing the same work, try to unify that code, isolating that particular functionality in its own part of the code base.

This sounds like a fantastic idea, but it turns out not to work quite as well in SQL Server.  There are two reasons for this:  first, T-SQL is not an object-oriented language; second, trying to modularize complex queries can confuse the query optimizer and give you terrible execution plans and performance.  I wrote about this topic just about a year ago, so I’ll link to my explanations of why nested views and user-defined functions can be anti-patterns in T-SQL.

The gist of these two blog posts is that the most common forms of encapsulating data (views and user-defined functions) can come with a major performance loss, and so you should use them either sparingly or not at all.  In the spirit of improving the lives of our maintenance developers, however, let’s think about some ways in which modularization can make sense.

The first way in which modularization can make sense is separating commonly-used, independent queries into their own procedures.  Last time around, I talked about error and performance logging.  I can move those INSERT statements into their own separate utility procedures and simply call them from my main procedures.  The reason I would want to do this is that it makes future changes easier.  Suppose that I need to insert a record into PerformanceLog today, but a year from now, I’m inserting it into a different table with a different column structure, or I’m firing off a Service Broker message, or maybe I’ve got dynamic logic to determine whether this record should even be inserted into the table.  These are changes I don’t want to make in every single stored procedure, so it makes sense to wrap them up in a utility procedure.
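To make that concrete, here is a minimal sketch of such a utility procedure, assuming the dbo.PerformanceLog table from last time’s post.  The procedure name and its parameter list are just one way to shape it:

CREATE PROCEDURE dbo.RecordProcedurePerformance
	@DatabaseName NVARCHAR(128),
	@SchemaName NVARCHAR(128),
	@ProcedureName NVARCHAR(128),
	@StartDateGMT DATETIME2(7),
	@EndDateGMT DATETIME2(7),
	@ParameterList XML
AS

-- Today this is a single INSERT.  If we later change the table structure,
-- fire a Service Broker message, or add logic deciding whether to log at all,
-- we change it here and nowhere else.
INSERT INTO dbo.PerformanceLog
(
	DatabaseName,
	SchemaName,
	ProcedureName,
	StartDateGMT,
	EndDateGMT,
	ParameterList
)
VALUES
(
	@DatabaseName,
	@SchemaName,
	@ProcedureName,
	@StartDateGMT,
	@EndDateGMT,
	@ParameterList
);
GO

Each calling procedure then replaces its inline INSERT with a single EXEC dbo.RecordProcedurePerformance call, and any future change to the logging mechanism happens in exactly one place.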

Another example could be the SELECT statements you use to load certain tables.  Suppose you have a batch insertion procedure which selects data from some table with certain filters and then inserts records into a separate table.  Splitting that portion out into its own procedure can make sense on a couple of levels.  First, it lets you troubleshoot the SELECT portion of the query independently of the INSERT portion.  Second, if you also need to call the SELECT statement separately for reporting purposes, you can do so.  Just remember that if you do decide to create a sub-procedure, you incur a little bit of overhead cost and make your overall design a little more complex, so save this for cases in which it really makes sense.
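Here is a hedged sketch of that split, with made-up object names.  The INSERT…EXEC pattern lets the loader reuse exactly the same SELECT logic that reports can call directly:

CREATE PROCEDURE dbo.GetUnprocessedOrders
	@StartDate DATE
AS

-- The SELECT logic lives here, so we can run and tune it independently.
SELECT
	o.OrderID,
	o.CustomerID,
	o.OrderTotal
FROM dbo.Orders o
WHERE
	o.OrderDate >= @StartDate
	AND o.IsProcessed = 0;
GO

CREATE PROCEDURE dbo.LoadOrderStaging
	@StartDate DATE
AS

-- The batch insertion procedure simply consumes the SELECT procedure's results.
INSERT INTO dbo.OrderStaging
(
	OrderID,
	CustomerID,
	OrderTotal
)
EXEC dbo.GetUnprocessedOrders
	@StartDate = @StartDate;
GO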

A third way in which you can think of implementing DRY within the SQL Server world is getting rid of redundant procedures.  If you have one procedure which returns { A, B } and a separate procedure which returns { A, B, C }, re-work the code to call the latter procedure and drop the former.  I can see exceptions when you have a fast { A, B } (which uses a covering, non-clustered index) versus another procedure which returns { A, B, …, Y, Z }, but if the difference is minor and the relevant index covers both queries, stick with one procedure.  Also, get rid of any true duplicates.  You might not expect to see any duplicate procedures, but they do pop up:  two procedures with different names which return exactly the same result set.  Maybe they returned different results at one point in time, or maybe the person who wrote the second procedure didn’t know about the first; whatever the reason, there is zero value in maintaining two versions of the same procedure.  Sometimes you have to do some mental parsing to determine whether two queries are exactly the same (and after you do that, reflect upon just how great the optimizer is, as it tries to do the same thing in a fraction of a second).  If you do have duplicates, here is a good place to practice DRY.  As a quick side note, this applies just the same to redundant indexes, so check those out as well.
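On the index side, here is a rough starting point rather than a proof of redundancy:  a query which flags nonclustered indexes on the same table sharing the same leading key column, the most common symptom worth reviewing.  Included columns and the rest of the key still matter, so treat the results as a to-review list:

SELECT
	OBJECT_NAME(i.object_id) AS TableName,
	i.name AS IndexName,
	c.name AS LeadingKeyColumn
FROM sys.indexes i
	INNER JOIN sys.index_columns ic
		ON i.object_id = ic.object_id
		AND i.index_id = ic.index_id
	INNER JOIN sys.columns c
		ON ic.object_id = c.object_id
		AND ic.column_id = c.column_id
WHERE
	ic.key_ordinal = 1					-- leading key column only
	AND i.type_desc = 'NONCLUSTERED'
	AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY
	TableName,
	LeadingKeyColumn,
	IndexName;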

Wrapping this section up, there are relatively few ways in which we as T-SQL developers can practice the DRY principle.  In many cases, it makes more sense to de-modularize code, moving out of nested views and away from table-valued functions to standard queries against tables.  We do this because we want to give the query optimizer a fighting chance at coming up with an execution plan which performs well and because, unlike in object-oriented languages like C# and Java (where the marginal cost of an extra object is negligible), the performance penalty for encapsulating code in T-SQL is fairly high.

Improve Code Durability And Visibility

We’re almost at the end of my Working Effectively with Legacy SQL series.  My penultimate topic will be all about durability and visibility of your code.  Looking at web and application development, we see certain patterns about logging and error handling which make sense in the T-SQL world, and I will cover those.  I also intend to cover a couple of SQL-specific tips.

Error Handling

Let’s start with the point of common ground:  error handling.  Your applications (should) have error handling logic; in the world of .NET, that typically means exceptions which you handle with TRY-CATCH blocks.  Microsoft implemented TRY-CATCH logic in SQL Server 2005, where it replaced the venerable @@ERROR function.  I highly recommend that you switch to using TRY-CATCH whenever possible in your SQL code, as it makes error handling more reliable and more readable.  If you are familiar with TRY-CATCH from C# or Java, the T-SQL version is very similar, except that you don’t get multiple CATCH blocks for handling different classes of exceptions.

So here’s an example of a simple query with TRY-CATCH.  We’re going to use this simple query throughout the post as we build more and more logic into that procedure.

CREATE PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

SELECT
	1.0 / @Divisor AS Quotient;
GO

This is very simple code. If you run it like so, you’ll get the expected result:

EXEC dbo.GetFraction
	@Divisor = 5;

But what happens if somebody tries to divide by 0?

EXEC dbo.GetFraction
	@Divisor = 0;

[Image: DivideByZero, showing the “Divide by zero error encountered” error message]

We get an error, just like we would expect.  The problem here is that nobody whose job it is to maintain the system has knowledge of this error.  We don’t handle it anywhere and are relying upon the calling application or end user to have enough information to troubleshoot this issue.  That might be okay in a simplistic example like this, but in a real development scenario, the error might be due to interdependencies between several parameters and a specific data set; re-generating that scenario could waste a lot of the troubleshooter’s time.

With that in mind, let’s introduce TRY-CATCH logic to the procedure.

IF (OBJECT_ID('dbo.GetFraction') IS NULL)
BEGIN
	EXEC ('CREATE PROCEDURE dbo.GetFraction AS SELECT 1 AS Stub;');
END
GO

ALTER PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

BEGIN TRY
	SELECT
		1.0 / @Divisor AS Quotient;
END TRY
BEGIN CATCH
	DECLARE
		@ErrorMessage NVARCHAR(4000);

	SELECT
		@ErrorMessage = ERROR_MESSAGE();

	THROW 50000, @ErrorMessage, 1;
END CATCH
GO

With that in place, we call the procedure with a divisor of 0 and…well, the same thing happens.  To this point, we haven’t really done anything special.  What we have done, however, is build the foundation for the next step. As a quick note before I move on, I want to point out that I re-throw the error in the catch block. This is important because we still want the application to fail like before; our goal here is to gain insight into failures more than trying to handle all errors. You can, of course, handle your errors and take care of them if you know the appropriate response, but for this scenario, we just want to log them.

Logging Errors

Until you catch your errors, you can’t log them.  Now that we’re catching errors in this procedure, we can log relevant information every time the procedure throws an exception.

So why don’t we just log whatever gets sent back to the application using a framework like ELMAH?  I hope you are doing this!  But remember that your data environment is more than just procedures the application calls.  You probably have SQL Agent jobs, SSIS packages, SSRS reports, other applications, etc. all hitting that database.  Yes, there are ways of logging calls for each of those, but now you need to look at a number of different sources to determine whether you have any unbeknownst-to-you errors popping up.  Also, these tools vary wildly in the granularity and value of the information they log.  SQL Agent jobs have a fixed result size, meaning that they truncate error data; this makes it much harder to troubleshoot a long-running job.  Also, relatively few options log the full call with all parameters and values, meaning that you may know what the error is, but not necessarily the set of steps required to replicate it.  With complex stored procedures, simply knowing that there was an error isn’t always enough to replicate it; you need parameters.  One final problem is that even if these loggers do collect information, they don’t really know the SQL “call stack.”  The application code doesn’t know that procedure1 called procedure2 which called procedure3 which called … procedureN.  We know that procedure1 was the initial call, and the error message typically tells us that procedureN failed, but if you have that level of complexity in stored procedure calls, troubleshooting becomes tedious.

With all of those problems in mind, let’s try to come up with a simplistic solution.  This is a scaled-down version of something we use at the office:

CREATE TABLE [dbo].[ErrorLog]
(
	[DatabaseName] [NVARCHAR](128) NULL,
	[SchemaName] [NVARCHAR](128) NULL,
	[ProcedureName] [NVARCHAR](128) NULL,
	[ErrorLine] [INT] NULL,
	[ErrorMessage] [NVARCHAR](4000) NULL,
	[LogDateGMT] [DATETIME] NOT NULL,
	[ParameterList] [XML] NULL
);

With this table in mind, let’s fix up the CATCH block so that we can insert a row into the error log. Now our procedure looks a bit like this:

IF (OBJECT_ID('dbo.GetFraction') IS NULL)
BEGIN
	EXEC ('CREATE PROCEDURE dbo.GetFraction AS SELECT 1 AS Stub;');
END
GO

ALTER PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

BEGIN TRY
	SELECT
		1.0 / @Divisor AS Quotient;
END TRY
BEGIN CATCH
	DECLARE
		@DatabaseName NVARCHAR(128),
		@CallingProcedureName NVARCHAR(128),
		@CallingSchemaName NVARCHAR(128),
		@ErrorLine INT,
		@ErrorMessage NVARCHAR(4000),
		@ParameterList XML,
		@ErrorNumber INT;

	SELECT
		@DatabaseName = DB_NAME(),
		@CallingProcedureName = OBJECT_NAME(@@PROCID),
		@CallingSchemaName = OBJECT_SCHEMA_NAME(@@PROCID),
		@ErrorLine = ERROR_LINE(),
		@ErrorMessage = ERROR_MESSAGE(),
		@ErrorNumber = ERROR_NUMBER();

	SET @ParameterList =
	(
		SELECT
			@Divisor AS [@Divisor]
		FOR XML PATH ('ParameterList'), ROOT ('Root'), ELEMENTS XSINIL
	);

	INSERT INTO dbo.ErrorLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		ErrorLine,
		ErrorMessage,
		LogDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@ErrorLine,
		@ErrorMessage,
		GETUTCDATE(),
		@ParameterList
	);

	THROW 50000, @ErrorMessage, 1;
END CATCH
GO

That is a pretty lengthy CATCH block, and I would recommend creating a utility stored procedure to simplify some of this logic; I’ll sketch one out below.  In any event, when we call GetFraction with a valid divisor, we still get the same results back.  When we call it with a divisor of 0, we get the following error message:

[Image: DivideByZeroLogged, showing the same divide by zero error message returned to the caller]

Note that this error message looks the same as our old one, except that we have a line which reads (1 row(s) affected) right above it. That shows us that we inserted a record into the ErrorLog table:

[Image: ErrorLogInsert, showing the newly inserted row in dbo.ErrorLog]

This tells us when we had an error, what the error was, and if you click on ParameterList, you’ll even get a listing of which parameters were passed in:

<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <ParameterList Divisor="0" />
</Root>

There’s a little work translating parameters back to the relevant T-SQL, but what’s nice is that you have everything you need to re-create this error message.
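As for the utility procedure I mentioned above, here is a minimal sketch of what it might look like, assuming the dbo.ErrorLog table we just created.  The name dbo.WriteErrorLog and its parameter list are inventions for this example, not objects that exist anywhere else:

CREATE PROCEDURE dbo.WriteErrorLog
	@DatabaseName NVARCHAR(128),
	@SchemaName NVARCHAR(128),
	@ProcedureName NVARCHAR(128),
	@ErrorLine INT,
	@ErrorMessage NVARCHAR(4000),
	@ParameterList XML
AS

-- Centralize the logging INSERT so each CATCH block stays short.
INSERT INTO dbo.ErrorLog
(
	DatabaseName,
	SchemaName,
	ProcedureName,
	ErrorLine,
	ErrorMessage,
	LogDateGMT,
	ParameterList
)
VALUES
(
	@DatabaseName,
	@SchemaName,
	@ProcedureName,
	@ErrorLine,
	@ErrorMessage,
	GETUTCDATE(),
	@ParameterList
);
GO

With that in place, the CATCH block shrinks down to gathering the ERROR_*() values, building @ParameterList, calling EXEC dbo.WriteErrorLog, and re-throwing.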

Before I move on, I want to note one more thing.  I collect the error number in @ErrorNumber, but I don’t use it in the THROW.  The reason I don’t re-throw using that error number is that if I do, I will get the following error:

[Image: ThrowOutOfRange, showing the error raised when THROW is given an error number below 50000]

If you want to re-raise an error message using THROW, you can simply issue a bare THROW; statement, or you can re-raise with a custom error number (which must be 50000 or above).  If your calling code expects to handle certain types of errors by error number, then you should just throw without re-numbering; otherwise, it doesn’t make that much of a difference.
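Here is the distinction in miniature:

BEGIN TRY
	SELECT 1 / 0;
END TRY
BEGIN CATCH
	-- Option 1:  a bare THROW re-raises the original error, preserving its
	-- original error number for any caller that checks numbers.
	THROW;

	-- Option 2 (shown commented out):  re-raise with a custom error number,
	-- which must be 50000 or higher.
	--DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
	--THROW 50000, @ErrorMessage, 1;
END CATCH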

Capture Performance Data

You probably have some sort of administrative tool which tracks procedure performance.  If so and if that works well enough for you, then you don’t need to do much here.  There may be circumstances in which you want to get finer-grained performance data, even if you have a top-level tool which tracks procedure times.  For example, suppose you have a procedure which generates a number of different dynamic SQL statements.  Suppose that we can generate a query for the sales team, a separate query for the marketing team, and a third query for the accounts payable team based on our data.  Each of those teams has sorting and filtering functionality, so the Sales query has the same core components but might not always look the same from run to run, especially the WHERE and ORDER BY clauses.  Our generator knows which query it will generate, so we can build performance collection metrics into our system by query, letting us figure out which of the three queries is actually a problem.

For this example, I’m going to stick with my simplistic division case.  We now want to introduce performance logging.  To do that, I’m going to track when the procedure begins and when it ends.  If the procedure fails, I don’t want to log the amount of time it took. If you do want to log performance of errors, you’d need to modify your CATCH block somewhat and it would make the procedure a little more complex.

To track this new performance data, we’ll create a new table called PerformanceLog:

CREATE TABLE [dbo].[PerformanceLog]
(
	[DatabaseName] [NVARCHAR](128) NULL,
	[SchemaName] [NVARCHAR](128) NULL,
	[ProcedureName] [NVARCHAR](128) NULL,
	[StartDateGMT] DATETIME2(7) NOT NULL,
	[EndDateGMT] DATETIME2(7) NOT NULL,
	[ParameterList] [XML] NULL,
	RunTimeInMilliseconds AS DATEDIFF(MILLISECOND, StartDateGMT, EndDateGMT)
);

Note that several of the columns in PerformanceLog are the same as in ErrorLog.  Thus, when I refactor my procedure to include performance information, I will move those matching variables out to a common area:

IF (OBJECT_ID('dbo.GetFraction') IS NULL)
BEGIN
	EXEC ('CREATE PROCEDURE dbo.GetFraction AS SELECT 1 AS Stub;');
END
GO

ALTER PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

DECLARE
	@StartDateGMT DATETIME2(7) = SYSUTCDATETIME(),
	@DatabaseName NVARCHAR(128) = DB_NAME(),
	@CallingProcedureName NVARCHAR(128) = OBJECT_NAME(@@PROCID),
	@CallingSchemaName NVARCHAR(128) = OBJECT_SCHEMA_NAME(@@PROCID),
	@ParameterList XML;

BEGIN TRY
	SET @ParameterList =
	(
		SELECT
			@Divisor AS [@Divisor]
		FOR XML PATH ('ParameterList'), ROOT ('Root'), ELEMENTS XSINIL
	);

	SELECT
		1.0 / @Divisor AS Quotient;

	INSERT INTO dbo.PerformanceLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		StartDateGMT,
		EndDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@StartDateGMT,
		SYSUTCDATETIME(),
		@ParameterList
	);
END TRY
BEGIN CATCH
	DECLARE
		@ErrorLine INT,
		@ErrorMessage NVARCHAR(4000),
		@ErrorNumber INT;

	SELECT
		@ErrorLine = ERROR_LINE(),
		@ErrorMessage = ERROR_MESSAGE(),
		@ErrorNumber = ERROR_NUMBER();

	INSERT INTO dbo.ErrorLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		ErrorLine,
		ErrorMessage,
		LogDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@ErrorLine,
		@ErrorMessage,
		GETUTCDATE(),
		@ParameterList
	);

	THROW 50000, @ErrorMessage, 1;
END CATCH
GO

Let’s walk through the code changes here.  First of all, I moved some of the variable declarations outside of the CATCH block.  I also added a new variable called @StartDateGMT, which I declared as DATETIME2(7) in order to get maximum precision.  If you want to use a DATETIME2(7), you shouldn’t use GETUTCDATE() anymore, because it only returns DATETIME precision.  Here’s a sample of the difference, where the first two records used GETUTCDATE() and the third used SYSUTCDATETIME():

[Image: GetUTCDate, comparing rows logged with GETUTCDATE() (DATETIME precision) against a row logged with SYSUTCDATETIME() (DATETIME2(7) precision)]

Remember that DATETIME is only accurate to roughly 3.3 milliseconds (values round to increments of .000, .003, or .007 seconds), so if your procedure takes less than that to run, you can’t store the difference.
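If you want to see the difference yourself, a quick side-by-side works; the values in the comments are illustrative only:

SELECT
	GETUTCDATE() AS UtcAsDatetime,		-- e.g., 2015-01-19 14:30:05.237
	SYSUTCDATETIME() AS UtcAsDatetime2;	-- e.g., 2015-01-19 14:30:05.2373859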

Setting the choice of time measure aside, the other big change was my insertion into dbo.PerformanceLog.  This is another case where creating a helper stored procedure might be worth it, but to make the demo simpler, I decided to keep that INSERT statement inline.

Now that we have error and performance tracking, let’s move on to the last section.

Handling Transactions

The above sections on using TRY-CATCH, logging errors, and logging performance are all right out of the handbook and apply to languages like C# or Java as well.  But here’s something which is more SQL-specific:  be sure to handle your transactions.

In this scenario, we aren’t opening any explicit transactions.  Our statement is a simple SELECT statement which does not modify data.  Nevertheless, let’s say that we really do want transactions.  Here’s the final version of the procedure:

IF (OBJECT_ID('dbo.GetFraction') IS NULL)
BEGIN
	EXEC ('CREATE PROCEDURE dbo.GetFraction AS SELECT 1 AS Stub;');
END
GO

ALTER PROCEDURE dbo.GetFraction
	@Divisor INT = 5
AS

DECLARE
	@InNestedTransaction BIT;

DECLARE
	@StartDateGMT DATETIME2(7) = SYSUTCDATETIME(),
	@DatabaseName NVARCHAR(128) = DB_NAME(),
	@CallingProcedureName NVARCHAR(128) = OBJECT_NAME(@@PROCID),
	@CallingSchemaName NVARCHAR(128) = OBJECT_SCHEMA_NAME(@@PROCID),
	@ParameterList XML;

BEGIN TRY
	SET @ParameterList =
	(
		SELECT
			@Divisor AS [@Divisor]
		FOR XML PATH ('ParameterList'), ROOT ('Root'), ELEMENTS XSINIL
	);

	IF ( @@TRANCOUNT > 0 )
	BEGIN
		SET @InNestedTransaction = 1;
	END
	ELSE
	BEGIN
		SET @InNestedTransaction = 0;
		BEGIN TRANSACTION;
	END;

	SELECT
		1.0 / @Divisor AS Quotient;

	INSERT INTO dbo.PerformanceLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		StartDateGMT,
		EndDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@StartDateGMT,
		SYSUTCDATETIME(),
		@ParameterList
	);

	IF ( @InNestedTransaction = 0 AND @@TRANCOUNT > 0 )
	BEGIN
		COMMIT TRANSACTION;
	END;
END TRY
BEGIN CATCH
	IF ( @InNestedTransaction = 0 AND @@TRANCOUNT > 0 )
	BEGIN
		ROLLBACK TRANSACTION;
	END

	DECLARE
		@ErrorLine INT,
		@ErrorMessage NVARCHAR(4000),
		@ErrorNumber INT;

	SELECT
		@ErrorLine = ERROR_LINE(),
		@ErrorMessage = ERROR_MESSAGE(),
		@ErrorNumber = ERROR_NUMBER();

	INSERT INTO dbo.ErrorLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		ErrorLine,
		ErrorMessage,
		LogDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@ErrorLine,
		@ErrorMessage,
		GETUTCDATE(),
		@ParameterList
	);

	THROW 50000, @ErrorMessage, 1;
END CATCH
GO

The big change was the addition of the @InNestedTransaction flag at the top of the procedure and its use throughout.  If you are already inside a transaction, you don’t want to go any deeper; nested transactions in SQL Server are broken:  an inner COMMIT merely decrements @@TRANCOUNT, while any ROLLBACK rolls back everything.

But why would we already be in a transaction?  The answer is:  parent procedures and processes.  GetFraction might be the third procedure in a chain, and if I have a chain of procedures, I don’t want to commit my transaction until everything has succeeded.  Putting it another way, suppose Procedure1 calls Procedure2 and Procedure3.  If Procedure2 succeeds and Procedure3 fails, I want to roll back everything—I don’t want to leave data in a state in which some of the information is new (because Procedure2 ran) but some of it is old (because Procedure3 was rolled back).

Note that my PerformanceLog insert statement is inside the transaction.  You can argue either way about this one, but my general tendency is to make sure all INSERT statements are wrapped inside a transaction.  The other side of this argument is that you wouldn’t want your procedure to fail simply because logging failed.  If you are insistent upon this point, you could take that INSERT statement out of the transaction and wrap the insertion statement in a TRY-CATCH block which swallows errors.  Just be sure to leave a note explaining why you’re swallowing errors so that whoever needs to maintain this procedure doesn’t curse your name later when they see no data for performance stats.
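If you do go that route, the shape of it would look something like this fragment of the procedure above, with the comment doing as much work as the code:

BEGIN TRY
	INSERT INTO dbo.PerformanceLog
	(
		DatabaseName,
		SchemaName,
		ProcedureName,
		StartDateGMT,
		EndDateGMT,
		ParameterList
	)
	VALUES
	(
		@DatabaseName,
		@CallingSchemaName,
		@CallingProcedureName,
		@StartDateGMT,
		SYSUTCDATETIME(),
		@ParameterList
	);
END TRY
BEGIN CATCH
	-- Deliberately swallowing this error:  we would rather lose a performance
	-- log row than fail the calling procedure over logging.  If performance
	-- data ever goes missing, check this block first.
	DECLARE @SwallowedMessage NVARCHAR(4000) = ERROR_MESSAGE();
END CATCH;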

Wrapping Up

Our seven-line procedure (including some generous spacing) has ballooned up to 105 lines as a result of all of these changes.  Yeah, you can probably shave a couple dozen lines off with utility procedures, but we added a lot of boilerplate.  On the other hand, we now have error logging, performance logging, and transaction safety.  We definitely want error logging and (when modifying data) transaction safety, so those are certainly worthy endeavors.  Performance logging is a little more iffy, particularly if you have a solution in place which returns these numbers, but it can be helpful as well.

Browns find OC with awesome nickname

Remember when I ranked a list of OC candidates a few days ago? I woke up this morning to see that #6 on my list was gone. Chan Gailey (who didn’t make the final list) was hired by the Jets, because the best way to make a horrible offense better is to implement a scheme for which you have zero talent. Nobody else on my list has a job yet.

Then I checked feedly and found that, a few hours after Baltimore got their OC, Cleveland found their man. John DeFilippo. #7 on my list. Granted, he’s there mostly because he’s entirely unproven, but still not my first choice. My favorite passive-aggressive optimistic post came from Waiting for Next Year.

Admittedly, settling for a candidate that the Browns passed over last season is hardly encouraging. Surely more details about DeFilippo will emerge throughout the day, as will (hopefully) what the Browns found so attractive about DeFilippo. Browns fans undoubtedly will want to become familiar with the man they expect to play a prominent role in their football lives for at least the next 11.5 months or so. Hopefully DeFilippo’s background as a quarterbacks coach will help him draw out the inner All-Pro from the likes of Brian Hoyer and Johnny Manziel.

Well, Hoyer’s a free agent and won’t be coming back, but the rest of that paragraph delights me and saddens me, much like the Browns on a regular basis.  One of his former bosses gives him rave reviews.  Yet what inspires me most of all was this article, which got me weirdly excited for DeFilippo to succeed.  It’s an interview with Derek Carr, who also sang the praises of the new OC.  He also revealed that DeFilippo has a nickname: “Flip.”

Yes, it’s a play on his name. However, it also encases thousands of possibilities, from a life as a secret ninja who fights crime to a fan of Flip Wilson to an all-time master of Flip Cup (which is surprisingly hard, at least for me). Maybe he’s going to be the coaching equivalent of Two-Face? Perhaps he’s a gymnastics enthusiast? Or a pro wrestling enthusiast?

The Bleacher Report (linked above) points out exactly why Flip got the new job:

DeFilippo was wanted by many as a quarterbacks coach before Cleveland offered him the offensive coordinator position[.]

I imagine Cleveland first offered him the job to be QB coach, he declined, and then ol’ Flip was given the keys to the kingdom. Well, maybe more of a barony. He’s also from Youngstown and has ties to Pettine, which probably helped.

Overall, I’m not sure how he’ll turn out. He probably won’t have a pathological obsession with Kirk Cousins, which is definitely a point in his favor. (Now I just got a mental image of Kyle Shanahan wearing a “used” Kirk Cousins jersey at all times. And by “used” I don’t mean during a game.) Carr also pointed out that Flip got “the most” out of Terrelle Pryor, which isn’t saying much.

All I have to say is this: Flip. Flipflipflipflipflip. FLIP.

Modernize Your Code

We’ve covered working with legacy SQL in some depth thus far.  Today’s topic is a reminder that you should modernize your code.  There are some things we needed to do when working with SQL Server 2000 that no longer make sense when working with SQL Server 2014.  For example, learn how to use window functions (introduced in 2005 but really improved in 2012), the APPLY operator (introduced in 2005), the CONCAT function (introduced in 2012), TRY_CONVERT (introduced in 2012), OFFSET-FETCH (introduced in 2012), and so on.  Each of these solves long-standing problems (OFFSET-FETCH perhaps less than I’d like…), so as you maintain your old code, bring it up to date with modern practices.
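As a quick taste of two of these (the table and column names in the paging example are made up):

SELECT
	TRY_CONVERT(INT, '1234') AS GoodConversion,	-- 1234
	TRY_CONVERT(INT, 'N/A') AS BadConversion;	-- NULL instead of an error

SELECT
	p.ProductID,
	p.ProductName
FROM dbo.Product p
ORDER BY
	p.ProductName
OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;		-- page 3, with 10 rows per page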

JDate:  A Case Study

Let me walk through a case study of where modernization really helps.  An attribute named JDate has been my bête noire for the past year.  JDate is a modified Julian date which is stored in our system as:

SELECT
	CONVERT(INTEGER, DATEADD(HH, -12, GETUTCDATE()));

The impetus for this was that, back when this code was created, there was no DATE type.  Therefore, using this modified Julian date would save four bytes and allow for quick conversions like JDate - 5 to indicate 5 days ago.  I would argue that this was not a good design to begin with—making assumptions about how exactly a DATETIME value is stored is sketchy—but we’ll say that it was a reasonable decision at the time.

Today, however, this is not a reasonable design.  There are several problems with using JDate:

  1. The value is incomprehensible.  The JDate value for January 19, 2015 is 42021.  Unlike a well-specified date type, this JDate is essentially a surrogate key, which means that people need at least one additional step to understand the actual date under question.
  2. This design leads to extra steps which have a performance cost.  Our code base is riddled with conversions from real dates to JDates and from JDates back to real dates.  This causes a minor database performance hit and a larger developer performance hit as it takes longer to read and understand the code.
  3. Since SQL Server 2008, we’ve had the DATE type.  DATE types are 3 bytes, meaning that they take even less space than an INT (4 bytes) or a classic DATETIME (8 bytes).  They also support DATEADD(), which means we can easily get DATEADD(DAY, -5, @SomeDate).  We can even save DATEADD(DAY, -5, GETUTCDATE()) as a DATE type.
  4. “JDate” is used as an attribute on several tables, but it is as semantically meaningful as naming a date column “Date” or an integer field “Int.”  We don’t really know which date this represents, and that makes understanding attributes on a table a little bit harder.

The solution here is to fix the design.  This is a pretty large change, but it boils down to simplifying individual procedures and smaller processes.  In my case, I’ve come up with a design document to populate various DATE columns (named appropriately) with the current JDate information, modify procedures to support the date columns instead of JDate, and update calling C# code to support the date columns.  This isn’t an overnight change, but the net result is that I clean up a long-standing piece of technical debt and get an opportunity to refactor a substantial portion of the code base.

Modernization On The Small

Modernization doesn’t have to be a series of gigantic changes, however; sometimes, small changes can make a big difference in readability.  Think about how we got the date without time in SQL Server prior to 2008:

SELECT
	DATEADD(DAY, 0, DATEDIFF(DAY, 0, GETUTCDATE()));

This code works, but it isn’t quite intuitive. Instead, with SQL Server 2008, we can do the following:

SELECT
	CAST(GETUTCDATE() AS DATE);

This code is easier to read and more intuitive.  It’s a simple piece of code, but makes your life as a maintenance developer easier.  Similarly, the CONCAT function simplifies building strings:

SELECT
	CONCAT
	(
		'This is a string.  ',
		NULL,
		12,
		'   .....   ',
		31.884,
		'Some other string.  '
	);

Old-style code would have looked like:

SELECT
	'This is a string.  ' +
	ISNULL(NULL, '') +
	CAST(12 AS VARCHAR(50)) +
	'   .....   ' +
	CAST(31.884 AS VARCHAR(50)) +
	'Some other string.  ';

And that’s the simpler version. Change all of these bits to variables and you’ll need to ISNULL() and CAST() even more.

To wrap up this section, I want to reiterate that you should keep your code base modern.  Understand new features as they come out and see if they can make your life easier.  This also means being willing to question your code’s assumptions and see if a particular design still makes sense.  In the case of JDate, we have a design which no longer makes sense and which is worth changing.  In other cases, it might be worth a comment explaining why you pick (or stick with) one particular design over something which may appear to be a better design at first glance.


Finding And Removing Obsolete SQL

We’re getting close to the home stretch in my Working Effectively with Legacy SQL series.  Today’s post is all about obsolete SQL.  Nothing kills a project more than oodles of unused code filling up the code base.  When I started working with my current team, I was put in charge of approximately 800 stored procedures.  Within two months, that number dropped to 650 simply because I removed obsolete stored procedures.  I also eliminated approximately 25 functions and 30 views.  Today’s code base has 19 functions (of which at least three are on my hit list) and 20 views, so we’re talking about a 50% decrease in number of objects with no loss in functionality.

This is one form of finding and removing obsolete SQL.  Those 205 objects I deprecated mean 205 objects I no longer need to maintain and my successor(s) will never need to learn.  We’ll talk about how to find obsolete objects in just a little bit.  First, let’s talk about the other type of obsolete code:  obsolete code within objects.

There are a few ways in which obsolete code manifests itself within an object.  For the purposes of demonstration, let’s focus on stored procedures here, even though we could find the same within table-valued functions, views, and even tables.  To enumerate various ways in which we can find obsolete SQL within a single stored procedure, I’m going to use the following example:

CREATE PROCEDURE dbo.TestProcedure
@SomeValue INT = 25, --Code always passes in 25
@Person VARCHAR(30) = 'Bob'
AS

IF (@SomeValue <> 25)
BEGIN
	THROW 50000, 'Value must be 25!', 1;
END

IF (@SomeValue = 20)
BEGIN
	SELECT 'OK' AS Result;
END

IF (@SomeValue = 21)
BEGIN
	SELECT 'OK' AS Result;
END

SELECT
	@Person AS Person,
	--'Something' AS Result
	@SomeValue AS Value;

--IF (@Person = 'Melinda')
--BEGIN
--	DECLARE
--		@var1 INT,
--		@var2 INT,
--		@var3 INT;

---- This code needs to run every time!!!
--	INSERT INTO dbo.SomeTable
--	(
--		val
--	)
--	VALUES
--	(
--		@SomeValue
--	);

--	SELECT
--		@var1 = v1,
--		@var2 = v2,
--		@var3 = v3
--	FROM Someplace
--	WHERE
--		Something = 29;
--END

IF (1 = 0)
BEGIN
	THROW 50000, 'Boolean logic has been invalidated!', 1;
END

SELECT
	1
FROM Something
WHERE
	1 = 0;

In this simple example, we have four separate sections of obsolete code.  The first section is the pair of IF blocks checking whether @SomeValue is 20 or 21.  @SomeValue will never be 20 or 21 because we already check, right at the top of the procedure, whether @SomeValue is something other than 25; if it is, we throw an exception and leave the procedure.  You generally see this logical impossibility scenario in older procedures, where the first developer had special cases for values 20 and 21.  Eventually, a requirement change made it so that the value must be equal to 25.  Instead of cleaning up the procedure, the developer making this change (often a different developer than the one who initially wrote the procedure) is “being careful” and simply adds a case on top.  This leads to unnecessary checks and a minor performance degradation, but more importantly, it makes the code harder to read.  When refactoring a procedure, be on the lookout for logical impossibilities and remove them.

The next bit of obsolete code is the commented-out Result column in the SELECT statement.  I see this in a lot of code, where developers are afraid to remove a column because it might be useful in the future.  The big problem here is that, if there’s a bug in the procedure, I don’t know whether your commented-out line is the bug or is orthogonal to the actual bug.  For all I know, you were doing some testing and accidentally forgot to uncomment that line, and it takes me extra time to try to unravel the problem.  Delete commented-out lines.

The commented-out block handling @Person = ‘Melinda’ is an extended example of the previous problem.  In this case, we have an entire segment of code commented out.  For bonus effect, I added a comment inside it saying that this code section needs to run every time…and yet it won’t.  So which is true, the comment or the commented-out code?  Again, if you’re dealing with a bug in this procedure, commented-out code leads to more confusion than necessary.

The last two statements are special cases of the logical impossibility.  The first has an impossible IF (1 = 0) condition, and the second has an impossible 1 = 0 condition in the WHERE clause.  Most of the time, you’ll see these when requirements change.  Maybe “IF (1 = 0)” used to read “IF (@SomeVal > 23)” but that requirement changed over time.  Again, this hurts a lot more than it helps; delete it.

So here’s the final version of the query, removing all obsolete code:

CREATE PROCEDURE dbo.TestProcedure
@SomeValue INT = 25, --Code always passes in 25
@Person VARCHAR(30) = 'Bob'
AS

IF (@SomeValue <> 25)
BEGIN
	THROW 50000, 'Value must be 25!', 1;
END

SELECT
	@Person AS Person,
	@SomeValue AS Value;

This version of the procedure is a lot more maintainable than the prior version.  Yes, this is a toy example, but when you’re dealing with a 4000-line monstrosity, eliminating a few hundred lines of obsolete SQL still makes a pretty big difference.

Given that, why do people hoard code?  It all comes down to risk aversion and fear of (or lack of) source control.  Hypothetically, that commented-out code could become useful again someday, so if you need it again, it’s better to keep it around.  Well, hypothetically, that newspaper from 1978 could come in handy again someday, but the likelihood of this being true is slim; “could come in handy” isn’t a good enough reason to hoard every copy of the newspaper going back 40 years, and it isn’t a good enough reason to keep commented code around.

But here’s the better point:  even if you know you need it again someday, you still don’t need to comment out your code.  Use source control!  I understand that there are plenty of organizations which don’t use source control for their SQL code—I worked at one where the DBA team pushed for this but the web dev team (which wrote a lot of these procedures) didn’t want to use source control because reasons.  These organizations are doing it wrong.  There are zero reasons why you should not use source control for database code.  If you believe there is a reason, either you are wrong or you are doing it wrong.  You don’t need to use Git or Mercurial (although they’re nice); even if you’re using SVN or TFVC, that does the job.  Every single source control tool has the ability to look back in history, so if you ever need that code again, you can get it.  If your company doesn’t use source control, you can always get Git or Mercurial and use that locally—that’s one benefit to a distributed version control system.

So, moving on from obsolete code within a stored procedure, let’s look at finding obsolete stored procedures.  This is a little bit harder of a problem, but we have some tools available to us to help out.  Within SQL Server, we can look at the plan and procedure caches as indicators of a procedure being in use.  These aren’t perfect, but if a procedure shows up on the list, at least we know it has been used recently enough to be in the procedure or plan cache.
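For example, here is a rough, by-no-means-conclusive check against the procedure cache.  Note that sys.dm_exec_procedure_stats only covers procedures whose plans are currently cached, and it requires VIEW SERVER STATE permission:

-- Procedures which have executed recently enough to still be in the cache.
SELECT
	OBJECT_SCHEMA_NAME(ps.object_id, ps.database_id) AS SchemaName,
	OBJECT_NAME(ps.object_id, ps.database_id) AS ProcedureName,
	ps.execution_count,
	ps.last_execution_time
FROM sys.dm_exec_procedure_stats ps
WHERE
	ps.database_id = DB_ID()
ORDER BY
	ps.last_execution_time DESC;

A procedure missing from this list is a candidate for further investigation, not automatic deletion, because plans age out of the cache.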

Aside from that, check out source control.  We use a tool called OpenGrok to search; it provides a URL-based interface for searching within a code base, and you can look for specific stored procedure calls in your data layer.  Alternatively, you could use a grep-like tool (e.g., egrep or PowerShell’s Select-String) to search within a directory of code files for that procedure call.  The same goes for Integration Services packages and Reporting Services reports—those are XML files, and you can search inside them for procedure calls.

If you’re looking at scheduled tasks, you can check SQL Agent jobs pretty easily:  msdb.dbo.sysjobs and msdb.dbo.sysjobsteps will contain information, so a quick search against those tables would return your answer.  If you have a third-party scheduler, it probably has an interface for searches.
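The SQL Agent check can be as simple as this; swap in whichever procedure name you are hunting:

SELECT
	j.name AS JobName,
	js.step_id,
	js.step_name,
	js.command
FROM msdb.dbo.sysjobs j
	INNER JOIN msdb.dbo.sysjobsteps js
		ON j.job_id = js.job_id
WHERE
	js.command LIKE '%GetFraction%';	-- substitute the procedure you're hunting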

Naturally, objects can call other objects.  In SQL Server, sys.sql_modules has a definition attribute which contains the definition of an object.  You can search through those definitions to see if your stored procedure call exists.  You may also have metadata tables which contain stored procedures to call, so querying those will help.
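The object-definition search is a single query, and the same LIKE pattern works against any metadata tables you keep:

SELECT
	OBJECT_SCHEMA_NAME(sm.object_id) AS SchemaName,
	OBJECT_NAME(sm.object_id) AS ObjectName
FROM sys.sql_modules sm
WHERE
	sm.definition LIKE '%GetFraction%';	-- substitute the procedure you're hunting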

I’m being purposefully vague on this topic because I’m working on a tool to help people search their code bases for potentially obsolete code.  The problem is that you can never absolutely prove that a procedure is not in use, but given enough checks, you can have a higher level of confidence and get rid of a lot of unused procedures.  In my example above, we dropped over 150 procedures, and I needed to restore only one of them, a procedure which people occasionally run (manually) to fix data but which hadn’t been run in a few months.  Don’t be afraid to research and drop unused procedures; it makes your life easier, and if you ever make a mistake, pull that procedure back out from source control.