Finding And Removing Obsolete SQL

We’re getting close to the home stretch in my Working Effectively with Legacy SQL series.  Today’s post is all about obsolete SQL.  Nothing kills a project more than oodles of unused code filling up the code base.  When I started working with my current team, I was put in charge of approximately 800 stored procedures.  Within two months, that number dropped to 650 simply because I removed obsolete stored procedures.  I also eliminated approximately 25 functions and 30 views.  Today’s code base has 19 functions (of which at least three are on my hit list) and 20 views, so we’re talking about a roughly 50% decrease in the number of objects with no loss in functionality.

This is one form of finding and removing obsolete SQL.  Those 205 deprecated objects are 205 objects I no longer need to maintain and that my successor(s) will never need to learn.  We’ll talk about how to find obsolete objects in just a little bit.  First, let’s talk about the other type of obsolete code:  obsolete code within objects.

There are a few ways in which obsolete code manifests itself within an object.  For the purposes of demonstration, let’s focus on stored procedures here, even though we could find the same within table-valued functions, views, and even tables.  To enumerate various ways in which we can find obsolete SQL within a single stored procedure, I’m going to use the following example:

CREATE PROCEDURE dbo.TestProcedure
@SomeValue INT = 25, --Code always passes in 25
@Person VARCHAR(30) = 'Bob'
AS

IF (@SomeValue <> 25)
BEGIN
	THROW 50000, 'Value must be 25!', 1;
END

IF (@SomeValue = 20)
BEGIN
	SELECT 'OK' AS Result;
END

IF (@SomeValue = 21)
BEGIN
	SELECT 'OK' AS Result;
END

SELECT
	@Person AS Person,
	--'Something' AS Result
	@SomeValue AS Value;

--IF (@Person = 'Melinda')
--BEGIN
--	DECLARE
--		@var1 INT,
--		@var2 INT,
--		@var3 INT;

---- This code needs to run every time!!!
--	INSERT INTO dbo.SomeTable
--	(
--		val
--	)
--	VALUES
--	(
--		@SomeValue
--	);

--	SELECT
--		@var1 = v1,
--		@var2 = v2,
--		@var3 = v3
--	FROM Someplace
--	WHERE
--		Something = 29;
--END

IF (1 = 0)
BEGIN
	THROW 50000, 'Boolean logic has been invalidated!', 1;
END

SELECT
	1
FROM Something
WHERE
	1 = 0;

In this simple example, we have four separate sections of obsolete code.  The first section is the pair of IF blocks checking whether @SomeValue equals 20 or 21.  @SomeValue will never be 20 or 21 at that point because the initial check throws an exception and leaves the procedure whenever @SomeValue is anything other than 25.  You generally see this logical impossibility scenario in older procedures, where the first developer had special cases for values 20 and 21.  Eventually, a requirement change made it so that the value must be equal to 25.  Instead of cleaning up the procedure, the developer making this change (often a different developer than the one who initially wrote the procedure) is “being careful” and simply adds a check on top.  This leads to unnecessary checks and a minor performance degradation but, more importantly, makes the code harder to read.  When refactoring a procedure, be on the lookout for logical impossibilities and remove them.

The next bit of obsolete code is the commented-out Result column in the main SELECT statement.  I see this in a lot of code, where developers are afraid to remove a column because it might be useful in the future.  The big problem here is that if there’s a bug in the procedure, I don’t know whether your commented-out line is the bug or is orthogonal to the actual bug.  For all I know, you were doing some testing and accidentally forgot to uncomment that line, and it takes me extra time to try to unravel the problem.  Delete commented-out lines.

The commented-out block checking @Person = 'Melinda' is an extended example of the previous problem.  In this case, we have an entire segment of code commented out.  For bonus effect, I added a comment inside the block saying that this code section needs to run every time…and yet it won’t.  So which is true, the comment or the commented-out code?  Again, if you’re dealing with a bug in this procedure, commented-out code leads to more confusion than necessary.

The final two statements, the IF (1 = 0) block and the query with WHERE 1 = 0, are special cases of the logical impossibility.  The first scenario has an impossible IF condition, and the second scenario has an impossible condition in the WHERE clause.  Most of the time, you’ll see these when requirements change.  Maybe “IF (1 = 0)” used to read “IF (@SomeVal > 23)” but that requirement changed over time.  Again, this hurts a lot more than it helps; delete it.

So here’s the final version of the procedure, with all of the obsolete code removed:

CREATE PROCEDURE dbo.TestProcedure
@SomeValue INT = 25, --Code always passes in 25
@Person VARCHAR(30) = 'Bob'
AS

IF (@SomeValue <> 25)
BEGIN
	THROW 50000, 'Value must be 25!', 1;
END

SELECT
	@Person AS Person,
	@SomeValue AS Value;

This version of the procedure is a lot more maintainable than the prior version.  Yes, this is a toy example, but when you’re dealing with a 4000-line monstrosity, eliminating a few hundred lines of obsolete SQL still makes a pretty big difference.

Given that, why do people hoard code?  It all comes down to risk aversion and fear of (or lack of) source control.  Hypothetically, that commented-out code could become useful again someday, so if you need it again, it’s better to keep it around.  Well, hypothetically, that newspaper from 1978 could come in handy again someday, but the likelihood of this being true is slim; “could come in handy” isn’t a good enough reason to hoard every copy of the newspaper going back 40 years, and it isn’t a good enough reason to keep commented code around.

But here’s the better point:  even if you know you need it again someday, you still don’t need to comment out your code.  Use source control!  I understand that there are plenty of organizations which don’t use source control for their SQL code—I worked at one where the DBA team pushed for this but the web dev team (which wrote a lot of these procedures) didn’t want to use source control because reasons.  These organizations are doing it wrong.  There are zero reasons why you should not use source control for database code.  If you believe there is a reason, either you are wrong or you are doing it wrong.  You don’t need to use Git or Mercurial (although they’re nice); even if you’re using SVN or TFVC, that does the job.  Every single source control tool has the ability to look back in history, so if you ever need that code again, you can get it.  If your company doesn’t use source control, you can always get Git or Mercurial and use that locally—that’s one benefit to a distributed version control system.

So, moving on from obsolete code within a stored procedure, let’s look at finding obsolete stored procedures.  This is a little bit harder of a problem, but we have some tools available to us to help out.  Within SQL Server, we can look at the plan and procedure caches as indicators of a procedure being in use.  These aren’t perfect, but if a procedure shows up on the list, at least we know it has been used recently enough to be in the procedure or plan cache.
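
As a rough first pass, a sketch like the following queries sys.dm_exec_procedure_stats to see which procedures have executed since their plans were cached.  This proves nothing by itself:  the cache is emptied by restarts, failovers, and memory pressure, so absence from this list does not mean a procedure is dead.

SELECT
	OBJECT_NAME(ps.object_id, ps.database_id) AS ProcedureName,
	ps.cached_time,
	ps.last_execution_time,
	ps.execution_count
FROM sys.dm_exec_procedure_stats ps
WHERE
	ps.database_id = DB_ID()  -- Limit to the current database.
ORDER BY
	ps.last_execution_time DESC;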

Aside from that, check out source control.  We use a tool called OpenGrok to search.  It provides a URL-based interface for searching within a code base, and you can look for specific stored procedure calls in your data layer.  Alternatively, you could try using a grep-like tool (e.g., egrep or PowerShell’s Select-String) to search within a directory of code files for that procedure call.  The same goes for Integration Services packages and Reporting Services reports—those are XML files, and you can search inside them for procedure calls.

If you’re looking at scheduled tasks, you can check SQL Agent jobs pretty easily:  msdb.dbo.sysjobs and msdb.dbo.sysjobsteps contain the relevant information, so a quick search against those tables will return your answer.  If you have a third-party scheduler, it probably has an interface for searches.
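
As a sketch, a search of Agent job steps for a procedure named SomeProcedure (a hypothetical name) might look like this:

SELECT
	j.name AS JobName,
	js.step_id AS StepID,
	js.step_name AS StepName
FROM msdb.dbo.sysjobs j
	INNER JOIN msdb.dbo.sysjobsteps js
		ON j.job_id = js.job_id
WHERE
	js.command LIKE N'%SomeProcedure%';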

Naturally, objects can call other objects.  In SQL Server, sys.sql_modules has a definition attribute which contains the full text of an object.  You can search through those definitions to see if your stored procedure call exists.  You may also have metadata tables which contain lists of stored procedures to call, so querying those will help as well.
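
A simple sketch of that search (again with a hypothetical procedure name) follows.  Bear in mind that the definition includes comments, so a match might be a live call or a commented-out one:

SELECT
	OBJECT_SCHEMA_NAME(sm.object_id) AS SchemaName,
	OBJECT_NAME(sm.object_id) AS ObjectName
FROM sys.sql_modules sm
WHERE
	sm.definition LIKE N'%SomeProcedure%';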

I’m being purposefully vague on this topic because I’m working on a tool to help people search their code bases for potentially obsolete code.  The problem is that you can never absolutely prove that a procedure is not in use, but given enough checks, you can reach a higher level of confidence and get rid of a lot of unused procedures.  In the example above, we dropped over 150 procedures, and I needed to restore just one:  a procedure which people occasionally run (manually) to fix data, but which hadn’t been run in a few months.  Don’t be afraid to research and drop unused procedures; it makes your life easier, and if you ever make a mistake, you can pull that procedure back out of source control.

Simplify Your Design

Yesterday’s post was all about simplifying individual procedures.  With today’s post, I’m going to bring it up one level and talk about inter-procedural design.  For this topic, think about how you might re-design a class in an object-oriented language.  In a class, you have private, protected, and public methods and members.  The public methods and members form a contract, and it’s harder to refactor that contract because you then need to check every bit of code which calls that class.  For private and protected methods and members (and the internal how-does-it-work bits of public methods), however, you have free rein to refactor code.  In the database world, our tables are akin to private members, stored procedures which are only called internally (i.e., by other stored procedures) are akin to private methods, and stored procedures which are called by external code are sort of like public methods.  With this in mind, I’m going to dive into the topic of refactoring an entire design.

To understand this topic better, I’m going to use as a case study a recent change I made to one table and its adjoining stored procedures.  I think the easiest way to describe the initial design starts with a picture, so here goes.  Note that I’ve removed some attributes and anonymized/modified it a little bit to protect the innocent:

(Diagram:  the original CostLog table and its adjoining stored procedures)

As a quick note, dotted lines represent stored procedures which update the underlying table.  Using my original metaphor, here’s the breakdown of elements in this original design:

  • CostLog — Table (private member)
  • Insert — Stored procedure (private method)
  • CreateWork — Stored procedure (public method)
  • GetForService — Stored procedure (public method) — this procedure gets called from within CreateWork, but there are also schedules which call this procedure directly.
  • Update — Stored procedure (public method)
  • GetRecentActivity — Stored procedure (public method)
  • Get — Stored procedure (private method)
  • GetByDate — Stored procedure (public method)
  • GetLastUpdate — Stored procedure (public method)

The reason I wanted to elaborate on this metaphor is to note that it’s relatively easy to change the “private method” style procedures, but a lot harder to change the “public” ones, as this would involve changing C# or website code somewhere.

Let me walk through the pain points in this design.  You can probably see some of them just from the diagram, but I want to enumerate them all.

  1. The table is named CostLog, but is actually used for two purposes.  In some sense, it actually is a log:  it acts as a storage table of record for cost processing information over time.  On the other hand, it is also a queue table, as new cost log records go in and get updated as the records go through different steps in the process.  This leads to naming confusion:  it is a log and it isn’t a log.  The inability to come up with a good name for an entity is a code smell that should tell us that the entity may be trying to do too much and we should look at splitting out individual concerns.
  2. All “Get” procedures should be idempotent, but GetForService is not.  An idempotent procedure is one whose outputs will always be the same so long as no external force changes the inputs.  The GetForService procedure updates rows in the CostLog table, and if you called GetForService twice, the update statement would actually preclude results from showing up in the second call.  (I’ll sketch this pattern just after this list.)
  3. The Get procedure is never used.  This is obsolete code we can eliminate.
  4. The Insert procedure gets called from the CreateWork procedure, but as a weird edge case when an external process is making the call and needs to re-queue work.  It would be nice to simplify the design and eliminate this procedure, or at least make it an independent procedure call.
  5. We have a “garbage collection” procedure which doesn’t really garbage collect.  It cancels specific queued records after a certain number of days.  This kind of design feels a bit hacky; instead of having an external service tell us when to stop attempting to process, the service code which has intelligence on the matter should know when to stop processing.
  6. There is a column called “GUID” on the table.  This is a very unhelpful name, as it describes the data type rather than the purpose of the attribute.
  7. There is another column called JDate, which is an integer value and represents a Julian date (of sorts).  I’m going to get into this in more detail in a later part, but I want to replace this with a Date type called ReportDate.  That makes the data easier to understand and replaces a four-byte integer which acts essentially as a surrogate key for dates with a three-byte, easily-readable date type.
  8. The SendAlerts attribute goes entirely unused, so it’s gone.
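
To make the idempotence problem in point 2 concrete, here is a heavily simplified, hypothetical sketch of the pattern.  The InProcess column and the procedure body are invented for illustration; the real GetForService is more involved, but it shares this shape:

CREATE PROCEDURE dbo.GetForService
AS
-- A "Get" which mutates state:  the UPDATE marks rows as in-process, so a
-- second call finds nothing left to return.  (InProcess is a made-up column.)
UPDATE cl
SET
	InProcess = 1
OUTPUT
	INSERTED.ProfileID,
	INSERTED.JDate
FROM dbo.CostLog cl
WHERE
	cl.InProcess = 0;

An idempotent version would split the read from the state change:  one procedure to select the eligible rows and another to mark them.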

So let’s take a moment to talk about how I got to these conclusions.  The misnamed table and attribute names are basic code smells.  If you can’t define the purpose of an entity or attribute based on its name, there is likely a problem with that entity or attribute.  In the case of my entity, the problem is that we’re trying to do too much:  we have a queue table and a log table jammed into one.  A simpler design would split these two tables out and provide interfaces (stored procedures) to interact with these tables.  Remember that “simpler” doesn’t always mean “fewer.”  In this case, a simpler design will end up with more database objects, but each database object will be clearer to understand.  As for the misnamed attribute, I learned after discussing it with the software engineers that the GUID attribute actually gets used as a unique filename for the cost report we create.  For this reason, I decided to rename GUID to FileName.

The SendAlerts and JDate attributes are also problematic on CostLog.  As I did research on the procedures calling this table, I noticed that none of them ever mention SendAlerts, meaning that we are setting a value on a table but never using it.  Attributes which serve no purpose should generally be eliminated, and this is no exception.  As for JDate, this is a case in which prolonged exposure breeds contempt.  This attribute exists in a number of tables I maintain and every time I need to know an actual date, I need to convert that value from an integer to a real date.  This is a minor but persistent pain point, and incremental re-design helps solve exactly those types of pains.

With these thoughts in mind, here is my new design:

(Diagram:  the new CostQueue and CostLog design)

Before I start, a quick color note:  yellow Get procedures read from the CostQueue table, blue Get procedures read from the CostLog table, and green Get procedures read from both tables.

This new model doesn’t look radically different at first glance, but there are several improvements that I was able to sneak in.  First of all, I split out the tables into CostQueue and CostLog.  The queue table tells us our current step in the process; the log table, meanwhile, tells us the status of this combination of profile, provider, and report date.  Speaking of that combination, I’ve marked those attributes in bold to note that they will form a unique constraint on both tables.  That may seem a little weird for a log table, but we aren’t concerned with storing data on how many times a process was run or the individual steps along the way; we only care about the final status of a particular cost run.

I made the other changes that I mentioned above in the design problems:  JDate and GUID are now ReportDate and FileName, respectively.  This makes records and result sets easier to understand and also cut down a little bit on processing to translate JDates back into real dates.  I also eliminated the SendAlerts attribute.
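
To give a feel for the shape of the result, here is a hypothetical DDL sketch of the two tables.  The data types and non-key columns are my assumptions rather than the production schema; the point is the shared unique key and the renamed attributes:

CREATE TABLE dbo.CostQueue
(
	ProfileID INT NOT NULL,
	ProviderID INT NOT NULL,
	ReportDate DATE NOT NULL,        -- Replaces the integer JDate.
	StepID TINYINT NOT NULL,         -- Current step in the process.
	FileName VARCHAR(100) NOT NULL,  -- Renamed from GUID.
	CONSTRAINT UKC_CostQueue UNIQUE (ProfileID, ProviderID, ReportDate)
);

CREATE TABLE dbo.CostLog
(
	ProfileID INT NOT NULL,
	ProviderID INT NOT NULL,
	ReportDate DATE NOT NULL,
	Status VARCHAR(20) NOT NULL,     -- Final status of this cost run.
	RecordCount INT NOT NULL,
	CONSTRAINT UKC_CostLog UNIQUE (ProfileID, ProviderID, ReportDate)
);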

Now let’s look at the procedure changes.  The first change is that all Get procedures are idempotent.  Instead of having two procedures (including a Get procedure) to create and update log records, I now have four procedures to handle the queue.  GetNewRecords reads from CostLog and CostQueue to figure out which records should go on the queue, but it does not insert any rows.  Instead, the CreateWork procedure inserts rows onto CostQueue.  Then, the scheduler code will call the GetWork procedure to retrieve work from the queue.  Finally, there is an Insert procedure in the event that some process needs to re-queue work automatically—which does happen from time to time, and which was handled as an edge case in the old CreateWork procedure.

From there, we move on to the C# service code.  That service code will get a set of work items and start working on them.  At each step in the process, the service code will call UpdateStep so the next bit of service code knows what is happening.  Each step in the service code also has the opportunity to merge results into CostLog, updating the Status and RecordCount attributes as necessary.  Once the final step in the process is complete, the service code will call Delete to delete the cost row from the queue and we’re done.  From there, website and monitor code can query the CostLog table like before, thereby limiting the scope of changes somewhat.

With these changes in place, cost processing becomes a lot simpler.  We can troubleshoot queue problems, easily re-queue data as necessary, and test the process much more easily.

Let’s wrap this section up with a higher-level discussion of how we get from point A to point B in terms of research.  For a project of any magnitude—even one which has one table and nine stored procedures—it helps to be organized.  If I had to do this over again, here’s what I would do to get organized:

  1. Get the list of affected tables, procedures, and processes.  At this point, I would just sketch notes in a text editor.
  2. Use draw.io or some other tool to generate a high-level diagram like the ones I have above.  Drawing these diagrams came late in the process and helped show me some flaws in my design.  Start with the current design.
  3. Create a design document for the existing design.  In this case, my design document started with a picture of the design process generated in step 2, followed by a breakdown of each table attribute.  From there, I moved on to each affected stored procedure.  For each procedure, I listed the input parameters, high-level actions, outputs (either descriptively or as an ordered list of attributes, depending upon the scenario), and processes which call that procedure.
  4. Start applying incremental design changes on paper.  I knew that I wanted to take care of some of these problems, and after each problem, I would go back to the document and think some more about it.  When I came into this design, I knew I wanted to change the JDate, GUID, and SendAlerts attributes on the table, as well as the GetForService stored procedure.  As I started working through those, I found a few other problems, so I wrapped those changes into my design.  Once I was done with that, I went back over the design again and found a couple more issues.
  5. Collaborate.  Even if you are the domain expert, it helps to get opinions from others related to the project.  In this case, most of my discussion happened with software engineers whose code consumed the information and who would assist in troubleshooting incoming tickets related to this process.  They helped me realize that “CostLog” was really two separate tables, which simplified operations significantly.
  6. Iterate, iterate, iterate.  Even when you think you’re done, you probably aren’t.  Keep iterating over that design.  One of the late changes I made to the design was to eliminate the “GC” procedure altogether, instead moving that requirement to the service code.  My design would have been fine either way, but I think this improves the design, making one process responsible for dequeueing records and determining when something is done, rather than having several overlapping processes.
  7. Simplify the code.  Note that in this design discussion, I didn’t spend much time talking about actual code.  That doesn’t mean you should ignore the current code as you change the structure.  Several of the procedures had parts in which I could refactor logic and make processing easier for everybody to understand.  Once you have a good idea of what the top-level design will look like, work out ways to simplify individual procedures within that design.  You can sculpt your higher-level design with these ideas in mind, but I recommend not getting too deep into individual procedures’ problems until you know that the code will still be relevant.  After all, why waste a few hours fixing up code that you’re going to deprecate or whose underlying table structure will be completely different?

This wraps up my section on simplifying code.  Next up, I’m going to talk about obsolete code.

It’s a conspiracy… unless that’s what they want you to think!

A few days ago, I remarked on the strangeness of Kyle Shanahan leaving Cleveland as OC. Well, now we know where he’s going: Atlanta, to be with new head coach Dan Quinn. ESPN’s Browns guy basically says that Shanahan knew Quinn was getting the Atlanta job and wanted a job with an unsucky QB. The hire can’t be official until after the Super Bowl, but even the NFL Network says it’s set in stone.

I don’t find the “Shanahan knew he had a better job lined up and started a bunch of shit to make leaving more okay” argument to be persuasive. From Shanahan’s perspective, calling this a “lateral move” is bullshit. He’s going from a team with no proven QB to one of the best young QBs in the game. He has stability and the time to install his offense exactly the way he wants it. Oh yeah, the Falcons are also in a much easier division than Cleveland.

For the Browns, I’m not sure what this means, because I’m not sure who will replace the old OC. The reasoning was that Manziel’s skill set is similar to RGIII’s, therefore Shanahan will make him into RGIII 2.0 (RGIV?). So, of course, you hand him Brian Hoyer. In retrospect, Shanahan was an odd choice if Hoyer was your Week 1 starter, but I think that the early success set the FO’s plans back a bit; I wonder if the original plan was to bring in Manziel after the bye week?

There are currently nine people being interviewed or in the mix to be interviewed for the Cleveland position. Let’s take a look at the possibilities:

Chan Gailey — Kevin can, doubtless, provide some insight here. In any case, he’s a guy who likes the spread. Manziel is not a spread QB, and more importantly, Cleveland doesn’t have the receiving talent to make a spread viable in the near future. Pass on him.

John DeFilippo — We don’t know what his scheme would be, but he did okay with Derek Carr. He’s got ties to Mike Pettine too. He’s never been a coordinator at any level before; I’d like him for a QB coach (a position which is also vacant) but a rookie OC with a rookie QB strikes me as a bad combination. Put him on the list, hire him as QB coach if he’d take it, but look elsewhere if possible.

Matt Cavanaugh — Also has ties to Pettine. According to clevelandbrowns.com, Cavanaugh favors a run-heavy west coast style offense, which I think would be perfect for Manziel. It’s been a while since he’s called plays, but that type of offense doesn’t require loads of creativity. Merits very serious consideration, perhaps the best possible choice.

Bill Callahan — Another conservative guy, but one who prefers zone blocking schemes (of the type that Shanahan and his father like). Not as good a fit as Cavanaugh simply because his greatest strength is developing the offensive line, and Cleveland’s line is already pretty awesome. Cavanaugh is considered better with QBs, Cleveland’s major area of concern. On the list.

Scott Linehan — A very intriguing possibility, with experience developing QBs and wide receivers. His scheme is considered “QB friendly.” His offenses have been all over the place, however. He oversaw Daunte Culpepper and Marc Bulger in his early career; sometimes they were awesome, sometimes they weren’t. He deserves a lot of credit for developing Matt Stafford, though. A lot depends on the QB he ends up with, but the inconsistency worries me. On the list.

Charlie Weis — He was the OC for the New England Patriots and loves trick plays and schemes. He undoubtedly deserves credit for developing Tom Brady, but that was a long time ago, and he hasn’t worked with young or rookie QBs since. I’m also concerned about Manziel’s ability to learn a complex scheme. Still, he’s got to be hungry to prove that Brady isn’t a fluke. Cleveland would be the ultimate challenge. On the list.

Al Saunders — The nightmare scenario as OC. He loves a wildly complicated, very detailed offensive scheme, which is the worst possible scheme for Manziel, a guy who isn’t big into prep work to begin with. If I had a rookie QB (and the Browns essentially do) this is the last guy I’d choose. Off the list.

Marc Trestman — He knew Bernie Kosar! In all seriousness, I like him more than some others. He’s a shotgun guy — which, again, the Browns don’t have the receivers for — and is used to dealing with a mobile, highly temperamental QB. The difficulty is that his record is insanely spotty. He’s been good, but not great, and hasn’t stuck in the NFL for very long. I think there’s some potential here, but Trestman’s already an older guy and I’ve got a feeling that he won’t want the Browns job. On the list, but towards the bottom.

Anthony Lynn — He’s a running back guy, which sets him apart from all of the other guys on this list. He’s also the second youngest, but has no play calling experience. If you’re going to build a power run team, something the Browns seem intent on, he’s definitely a good candidate. He’d need a good QB coach to work with the QB, which would make a DeFilippo pairing very enticing indeed. On the list.

Here’s how I’d rank “the list.”

1. Cavanaugh
2. Callahan
3. Lynn (if paired with a quality QB coach)
4. Weis
5. Linehan
6. Trestman
7. DeFilippo

Of course, this list is predicated on one central truth: that Manziel is a significant part of the Cleveland QB future. I think he is, at least for one more season. If for some reason he isn’t, any of these guys could be in play. I will be most interested to see who gets hired.

Simplify!

When working with legacy code, the first and safest step (after testing and automating tests) is to format your code to make it readable.  With some procedures—especially the basic CRUD procedures—that will be enough right there and you can pat yourself on the back for a job well done.  But let’s suppose that we have some more complex procedures.  Our next step will be to simplify code.  I’m going to steal a bit from the Zen of Python:

Simple is better than complex.
Complex is better than complicated.

I love this couplet because it is both true and meaningful.  Most of the code I deal with on a daily basis is complicated:  cursors and WHILE loops generate and execute dynamic SQL; chains of stored procedures use temp tables to hold state; unrepeatable (and thus untestable) processes kick off other processes which kick off other processes which set up other, independent processes.  Some of this code was intentionally designed, but a lot of it evolved as business requirements changed over time.  Some developer made a particular decision based on certain business requirements and conditions at one point in time.  A later developer made a decision based on different business requirements and conditions at a later time.  A third developer created some plumbing code to connect the first and second developers’ work together, and a fourth developer changed bits and pieces of it.  Add all of this up and you have a system which nobody understands in its entirety, especially if developers 1-4 are gone and you’re stuck maintaining their work.

But we have an ace in the hole here:  tests and re-formatting.  Building out tests gives us a chance to make implementation changes without breaking our contract:  at the beginning of this process, we accept inputs { X, Y, Z }.  At the end of the process, we generate outputs { A, B, C }.  The important part about these types of tests is that we do not make any assertions about the internal details, about how we get from { X, Y, Z } to { A, B, C }.  This gives us flexibility to re-design.  Similarly, re-formatting procedures helps give us a better understanding of what exactly each procedure in the chain is doing, why it exists, and how it works.  During this time, you probably will have built some unit and integration tests around the individual procedures in the chain—the implementation details—but that’s okay.  In the end, we might get rid of some of these tests, but they help us understand what the code is doing in the meantime.
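
As a minimal sketch of such a black-box test (the procedure name, parameters, and expected values here are all hypothetical), we assert only on the contract:

CREATE TABLE #Expected
(
	Person VARCHAR(30),
	Value INT
);

CREATE TABLE #Actual
(
	Person VARCHAR(30),
	Value INT
);

INSERT INTO #Expected (Person, Value) VALUES ('Bob', 25);

-- Capture the procedure's outputs without asserting anything about how
-- they were produced.
INSERT INTO #Actual (Person, Value)
EXEC dbo.SomeProcedure @SomeValue = 25, @Person = 'Bob';

-- If the two result sets differ in either direction, the contract is broken.
IF EXISTS (SELECT Person, Value FROM #Expected EXCEPT SELECT Person, Value FROM #Actual)
	OR EXISTS (SELECT Person, Value FROM #Actual EXCEPT SELECT Person, Value FROM #Expected)
BEGIN
	THROW 50000, 'Output contract violated.', 1;
END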

If you have tests in place and well-formatted code, you can start working your way from complicated to complex to simple.  The best way to do this is design.  I spent some time thinking about how to do design in an agile world, so that series might be of interest at this point.  Regardless, your prior work has already helped you in the quest for simplification, as you should have pretty solid requirements and you can spend time thinking about how best to refactor your code.

Because simplification is a complex topic, I’m going to break this discussion out into two parts.  The first part of this discussion will be simplifying code within a procedure, and the second part will be simplifying an overall design which covers a number of procedures.

Looking at a single procedure, there are a few things we can do to simplify code.  Here are the things I will cover in this section:

  • Using the APPLY operator to simplify calculations
  • Using static SQL unless dynamic SQL is necessary
  • Turning cursors and WHILE loops into set-based statements
  • Simplifying filters
  • Simplifying overall procedure design

Simplify Calculations Using The APPLY Operator

One of my favorite uses of the APPLY operator is simplifying calculations.  Let’s walk through a simplistic but complicated business example and try to turn it into something simple.

Let’s say that we have a fact table with a few measures and we need to calculate a few other measures from our base results.  Our Fact table has the following measures:  ReportDate, Clicks, OrderCount, OrderRevenue, ProductCost, and Cost.  Over a specific ReportDate range, we want to calculate ConversionRate (OrderCount / Clicks), NetMargin (OrderRevenue – [ProductCost + Cost]), and ReturnOnAdSpend (OrderRevenue / Cost).

This sounds pretty simple, but I’m going to throw a wrench into the plans:  first of all, Cost and Clicks can sometimes be 0, meaning that our ConversionRate and ReturnOnAdSpend calculations need safeguards against dividing by 0.  Second, ConversionRate can never be above 100%, so we need a hard limit there.  Third, let’s say that our fact table can have NULL values—it’s a terrible thing, but not as uncommon as I’d like it to be.  Fourth, Cost and Clicks don’t really exist—we have to use ClickSource1 and ClickSource2 as well as CostSource1 and CostSource2 based on whether a bit flag UseSource1 is true or false.  Again, this is bad fact design, but it’s something we often need to deal with.

So let’s gin up some sample data and get cracking:

CREATE TABLE #Fact
(
	ReportDate INT NOT NULL,
	UseSource1 BIT NOT NULL,
	ClickSource1 INT NULL,
	ClickSource2 INT NULL,
	CostSource1 DECIMAL(23, 4) NULL,
	CostSource2 DECIMAL(23, 4) NULL,
	OrderCount INT NULL,
	OrderRevenue DECIMAL(23, 4) NULL,
	ProductCost DECIMAL(23, 4) NULL
);

INSERT INTO #Fact
(
	ReportDate,
	UseSource1,
	ClickSource1,
	ClickSource2,
	CostSource1,
	CostSource2,
	OrderCount,
	OrderRevenue,
	ProductCost
)
VALUES
(20150101, 1, 25, NULL, 285.86, NULL, 18, 1349.56, 187.39),
(20150102, 0, 25, 6, 285.86, 8.36, 3, 98.72, 75.14),
(20150103, 1, 16, NULL, 28.38, NULL, 1, 9.99, 5.42),
(20150104, 1, NULL, NULL, NULL, NULL, NULL, NULL, NULL),
(20150105, 0, NULL, NULL, NULL, NULL, NULL, NULL, NULL),
(20150106, 1, 108, NULL, 39.80, NULL, 12, 2475.02, 918.60),
(20150107, 0, NULL, 85, NULL, 85.00, 67, 428.77, 206.13);

Note that I have bad data of different sorts in this example, including irrelevant data (like in 20150102 for ClickSource1 and CostSource1) and missing data (20150104 and 20150105).

So now let’s get our query written.  Note that I want to roll up data across all report dates in the range 2015-01-01 through 2015-01-07, so I need aggregations here.  Here is the first go of this result set, which is well-formatted but complicated.

SELECT
	SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.ClickSource1, 0) ELSE ISNULL(f.ClickSource2, 0) END) AS Clicks,
	SUM(ISNULL(f.OrderCount, 0)) AS OrderCount,
	SUM(ISNULL(f.OrderRevenue, 0)) AS OrderRevenue,
	SUM(ISNULL(f.ProductCost, 0)) AS ProductCost,
	SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.CostSource1, 0) ELSE ISNULL(f.CostSource2, 0) END) AS Cost,
	CAST
	(
		CASE
			WHEN SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.ClickSource1, 0) ELSE ISNULL(f.ClickSource2, 0) END) = 0 THEN 0.0
			WHEN SUM(ISNULL(f.OrderCount, 0)) = 0 THEN 0.0
			WHEN SUM(ISNULL(f.OrderCount, 0)) > SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.ClickSource1, 0) ELSE ISNULL(f.ClickSource2, 0) END) THEN 1.0
			ELSE 1.0 * SUM(ISNULL(f.OrderCount, 0)) / SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.ClickSource1, 0) ELSE ISNULL(f.ClickSource2, 0) END)
		END AS DECIMAL(19, 4)
	) AS ConversionRate,
	SUM(ISNULL(f.OrderRevenue, 0) - (ISNULL(f.ProductCost, 0) + CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.CostSource1, 0) ELSE ISNULL(f.CostSource2, 0) END)) AS NetMargin,
	CASE
		WHEN SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.CostSource1, 0) ELSE ISNULL(f.CostSource2, 0) END) = 0 THEN
			CASE
				WHEN SUM(ISNULL(f.OrderRevenue, 0)) = 0 THEN 0.0
				ELSE 1.0
			END
		ELSE SUM(ISNULL(f.OrderRevenue, 0)) / SUM(CASE WHEN f.UseSource1 = 1 THEN ISNULL(f.CostSource1, 0) ELSE ISNULL(f.CostSource2, 0) END)
	END AS ReturnOnAdSpend
FROM dbo.#Fact f
WHERE
	f.ReportDate BETWEEN 20150101 AND 20150107;

This is not terrible, but I would consider it to be complicated. Let’s look at how the APPLY operator can simplify things. First of all, I want to migrate all of those ISNULL operations into a separate section and use aliases for ISNULL. It’s a small change but I think it really helps readability. Note that for this particular example, we don’t really benefit from having ISNULL around (because our aggregations already eliminate NULL values), but shifting ISNULL statements is something which I really want to show, given how often it appears in the real world.  Next up, I want to move the Clicks and Cost calculations out into their own section. That way, we can use aliased versions of Clicks and Cost to make our code a bit easier.

Here’s what we end up with as the simplified version:

SELECT
	SUM(us1.Clicks) AS Clicks,
	SUM(nonull.OrderCount) AS OrderCount,
	SUM(nonull.OrderRevenue) AS OrderRevenue,
	SUM(nonull.ProductCost) AS ProductCost,
	SUM(us1.Cost) AS Cost,
	CAST
	(
		CASE
			WHEN SUM(us1.Clicks) = 0 THEN 0.0
			WHEN SUM(nonull.OrderCount) = 0 THEN 0.0
			WHEN SUM(nonull.OrderCount) > SUM(us1.Clicks) THEN 1.0
			ELSE 1.0 * SUM(nonull.OrderCount) / SUM(us1.Clicks)
		END AS DECIMAL(19, 4)
	) AS ConversionRate,
	SUM(nonull.OrderRevenue - (nonull.ProductCost + us1.Cost)) AS NetMargin,
	CASE
		WHEN SUM(us1.Cost) = 0 THEN
			CASE
				WHEN SUM(nonull.OrderRevenue) = 0 THEN 0.0
				ELSE 1.0
			END
		ELSE SUM(nonull.OrderRevenue) / SUM(us1.Cost)
	END AS ReturnOnAdSpend
FROM dbo.#Fact f
	CROSS APPLY
	(
		SELECT
			ISNULL(ClickSource1, 0) AS ClickSource1,
			ISNULL(ClickSource2, 0) AS ClickSource2,
			ISNULL(CostSource1, 0) AS CostSource1,
			ISNULL(CostSource2, 0) AS CostSource2,
			ISNULL(OrderCount, 0) AS OrderCount,
			ISNULL(OrderRevenue, 0) AS OrderRevenue,
			ISNULL(ProductCost, 0) AS ProductCost
	) nonull
	OUTER APPLY
	(
		SELECT
			CASE WHEN f.UseSource1 = 1 THEN nonull.ClickSource1 ELSE nonull.ClickSource2 END AS Clicks,
			CASE WHEN f.UseSource1 = 1 THEN nonull.CostSource1 ELSE nonull.CostSource2 END AS Cost
	) us1
WHERE
	f.ReportDate BETWEEN 20150101 AND 20150107;

This new version has a few more lines of code, but we moved a lot of the “pink and blue” (functions and CASE statement cruft) to two APPLY operators, letting us focus more on the actual attributes in our SELECT statement.  The ConversionRate and ReturnOnAdSpend calculations are still a few lines, but most of that complexity is covering edge cases and is necessary, and with the simplified versions of Clicks and Cost in place, they’re a lot easier to see and understand immediately.

Note that the execution plans for these two queries are almost exactly the same. We haven’t affected our query results negatively, but we did make the code a lot more understandable.

(Screenshot:  execution plans for the two queries, nearly identical)

This type of operation is easy, gains a lot, and has extremely low risk of failure, especially if you have good tests around the procedure.

Static SQL Over Dynamic SQL

There are times when dynamic SQL is necessary.  Most commonly, these are scenarios in which you do not know in advance what SQL statement should be generated or you want to perform some action such as dynamic WHERE clauses, column selection, or sorting.  My recommendation here would be to make sure that you actually need the statement to be in dynamic SQL.  Too many times, I’ve seen a simple procedure wrapped in dynamic SQL for no apparent reason.  There are a few problems with this.  First, dynamic SQL is harder to read due to the lack of syntax highlighting.  Second, dynamic SQL does not get syntax checked like static SQL, so it’s easier for odd bugs and typos to sneak in.  Third, formatting dynamic SQL forces you to mess up your code in one of two ways:  either you mess up the static SQL formatting to get the dynamic SQL to look okay, or you mess up the dynamic SQL formatting to get the static SQL to look okay.  Here is a simple example of the former:

DECLARE
	@SomeVar INT = 6; -- input parameter

BEGIN TRY

	DECLARE
		@sql NVARCHAR(MAX) = N'
SELECT
	SomeColumn,
	SomeOtherColumn
FROM SomeTable
WHERE
	1 = 1' + CASE WHEN @SomeVar = 8 THEN N'
	AND Something = 38' ELSE N'' END + N';';

	EXEC (@sql);
END TRY
BEGIN CATCH
	PRINT 'Error!';
END CATCH

In this case, my dynamic SQL is flush left even though the static SQL is indented one to two tabstops. This makes the code harder to read, especially when I could re-write the procedure using static SQL as:

DECLARE
	@SomeVar INT = 6; -- input parameter

BEGIN TRY
	SELECT
		SomeColumn,
		SomeOtherColumn
	FROM SomeTable
	WHERE
		Something = CASE WHEN @SomeVar = 8 THEN 38 ELSE Something END;
END TRY
BEGIN CATCH
	PRINT 'Error!';
END CATCH

This latter code is easier to read. As I mentioned, there are good cases for using dynamic SQL, but if this isn’t a good case, simplify the procedure and remove the dynamic SQL.
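
For contrast, here is a sketch of a case where dynamic SQL earns its keep:  a caller-chosen sort column.  The table and column names are hypothetical; the important parts are the whitelist check and QUOTENAME, which keep the dynamic portion safe:

DECLARE
	@SortColumn SYSNAME = N'SomeColumn', -- input parameter
	@sql NVARCHAR(MAX);

-- Whitelist the allowed values before splicing anything into the statement.
IF (@SortColumn NOT IN (N'SomeColumn', N'SomeOtherColumn'))
BEGIN
	THROW 50000, 'Invalid sort column.', 1;
END

SET @sql = N'
SELECT
	SomeColumn,
	SomeOtherColumn
FROM SomeTable
ORDER BY ' + QUOTENAME(@SortColumn) + N';';

EXEC sp_executesql @sql;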

Turn Cursors And WHILE Loops Into Set-Based Operations

I’m going to keep this short because I think my target audience already gets this idea.  Don’t use cursors or WHILE loops unless you need them.  They’re terrible for performance and hard to comprehend quickly.  I’d rather have a nice set-based statement with a tally table than one or two WHILE loops or cursors.  These constructs also require you to think about a lot of temporary variables and how they interact.  Get rid of those loops and you can simplify your thought processes.
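
As a toy sketch of the difference, here is a loop filling a numbers table one row at a time, followed by a tally-based version which does the same thing in a single set-based statement:

CREATE TABLE #Numbers
(
	N INT NOT NULL
);

-- The loop version:  1,000 separate INSERT statements.
DECLARE @i INT = 1;
WHILE (@i <= 1000)
BEGIN
	INSERT INTO #Numbers (N) VALUES (@i);
	SET @i = @i + 1;
END

TRUNCATE TABLE #Numbers;

-- The set-based version:  one INSERT driven by a small tally.
WITH t AS
(
	SELECT x FROM (VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)) v(x)
),
tally AS
(
	SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
	FROM t t1
		CROSS JOIN t t2
		CROSS JOIN t t3
)
INSERT INTO #Numbers
(
	N
)
SELECT
	tally.N
FROM tally;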

Simplify Filters

This one may seem a bit silly, but don’t write it off.  Code bases are full of “whoopsies” like the following.  In this case, it’s ripped straight from the headlines of my code base.  This table has a create date, and we want to get the number of records for a specific profile over the course of a given hour.  This is a slightly-anonymized version of the original code:

SELECT
	SomeTable.ProfileID,
	CASE
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 0 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 00:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 1 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 01:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 2 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 02:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 3 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 03:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 4 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 04:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 5 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 05:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 6 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 06:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 7 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 07:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 8 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 08:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 9 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 09:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 10 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 10:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 11 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 11:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 12 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 12:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 13 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 13:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 14 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 14:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 15 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 15:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 16 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 16:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 17 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 17:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 18 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 18:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 19 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 19:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 20 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 20:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 21 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 21:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 22 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 22:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 23 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 23:00:00' AS DATETIME)
	END AS CreateDateLTz,
	COUNT(1) AS RecordCount
FROM dbo.SomeTable
GROUP BY
	SomeTable.ProfileID,
	CASE
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 0 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 00:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 1 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 01:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 2 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 02:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 3 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 03:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 4 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 04:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 5 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 05:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 6 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 06:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 7 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 07:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 8 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 08:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 9 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 09:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 10 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 10:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 11 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 11:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 12 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 12:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 13 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 13:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 14 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 14:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 15 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 15:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 16 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 16:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 17 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 17:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 18 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 18:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 19 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 19:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 20 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 20:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 21 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 21:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 22 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 22:00:00' AS DATETIME)
		WHEN DATEPART(hh, SomeTable.CreateDateLTz) = 23 THEN CAST(CONVERT(CHAR(10), SomeTable.CreateDateLTz, 101) + ' 23:00:00' AS DATETIME)
	END;

This is long and replicates code.  It is complicated when it should be simple, so let’s make it simple:

SELECT
	SomeTable.ProfileID,
	DATEADD(HOUR, DATEDIFF(HOUR, 0, SomeTable.CreateDateLTz), 0) AS CreateDateLTzRoundedToHour,
	COUNT(1) AS RecordCount
FROM dbo.SomeTable
GROUP BY
	SomeTable.ProfileID,
	DATEADD(HOUR, DATEDIFF(HOUR, 0, SomeTable.CreateDateLTz), 0);

The combination of DATEADD and DATEDIFF lets me turn 52 lines of code into 2.  It’s arguably slightly less readable for a complete neophyte, but a quick one-line comment or an appropriate name (like my CreateDateLTzRoundedToHour) makes the result understandable.  It also makes the query more maintainable, as I’m less likely to make a mistake modifying this eight-line query than a sixty-line query with 48 duplicated statements.  It also has a nice aesthetic feel, getting rid of a wall of code.
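
If the DATEADD/DATEDIFF idiom is new to you, a quick test shows what it does:  DATEDIFF counts whole hours between the zero date (1900-01-01) and the value, and DATEADD adds those hours back, truncating the minutes and seconds:

SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, '2015-01-06 14:37:22'), 0) AS RoundedToHour;
-- Returns 2015-01-06 14:00:00.000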

Look for these types of things in your SELECT, WHERE, HAVING, GROUP BY, and ORDER BY clauses.  Also look for duplicated WHERE clause values and logically equivalent values.

Simplify Overall Procedure Design

This last section is a quick grab bag of procedure design tips that I’ve come up with over the years.  I recommend taking a considered approach toward each stored procedure you work on, trying to find simpler ways of getting the same outputs from a given set of inputs.

First, try to eliminate unnecessary temp tables.  Sometimes, temp tables will be introduced for performance reasons to avoid lazy spooling, spillovers on sorts due to bad estimates, etc.  But if your procedure has a dozen temp tables and you have temp tables loading temp tables which load other temp tables, I might suggest that these probably aren’t the solution to your performance problems, but may instead be the cause of performance problems.  Sometimes, people create temp tables because they’re thinking procedurally—first, the data need to be shaped this way, then this way, then this way, and finally this way.  They create temp tables for each step along the way and end up with a final result.  See if you can write a single query to get from beginning to end, and make sure that it performs well.  You may still need some temp tables in some cases, but make those decisions based on performance rather than mental workflow.
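
As a tiny, hypothetical illustration (Orders is an invented table), the staged version and the single-statement version below produce the same result; the second just skips the mental-workflow scaffolding:

-- Procedural style:  stage into a temp table, then aggregate.
SELECT
	o.CustomerID,
	o.OrderTotal
INTO #RecentOrders
FROM dbo.Orders o
WHERE
	o.OrderDate >= '2015-01-01';

SELECT
	r.CustomerID,
	SUM(r.OrderTotal) AS Total
FROM #RecentOrders r
GROUP BY
	r.CustomerID;

-- Set-based style:  the same result in one statement, no staging.
SELECT
	o.CustomerID,
	SUM(o.OrderTotal) AS Total
FROM dbo.Orders o
WHERE
	o.OrderDate >= '2015-01-01'
GROUP BY
	o.CustomerID;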

Second, eliminate unnecessary code.  You have a source control system (right?), so don’t leave that commented-out T-SQL from 8 years ago in your code.  Commented-out code does absolutely nothing for SQL Server and nothing for the programmer.

Third, make sure your variables and aliases are clear.  @Median is much clearer in intent than @m.  If your table is named StockOption, “so” makes for a better alias than “t8.”  Similarly, make sure your attribute names are clear and accurate.  One example which has bitten me in the past was an attribute named “Conversions” which actually listed conversion rates.  The number of conversions and the percentage rate of making a conversion are two totally different things and mis-naming can lead to confusion.  Finally, make sure your ORDER BY statement uses actual attribute or alias names; don’t use ordinals.
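
To put the naming and ordinal advice together in one small (hypothetical) example:

SELECT
	so.StockOptionID,
	so.StrikePrice
FROM dbo.StockOption so
ORDER BY
	so.StrikePrice; -- Not ORDER BY 2:  ordinals silently break when the column list changes.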

Fourth, take advantage of the language.  Let’s take an example in which we need to get a list of records fitting a specific condition, update those records, and then pass that list of records off to the next step in the procedure.  Most of the time, people will create a temp table like so:

BEGIN TRANSACTION

CREATE TABLE #Fact
(
	Id INT,
	SomeVal CHAR(1)
);

INSERT INTO #Fact
(
	Id,
	SomeVal
)
VALUES
(0, ''),
(1, 'A'),
(2, 'B'),
(3, 'A'),
(4, '');

CREATE TABLE #tmp
(
	Id INT
);

INSERT INTO #tmp
(
	Id
)
SELECT
	Id
FROM #Fact
WHERE
	SomeVal = '';

UPDATE f
SET
	SomeVal = 'C'
FROM #Fact f
	INNER JOIN #tmp t
		ON f.Id = t.Id;

SELECT
	Id
FROM #tmp;

ROLLBACK TRANSACTION

Instead of using a temp table to update and then select values, I can use the OUTPUT clause to simplify this process.

BEGIN TRANSACTION

CREATE TABLE #Fact
(
	Id INT,
	SomeVal CHAR(1)
);

INSERT INTO #Fact
(
	Id,
	SomeVal
)
VALUES
(0, ''),
(1, 'A'),
(2, 'B'),
(3, 'A'),
(4, '');

UPDATE f
SET
	SomeVal = 'C'
OUTPUT
	DELETED.Id
FROM #Fact f
WHERE
	f.SomeVal = '';

ROLLBACK TRANSACTION

Instead of creating a temp table, populating that temp table, updating the base table, and selecting from the temp table, I have a single update statement. What’s nice is that in addition to being more concise and elegant, the code is also faster:

(Screenshot:  execution plan comparison of the temp table and OUTPUT clause versions)

I ran both versions of code against one another, first the old version and then the new.  The old version is responsible for 65% of the estimated cost, and only looking at the parts I changed, the new version is more than twice as efficient in terms of I/O and much better in terms of CPU.  This is a case in which simpler code is also more efficient, making it a win-win for you.

Summary

In this blog post, I covered a number of techniques we can use to simplify code, looking at several use cases based on actual code I’ve dealt with.  There are plenty of other techniques you can use as well, but hopefully this gives you some ideas on how to improve your own code base.

Stay tuned for tomorrow’s post, in which I look into how we can simplify entire processes and not just a single procedure.

Format Your SQL

The last couple of days, I’ve spent a lot of time talking about testing (and more testing).  The reason I spent so much time talking about testing when working with legacy SQL is that you probably don’t know exactly what your code base does.  Even if you wrote most (or all) of the SQL in your code base, there’s a lot that the you of three years ago knew that the you of today probably doesn’t:  business requirements, edge cases, and why you took one route over another.  Adding test cases helps you capture some of that information and more importantly, gives you protection when it comes time to take care of the next steps on the list.

The next step on my “working effectively with legacy SQL” list is formatting your SQL.  This is the lowest-risk form of refactoring, as you aren’t making any logical code changes.  The end effect, however, is that your code becomes a lot more readable, and readable code is (potentially) understandable code.  Let me give you an example of what I consider unreadable code, and break down why:

SELECT
	ISNULL(AdvertisementID, 0) AdvertisementID, JDate,krc.ProfileID,ISNULL(krc.SearchProviderID, 0) SearchProviderID,ISNULL(NULLIF(STCategoryID, '-1'), '0')
STCategoryID,
				krc.EventTypeID,
	CAST(JDate AS DATETIME) CreateDate,
	DATEPART(MONTH, CAST(JDate AS DATETIME)) CreateMonth,
DATEPART(YEAR, CAST(JDate AS DATETIME)) CreateYear, Keyword, NULLIF(PageID, 0) PageID,
	CASE
		WHEN AdvertisementID > 0 THEN 1
		ELSE 0 END Paid,
	CAST(MAX(ISNULL(s.SearchProviderID, 0)) AS BIT) AllowsCampaigns,
	SUM
	(ISNULL(Conversions
	, 0)) AS Orders,  SUM(ISNULL
	(Hits, 0)) PixelHits,
		SUM(ISNULL(Clicks, 0)) Clicks,
	SUM(
		ISNULL(Impressions,
0)) AS Impressions,
	CAST(CASE SUM(ISNULL(Impressions, 0)) WHEN 0 THEN SUM(ISNULL(WeightedRank, 0)) ELSE SUM(ISNULL(WeightedRank, 0) * ISNULL(Impressions, 0)) / SUM(ISNULL(Impressions, 0)) END AS MONEY) WeightedRank,
	SUM(CASE WHEN AdvertisementID > 0 THEN 1 ELSE 0 END) AS AdCount,
	BINARY_CHECKSUM(krc.ProfileID, ISNULL(AdvertisementID, 0), ISNULL(NULLIF(STCategoryID, '-1'), '0')) AS CheckSumID,
	RateAdjustmentCustomRuleID,
	SUM(ISNULL(EstimatedCost, 0)) AS EstimatedCost,
	SUM(ISNULL(EstimatedClicks, 0)) AS EstimatedClicks,
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 3 
	THEN krc.Revenue ELSE 0 END) AS [OrderRevenue],
	SUM(
	CASE WHEN 
	ISNULL(et.ParentEventTypeID, 0) = 3 
	THEN 
	krc.Conversions ELSE 0 
	END) AS [OrderCount],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 4 
				THEN 
		krc.Revenue 
				ELSE 0 
			END)
	[EventValue],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 4 THEN krc.Conversions ELSE 0 END) AS [EventCount]
FROM SomeTable krc INNER JOIN SomeOtherTable c2a ON krc.ProfileID = c2a.ProfileID AND c2a.EligibleForGC = 0
	LEFT OUTER JOIN
	(SELECT s.ProfileID, s.SearchProvider, s.SearchProviderID
		FROM SomeThirdTable c
			CROSS APPLY SomeFunction(c.ProfileID) s
		WHERE
			s.ProfileID = c.ProfileID AND NOT (s.SearchProviderID = 62 OR s.ChannelID = 3 OR (s.SearchProviderID = 2 AND c.ProfileID IN (SELECT pa.ProfileID FROM SomeFourthTable pa
			WHERE
							pa.SearchProviderID = 2
							AND pa.IsActive = 1 AND pa.Version <> 'EWS' AND s.ProfileID = pa.ProfileID
					)
				)
			)
	) s
		ON s.SearchProviderID = krc.SearchProviderID
		AND s.ProfileID = krc.ProfileID
	LEFT OUTER JOIN SomeFifthTable et
		ON et.EventTypeID = krc.EventTypeID
		AND ISNULL(NULLIF(et.ProfileID, 0), krc.ProfileID) = krc.ProfileID
GROUP BY
    AdvertisementID,
    JDate,
    krc.ProfileID,
    ISNULL(krc.SearchProviderID, 0),
    STCategoryID,
    krc.EventTypeID,
    CAST(JDate AS DATETIME),
    DATEPART(MONTH, CAST(JDate AS DATETIME)),
    DATEPART(YEAR, CAST(JDate AS DATETIME)),
    Keyword,
    NULLIF(PageID, 0),
    CASE
    	WHEN AdvertisementID > 0 THEN 1
    	ELSE 0
    END,
    RateAdjustmentCustomRuleID;

This is a representative example of bad formatting, and although your production stored procedures may not look (quite) this bad, I’m willing to bet that this isn’t too far off for a lot of systems.  So let’s enumerate the problems, from top to bottom:

  1. Multiple attributes on one row.  Row number 2 starts off with AdvertisementID and then moves to four other attributes before continuing on row number 3.  Having multiple attributes on a single line makes it easier to conflate attributes, especially when combined with:
  2. Aliases not delimited.  There are two ways in SQL to delimit an alias:  [Alias] = [column details], or [column details] AS [Alias].  Pick one and stick with it throughout your code base.  I personally like the latter, but I know there are people who like the former.  In any event, we see “ISNULL(AdvertisementID, 0) AdvertisementID” and that makes it very easy to mistake this for two separate attributes.  Combined with problem #1, it also makes it easy to accidentally confuse a column with an alias.  If you’re re-writing a query, you might drop a comma and turn “ColumnA, ColumnB” into “ColumnA ColumnB” without noticing that ColumnB is now an alias for ColumnA rather than a second column (see the sketch after this list).
  3. Attributes not qualified with a table alias.  Neither AdvertisementID nor JDate specifies the table it comes from.  This makes troubleshooting and extending the query more difficult:  if I add a join to another table which also has a JDate or AdvertisementID column, I will get an ambiguity error, and if I want to remove a join from this query, I might not know exactly what ramifications that would have.
  4. Inconsistent spacing.  Note that our first three lines of attributes have three different tabstop levels.  It’s hard when you’re reading a long procedure to tell exactly where you’re at if your tabs are inconsistent.  Conversely, if you keep consistent tab spacing, it’s easy to tell if you’ve accidentally jumped out of an IF block.
  5. Inconsistent parentheses and spacing for operations.  Looking at lines 12-14, it’s hard to tell that this is just one operation, that the next operation spans lines 14-15, that Clicks sits alone on line 16, and that Impressions sprawls across lines 17-19.  Look at lines 34-39 for a special blend of problems.
  6. The LEFT OUTER JOIN statement looks terrible.  First of all, we have parentheses which don’t line up, making it hard to tell that there are two correlated sub-queries in our operation.  When you have several parentheses on the same line like this, that’s a pretty bad code smell.  Aside from that, we have multiple attributes on line 43, multiple filters on line 50, and inconsistent indentation levels.
  7. You might not be able to tell from this code example, but the GROUP BY clause uses spaces to indent, whereas the rest of the query uses tabs.  This type of inconsistency may seem minor, but because different tools use different space equivalencies for tabstops (e.g., 4 spaces, 5 spaces, 8 spaces), your code might not look right in every tool if you mix tabs and spaces.
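
To make problem #2 concrete, here is a minimal sketch of the dropped-comma trap, using hypothetical table and column names:

SELECT
	ColumnA,
	ColumnB
FROM dbo.SomeTable;

--The comma between the two columns is gone, but this still parses:
--it returns a single column named ColumnB containing ColumnA's values.
SELECT
	ColumnA
	ColumnB
FROM dbo.SomeTable;

Delimiting every alias with AS makes the second query read as obviously wrong, so a reviewer (or a failing test case) is far more likely to catch the mistake before it ships.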

So that’s a pretty nasty set of procedure code.  Let me show you a different version of the same code, with the only difference being that I fix the seven problems above.

SELECT
	ISNULL(krc.AdvertisementID, 0) AS AdvertisementID,
	krc.JDate,
	krc.ProfileID,
	ISNULL(krc.SearchProviderID, 0) AS SearchProviderID,
	ISNULL(NULLIF(STCategoryID, '-1'), '0') AS STCategoryID,
	krc.EventTypeID,
	CAST(JDate AS DATETIME) AS CreateDate,
	DATEPART(MONTH, CAST(JDate AS DATETIME)) AS CreateMonth,
	DATEPART(YEAR, CAST(JDate AS DATETIME)) AS CreateYear,
	Keyword,
	NULLIF(PageID, 0) AS PageID,
	CASE
		WHEN AdvertisementID > 0 THEN 1
		ELSE 0
	END AS Paid,
	CAST(MAX(ISNULL(s.SearchProviderID, 0)) AS BIT) AS AllowsCampaigns,
	SUM(ISNULL(Conversions, 0)) AS Orders,
	SUM(ISNULL(Hits, 0)) AS PixelHits,
	SUM(ISNULL(Clicks, 0)) AS Clicks,
	SUM(ISNULL(Impressions, 0)) AS Impressions,
	CAST
	(
		CASE SUM(ISNULL(Impressions, 0))
			WHEN 0 THEN SUM(ISNULL(WeightedRank, 0))
			ELSE SUM(ISNULL(WeightedRank, 0) * ISNULL(Impressions, 0)) / SUM(ISNULL(Impressions, 0))
		END AS MONEY
	) AS WeightedRank,
	SUM(CASE WHEN AdvertisementID > 0 THEN 1 ELSE 0 END) AS AdCount,
	BINARY_CHECKSUM(krc.ProfileID, ISNULL(AdvertisementID, 0), ISNULL(NULLIF(STCategoryID, '-1'), '0')) AS CheckSumID,
	RateAdjustmentCustomRuleID,
	SUM(ISNULL(EstimatedCost, 0)) AS EstimatedCost,
	SUM(ISNULL(EstimatedClicks, 0)) AS EstimatedClicks,
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 3 THEN krc.Revenue ELSE 0 END) AS [OrderRevenue],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 3 THEN krc.Conversions ELSE 0 END) AS [OrderCount],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 4 THEN krc.Revenue ELSE 0 END) AS [EventValue],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 4 THEN krc.Conversions ELSE 0 END) AS [EventCount]
FROM SomeTable krc
	INNER JOIN SomeOtherTable c2a
		ON krc.ProfileID = c2a.ProfileID
		AND c2a.EligibleForGC = 0
	LEFT OUTER JOIN
	(
		SELECT
			s.ProfileID,
			s.SearchProvider,
			s.SearchProviderID
		FROM SomeThirdTable c
			CROSS APPLY SomeFunction(c.ProfileID) s
		WHERE
			s.ProfileID = c.ProfileID
			AND NOT
			(
				s.SearchProviderID = 62
				OR s.ChannelID = 3
				OR
				(
					s.SearchProviderID = 2
					AND c.ProfileID IN
					(
						SELECT
							pa.ProfileID
						FROM SomeFourthTable pa
						WHERE
							pa.SearchProviderID = 2
							AND pa.IsActive = 1
							AND pa.Version <> 'EWS'
							AND s.ProfileID = pa.ProfileID
					)
				)
			)
	) s
		ON s.SearchProviderID = krc.SearchProviderID
		AND s.ProfileID = krc.ProfileID
	LEFT OUTER JOIN SomeFifthTable et
		ON et.EventTypeID = krc.EventTypeID
		AND ISNULL(NULLIF(et.ProfileID, 0), krc.ProfileID) = krc.ProfileID
GROUP BY
	AdvertisementID,
	JDate,
	krc.ProfileID,
	ISNULL(krc.SearchProviderID, 0),
	STCategoryID,
	krc.EventTypeID,
	CAST(JDate AS DATETIME),
	DATEPART(MONTH, CAST(JDate AS DATETIME)),
	DATEPART(YEAR, CAST(JDate AS DATETIME)),
	Keyword,
	NULLIF(PageID, 0),
	CASE
		WHEN AdvertisementID > 0 THEN 1
		ELSE 0
	END,
	RateAdjustmentCustomRuleID;

Even though I haven’t changed a single bit of functionality, I think this version of the code is much easier to maintain going forward.  It is now much easier to see every column, what that column’s alias is, the details of each calculation (and only that calculation—no spilling over into other calculations), and also the level of complexity in our LEFT OUTER JOIN aliased as s.

Again, I want to emphasize that the way that I formatted this procedure is not the only good way to format a procedure.  As long as the end result is readable and fits with accepted development standards on your team, it’s good formatting.  Nevertheless, let me give a few recommendations for how these standards should look:

  • Consistent casing for keywords.  Use capital letters to set keywords like LEFT OUTER JOIN off from attributes.
  • Consistent indentation levels.  I like to have the SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY clauses flush left and the rest of the code indented one level.  You’ll note that I don’t quite follow that rule with the FROM clause, as that’s the standard at my place of work.  As long as all of the FROM clauses are indented the same way, life is good.
  • One identifier per line.  Don’t try to slam multiple attributes, join criteria, or filters on the same line, even if they’re small.  This makes it easier to differentiate your criteria.
  • Use spacing to clarify intent.  Lines 34-37 of the re-formatted procedure are all one-line calculations because the CASE statement just serves to filter revenue and count.  By contrast, the calculation for weighted rank on lines 22-28 is a bit more complex, so I break it up across several lines.  If I have CASE statements within CASE statements, I definitely want to break those up and indent them, as in the sketch below.
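
As a quick illustration of that last bullet, here is a minimal sketch of nested-CASE indentation, using hypothetical column and table names:

SELECT
	CASE
		WHEN OrderStatus = 'Open' THEN
			CASE
				WHEN DaysOpen > 30 THEN 'Stale'
				ELSE 'Active'
			END
		ELSE 'Closed'
	END AS OrderState
FROM dbo.SomeTable;

Each CASE gets its own indentation level, so you can match every WHEN and ELSE to its owning CASE at a glance.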

So let’s say you have several thousand stored procedures to maintain.  Re-formatting procedures takes a long time, so how will we ever get that done?  The answer is simple:  amortize your tech debt.  Every time you need to make a change to a procedure, create test cases for that procedure and bring its formatting up to snuff.  For extremely complex procedures you regularly update, maybe you take care of a portion of the procedure at a time—that’s not great, but it’s still better than nothing.  Similarly, even if you missed something your first go-around, you’ll have another opportunity to fix the code up again later, so aim for improvement instead of perfection.  Working like this, you will improve your code base at the margin.  Sometimes, you’ll get an opportunity to make larger-scale improvements, and I whole-heartedly recommend taking advantage of those opportunities.

Another factor that might help with formatting is getting the right tool for the job.  SQL Server Management Studio doesn’t format code for you at all.  But as C# developers know, tools like ReSharper can be incredible time-savers, and the Management Studio world has its own equivalents; I’ve got personal experience with three of them.  No matter which tool you choose, spend some time customizing it.  Getting auto-formatting to do 98% of the formatting work for me saves huge amounts of time and lets me get away with bigger changes than I might otherwise be able to make.  And if your employer balks at the cost of a tool, think of it this way:  one employee earning $50 an hour needs to save approximately 3 hours a year for one of these tools to pay off.  Play around with the tools a bit and you’ll probably save 3 hours of work per week in terms of time spent re-formatting, reading code, trying to understand code, and testing bugs in code.
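
To spell out that back-of-the-envelope math, here is the break-even calculation as a quick query; the $150 license cost is a hypothetical figure inferred from the numbers above rather than any specific tool’s price:

--Hypothetical figures:  a $150 license and a $50-per-hour labor rate.
DECLARE
	@LicenseCost DECIMAL(10, 2) = 150.00,
	@HourlyRate DECIMAL(10, 2) = 50.00;

SELECT
	@LicenseCost / @HourlyRate AS HoursToBreakEven;  --3 hours

At three hours saved per week rather than per year, the tool pays for itself roughly fifty times over in the first year.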