The last couple of days, I’ve spent a lot of time talking about testing (and more testing).  The reason I spent so much time talking about testing when working with legacy SQL is that you probably don’t know exactly what your code base does.  Even if you wrote most (or all) of the SQL in your code base, there’s a lot that the you of three years ago knew that the you of today probably doesn’t:  business requirements, edge cases, and why you took one route over another.  Adding test cases helps you capture some of that information and more importantly, gives you protection when it comes time to take care of the next steps on the list.

The next step on my “working effectively with legacy SQL” list is formatting your SQL.   This is the lowest-risk form of refactoring, as you aren’t making any logical code changes.  The end effect, however, is that your code becomes a lot more readable, and readable code is (potentially) understandable code.  Let me give you an example of what I consider unreadable code, and break down why:

SELECT
	ISNULL(AdvertisementID, 0) AdvertisementID, JDate,krc.ProfileID,ISNULL(krc.SearchProviderID, 0) SearchProviderID,ISNULL(NULLIF(STCategoryID, '-1'), '0')
STCategoryID,
				krc.EventTypeID,
	CAST(JDate AS DATETIME) CreateDate,
	DATEPART(MONTH, CAST(JDate AS DATETIME)) CreateMonth,
DATEPART(YEAR, CAST(JDate AS DATETIME)) CreateYear, Keyword, NULLIF(PageID, 0) PageID,
	CASE
		WHEN AdvertisementID > 0 THEN 1
		ELSE 0 END Paid,
	CAST(MAX(ISNULL(s.SearchProviderID, 0)) AS BIT) AllowsCampaigns,
	SUM
	(ISNULL(Conversions
	, 0)) AS Orders,  SUM(ISNULL
	(Hits, 0)) PixelHits,
		SUM(ISNULL(Clicks, 0)) Clicks,
	SUM(
		ISNULL(Impressions,
0)) AS Impressions,
	CAST(CASE SUM(ISNULL(Impressions, 0)) WHEN 0 THEN SUM(ISNULL(WeightedRank, 0)) ELSE SUM(ISNULL(WeightedRank, 0) * ISNULL(Impressions, 0)) / SUM(ISNULL(Impressions, 0)) END AS MONEY) WeightedRank,
	SUM(CASE WHEN AdvertisementID > 0 THEN 1 ELSE 0 END) AS AdCount,
	BINARY_CHECKSUM(krc.ProfileID, ISNULL(AdvertisementID, 0), ISNULL(NULLIF(STCategoryID, '-1'), '0')) AS CheckSumID,
	RateAdjustmentCustomRuleID,
	SUM(ISNULL(EstimatedCost, 0)) AS EstimatedCost,
	SUM(ISNULL(EstimatedClicks, 0)) AS EstimatedClicks,
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 3 
	THEN krc.Revenue ELSE 0 END) AS [OrderRevenue],
	SUM(
	CASE WHEN 
	ISNULL(et.ParentEventTypeID, 0) = 3 
	THEN 
	krc.Conversions ELSE 0 
	END) AS [OrderCount],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 4 
				THEN 
		krc.Revenue 
				ELSE 0 
			END)
	[EventValue],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 4 THEN krc.Conversions ELSE 0 END) AS [EventCount]
FROM SomeTable krc INNER JOIN SomeOtherTable c2a ON krc.ProfileID = c2a.ProfileID AND c2a.EligibleForGC = 0
	LEFT OUTER JOIN
	(SELECT s.ProfileID, s.SearchProvider, s.SearchProviderID
		FROM SomeThirdTable c
			CROSS APPLY SomeFunction(c.ProfileID) s
		WHERE
			s.ProfileID = c.ProfileID AND NOT (s.SearchProviderID = 62 OR s.ChannelID = 3 OR (s.SearchProviderID = 2 AND c.ProfileID IN (SELECT pa.ProfileID FROM SomeFourthTable pa
			WHERE
							pa.SearchProviderID = 2
							AND pa.IsActive = 1 AND pa.Version <> 'EWS' AND s.ProfileID = pa.ProfileID
					)
				)
			)
	) s
		ON s.SearchProviderID = krc.SearchProviderID
		AND s.ProfileID = krc.ProfileID
	LEFT OUTER JOIN SomeFifthTable et
		ON et.EventTypeID = krc.EventTypeID
		AND ISNULL(NULLIF(et.ProfileID, 0), krc.ProfileID) = krc.ProfileID
GROUP BY
    AdvertisementID,
    JDate,
    krc.ProfileID,
    ISNULL(krc.SearchProviderID, 0),
    STCategoryID,
    krc.EventTypeID,
    CAST(JDate AS DATETIME),
    DATEPART(MONTH, CAST(JDate AS DATETIME)),
    DATEPART(YEAR, CAST(JDate AS DATETIME)),
    Keyword,
    NULLIF(PageID, 0),
    CASE
    	WHEN AdvertisementID > 0 THEN 1
    	ELSE 0
    END,
    RateAdjustmentCustomRuleID;

This is a representative example of bad formatting and although your production stored procedures may not look (quite) this bad, I’m willing to be that this isn’t too far off for a lot of systems.  So let’s enumerate the problems, from top to bottom:

  1. Multiple attributes on one row.  Row number 2 starts off with AdvertisementID and then moves to four other attributes before continuing on row number 3.  Having multiple attributes on a single line makes it easier to conflate attributes, especially when combined with:
  2. Aliases not delimited.  There are two ways in SQL to delimit an alias.  You can do [Alias] = [column details], or you can do [column details] AS [Alias].  Pick one and stick with it throughout your code base.  I personally like the latter but I know there are people who like the former.  In any event, we see “ISNULL(AdvertisementID, 0) AdvertisementID” and that makes it very easy to mistake this for two separate attributes.  It also makes it easy when combined with problem #1 to accidentally confuse a column with an alias.  If you’re re-writing a query, you might forget a comma and turn “ColumnA, ColumnB” into “ColumnA ColumnB” without noticing the problem until it’s too late.
  3. Attributes not table-specified.  Neither AdvertisementID nor JDate has the appropriate table specified.  This makes troubleshooting and extending the query more difficult.  If I add a join to another table with JDate or AdvertisementID, I will get an ambiguity error, and if I want to remove a join from this list, I might not know exactly what ramifications that would have.
  4. Inconsistent spacing.  Note that our first three lines of attributes have three different tabstop levels.  It’s hard when you’re reading a long procedure to tell exactly where you’re at if your tabs are inconsistent.  Conversely, if you keep consistent tab spacing, it’s easy to tell if you’ve accidentally jumped out of an IF block.
  5. Inconsistent parentheses and spacing for operations.  Looking at lines 12-14, it’s hard to tell that this is just one operation, that the next operation is lines 14-15, and that the subsequent operation is lines 16-19.  Look at lines 34-39 for a special blend of problems.
  6. The LEFT OUTER JOIN statement looks terrible.  First of all, we have parentheses which don’t line up, making it hard to tell that there are two correlated sub-queries in our operation.  When you have several parentheses on the same line like this, that’s a pretty bad code smell.  Aside from that, we have multiple attributes on line 43, multiple filters on line 50, and inconsistent indentation levels.
  7. You might not be able to tell from this code example, but the GROUP BY clause uses spaces to indent, whereas the rest of the query uses tabs to indent.  This type of inconsistency may seem minor, but because different tools have different space equivalencies for tabstops (e.g., 4 spaces, 5 spaces, 8 spaces), your code might not look good in all tools if you mix tabstops and spaces.

So that’s a pretty nasty set of procedure code.  Let me show you a different version of the same code, with the only difference being that I fix the seven problems above.

SELECT
	ISNULL(krc.AdvertisementID, 0) AS AdvertisementID,
	krc.JDate,
	krc.ProfileID,
	ISNULL(krc.SearchProviderID, 0) AS SearchProviderID,
	ISNULL(NULLIF(STCategoryID, '-1'), '0') AS STCategoryID,
	krc.EventTypeID,
	CAST(JDate AS DATETIME) AS CreateDate,
	DATEPART(MONTH, CAST(JDate AS DATETIME)) AS CreateMonth,
	DATEPART(YEAR, CAST(JDate AS DATETIME)) AS CreateYear,
	Keyword,
	NULLIF(PageID, 0) AS PageID,
	CASE
		WHEN AdvertisementID > 0 THEN 1
		ELSE 0
	END AS Paid,
	CAST(MAX(ISNULL(s.SearchProviderID, 0)) AS BIT) AS AllowsCampaigns,
	SUM(ISNULL(Conversions, 0)) AS Orders,
	SUM(ISNULL(Hits, 0)) AS PixelHits,
	SUM(ISNULL(Clicks, 0)) AS Clicks,
	SUM(ISNULL(Impressions, 0)) AS Impressions,
	CAST
	(
		CASE SUM(ISNULL(Impressions, 0))
			WHEN 0 THEN SUM(ISNULL(WeightedRank, 0))
			ELSE SUM(ISNULL(WeightedRank, 0) * ISNULL(Impressions, 0)) / SUM(ISNULL(Impressions, 0))
		END AS MONEY
	) AS WeightedRank,
	SUM(CASE WHEN AdvertisementID > 0 THEN 1 ELSE 0 END) AS AdCount,
	BINARY_CHECKSUM(krc.ProfileID, ISNULL(AdvertisementID, 0), ISNULL(NULLIF(STCategoryID, '-1'), '0')) AS CheckSumID,
	RateAdjustmentCustomRuleID,
	SUM(ISNULL(EstimatedCost, 0)) AS EstimatedCost,
	SUM(ISNULL(EstimatedClicks, 0)) AS EstimatedClicks,
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 3 THEN krc.Revenue ELSE 0 END) AS [OrderRevenue],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 3 THEN krc.Conversions ELSE 0 END) AS [OrderCount],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 4 THEN krc.Revenue ELSE 0 END) AS [EventValue],
	SUM(CASE WHEN ISNULL(et.ParentEventTypeID, 0) = 4 THEN krc.Conversions ELSE 0 END) AS [EventCount]
FROM SomeTable krc
	INNER JOIN SomeOtherTable c2a
		ON krc.ProfileID = c2a.ProfileID
		AND c2a.EligibleForGC = 0
	LEFT OUTER JOIN
	(
		SELECT
			s.ProfileID,
			s.SearchProvider,
			s.SearchProviderID
		FROM SomeThirdTable c
			CROSS APPLY SomeFunction(c.ProfileID) s
		WHERE
			s.ProfileID = c.ProfileID
			AND NOT
			(
				s.SearchProviderID = 62
				OR s.ChannelID = 3
				OR
				(
					s.SearchProviderID = 2
					AND c.ProfileID IN
					(
						SELECT
							pa.ProfileID
						FROM SomeFourthTable pa
						WHERE
							pa.SearchProviderID = 2
							AND pa.IsActive = 1
							AND pa.Version <> 'EWS'
							AND s.ProfileID = pa.ProfileID
					)
				)
			)
	) s
		ON s.SearchProviderID = krc.SearchProviderID
		AND s.ProfileID = krc.ProfileID
	LEFT OUTER JOIN SomeFifthTable et
		ON et.EventTypeID = krc.EventTypeID
		AND ISNULL(NULLIF(et.ProfileID, 0), krc.ProfileID) = krc.ProfileID
GROUP BY
	AdvertisementID,
	JDate,
	krc.ProfileID,
	ISNULL(krc.SearchProviderID, 0),
	STCategoryID,
	krc.EventTypeID,
	CAST(JDate AS DATETIME),
	DATEPART(MONTH, CAST(JDate AS DATETIME)),
	DATEPART(YEAR, CAST(JDate AS DATETIME)),
	Keyword,
	NULLIF(PageID, 0),
	CASE
		WHEN AdvertisementID > 0 THEN 1
		ELSE 0
	END,
	RateAdjustmentCustomRuleID;

Even though I haven’t changed a single bit of functionality, I think this version of the code is much easier to maintain going forward.  It is now much easier to see every column, what that column’s alias is, the details of each calculation (and only that calculation—no spilling over into other calculations), and also the level of complexity in our LEFT OUTER JOIN aliased as s.

Again, I want to emphasize that the way that I formatted this procedure is not the only good way to format a procedure.  As long as the end result is readable and fits with accepted development standards on your team, it’s good formatting.  Nevertheless, let me give a few recommendations for how these standards should look:

  • Consistent casing for keywords.  Use capital letters to offset statements like LEFT OUTER JOIN from attributes.
  • Consistent indentation levels.  I like to have the SELECT, FROM, WHERE, HAVING, GROUP BY, and ORDER clauses flush and the rest of the code indented one line.  You’ll note that I don’t quite follow that rule with the FROM clause, as that’s the standard at my place of work.  As long as all of the FROM clauses are indented the same way, life is good.
  • One identifier per line.  Don’t try to slam multiple attributes, join criteria, or filters on the same line, even if they’re small.  This makes it easier to differentiate your criteria.
  • Use spacing to clarify intent.  Lines 34-37 of the re-formatted procedure are all one-line calculations because the CASE statement just serves to filter revenue and count.  By contrast, the calculation for weighted rank on lines 23-28 is a bit more complex, so I break it up across several lines.  If I have CASE statements within CASE statements, I definitely want to break those up and indent them.

So let’s say you have several thousand stored procedures to maintain.  Re-formatting procedures takes a long time, so how will we ever get that done?  The answer is simple:  amortize your tech debt.  Every time you need to make a change to a procedure, create test cases for that procedure and bring its formatting up to snuff.  For extremely complex procedures you regularly update, maybe you take care of a portion of the procedure at a time—that’s not great, but it’s still better than nothing.  Similarly, even if you missed something your first go-around, you’ll have another opportunity to fix the code up again later, so aim for improvement instead of perfection.  Working like this, you will improve your code base at the margin.  Sometimes, you’ll get an opportunity to make larger-scale improvements, and I whole-heartedly recommend taking advantage of those opportunities.

Another factor which might help with formatting is getting the right tool for the job.  SQL Server Management Studio doesn’t format at all.  But as C# developers know, using tools like ReSharper can be incredible time-savers.  Over in the Management Studio world, I’ve got personal experience with three tools.  No matter which tool you choose, spend some time customizing it.  Getting auto-formatting to do 98% of the formatting work for me saves huge amounts of time and lets me get away with bigger changes than I might otherwise be able to make.  And if your employer balks at the cost of a tool, think of it this way:  one employee earning $50 an hour needs to save approximately 3 hours a year for one of these tools to pay off.  Play around with the tools a bit and you’ll probably save 3 hours of work per week in terms of time spent re-formatting, reading code, trying to understand code, and testing bugs in code.

Advertisement

2 thoughts on “Format Your SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s