Approaching Zero: A Case Study

This is part eight in a series on near-zero downtime deployments.

As we wind down this series, I want to talk through a case study, something of which I’m particularly proud. Here’s the scenario: I joined a product team responsible for approximately 200 tables and 800 stored procedures in 2014. This product team used the name ClientID to represent a customer, whereas the rest of the company used the name ProfileID to represent a customer. Now, this isn’t the end of the world by itself, but things got a lot more confusing because the rest of the company also had a competing definition for the term ClientID which was totally different than my team’s definition. This led to a lot of confusion historically and more than one miscommunication between teams. It’s something that prior database developers had wanted to clean up, but it was going to be too much work…and then more and more tables and procedures were added which followed that old model.

Of course, there were a few complications to this project. First, ClientID was part of the primary key of 150+ tables. Second, ClientID was a parameter name in 700+ stored procedures. Third, we had to do all of this without a downtime window.

When I joined the team, I promised our data architect that we’d get this done. She came up with the design and we split the heavy lifting. Here’s what we accomplished:

  • Renamed ClientID to ProfileID on 150+ tables, including changing names on key constraints, indexes, etc. which referenced the term ClientID.
  • Refactored and renamed ClientID to ProfileID on approximately 550 stored procedures across a few instances. Most were centrally located in one schema, but a couple dozen were in miscellaneous other projects.
  • Updated several SSIS packages to use ProfileID instead of ClientID and deprecated several more.
  • Converted ad hoc queries spread around the code base to use stored procedures everywhere, adding about 20 or so more stored procedures with ProfileID.
  • Deprecated approximately 20 tables and 150 stored procedures which were no longer in use.

All of this happened in a single four-week sprint with no extra downtime window. In the end, we messed up one procedure, and out of the hundred or so procedures we dropped, we had to re-add one because it turned out to still see very occasional use. Let’s walk through what we did in a little more detail.

To make it easier to follow, I created the following helpful graphic which woefully understates the amount of work we needed to do.

Phase 1: antebellum.

Early in the Sprint: Pre-Pre-Release

Very early on in the sprint, we worked on reformatting and refactoring procedures to make them consistent and (in some cases) understandable. Ideally, we would have done this in a prior sprint to reduce cognitive load, but we did not want to run the risk of higher-ups saying “Nope, this change is too big” and cancelling our project.

We also went looking for procedures not in use. Some of the places we looked were:

  • The plan cache and procedure cache. For servers which have been online for a while, these give you a pretty decent idea of which procedures get called. A procedure or table not showing up in these is not proof of disuse or irrelevance, but if one does show up, that’s a good indicator that we can’t just drop it.
  • Source control. All of our database code is in source control and we use OpenGrok to search it. OpenGrok isn’t the best tool ever, but it does have an API, so I could automate these checks.
  • SQL Agent jobs, specifically the sysjobsteps table.
  • Metadata tables. This is where my system will differ from yours. We have a few tables strewn about which contain steps for executing procedures in an order. For example, we have one table which an application uses to perform bulk insert operations for different tables.
  • Procedures with broken references. If your procedure is looking for a table or view which isn’t there, and the error log shows no failures, that’s a pretty good indicator that nobody’s calling the procedure.
  • Prior versions of current procedures. People sometimes forget to clean up old versions of procedures, so this was a good opportunity to flush some of these down the drain.
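To sketch the first of those checks: SQL Server exposes execution statistics for cached procedures through a DMV. The columns below are the real DMV columns, but this is only a sketch of the kind of query we ran, not our exact script.

```sql
-- Procedures in the current database which have run since last being cached.
-- Absence from this list is a candidate for disuse, not proof of it;
-- presence means we definitely cannot just drop the procedure.
SELECT
	OBJECT_SCHEMA_NAME(ps.object_id) AS SchemaName,
	OBJECT_NAME(ps.object_id) AS ProcedureName,
	ps.execution_count,
	ps.last_execution_time
FROM sys.dm_exec_procedure_stats ps
WHERE ps.database_id = DB_ID()
ORDER BY ps.last_execution_time DESC;
```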

For an example of the types of queries we built and ran, here’s one to query SQL Agent jobs:

DECLARE
	@SearchString NVARCHAR(3000) = N'ClientID';

SELECT
	sj.name AS JobName,
	sj.enabled AS IsJobEnabled,
	sjs.step_name AS JobStepName,
	sjs.command AS SQLCommand,
	sjs.last_run_date AS JobStepLastRunDate
FROM msdb.dbo.sysjobsteps sjs WITH (NOLOCK)
	INNER JOIN msdb.dbo.sysjobs sj WITH (NOLOCK)
		ON sjs.job_id = sj.job_id
WHERE
	sjs.subsystem = 'TSQL'
	AND sjs.command LIKE '%' + @SearchString + '%';

We dropped about a hundred procedures and twenty tables doing this. That’s nice because it meant a hundred procedures and twenty tables that we didn’t need to modify.

During the Sprint, Still Pre-Pre-Release

During this phase, we created tests for as many procedures as possible to ensure that they still returned the same results after all of our changes. This gave us an important safety net.
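One low-tech way to build that safety net, sketched here with hypothetical procedure, parameter, and column names: capture the procedure’s output before refactoring, capture it again afterward, and diff the two result sets.

```sql
-- Hypothetical harness: load #Before from the original version of the
-- procedure and #After from the refactored version, then diff them.
CREATE TABLE #Before (CustomerID INT, OrderDate DATETIME);
CREATE TABLE #After  (CustomerID INT, OrderDate DATETIME);

INSERT INTO #Before EXEC dbo.DoSomeWork @ClientID = 20;  -- original version
INSERT INTO #After  EXEC dbo.DoSomeWork @ClientID = 20;  -- refactored version

-- Rows in #Before but not #After:
SELECT * FROM #Before EXCEPT SELECT * FROM #After;
-- Rows in #After but not #Before:
SELECT * FROM #After EXCEPT SELECT * FROM #Before;
```

Any rows from either of the final two queries represent a behavioral difference worth investigating.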

We also prepped as much of this work as possible. As you’ll see, there were a lot of changes involved.

Pre-Release Phase

At this point, all of our tables have ClientID as part of the primary key / clustered index. We added a nullable integer ProfileID column to each table which had ClientID on it. Here’s a sample of the type of code you can use to script that out:

SELECT '
IF NOT EXISTS (
		SELECT 1 
		FROM INFORMATION_SCHEMA.COLUMNS 
		WHERE TABLE_SCHEMA = ''' + c.TABLE_SCHEMA + '''
		AND TABLE_NAME = ''' + c.TABLE_NAME + '''
		AND COLUMN_NAME = ''ProfileID''
		)
BEGIN
	PRINT ''ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME	+ ' ADD [ProfileID] [int] NULL''
	ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME	+ ' ADD [ProfileID] [int] NULL
END
GO
'
FROM INFORMATION_SCHEMA.COLUMNS c
INNER JOIN sys.objects o ON o.object_id = OBJECT_ID(c.TABLE_SCHEMA + '.' + c.TABLE_NAME)
WHERE
	c.TABLE_SCHEMA = 'SearchAdvisor'
	AND c.COLUMN_NAME = 'ClientID'
	AND NOT EXISTS (
		SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS c2
		WHERE c.TABLE_SCHEMA = c2.TABLE_SCHEMA
		AND c.TABLE_NAME = c2.TABLE_NAME
		AND c2.COLUMN_NAME = 'ProfileID'
		)
	AND o.type = 'U'
	AND c.TABLE_NAME NOT LIKE '%DELETEME%'

Our pre-release phase started probably 20-30 minutes before the database release. We didn’t need to create any new indexes, backfill any tables, or do any other long-running pre-release actions, so it was a relatively short period of time.

Database Release Phase

This is where the real action began.

Release Temporary Procedures

First, we developed temporary procedures for the release window. Each procedure which accepts a @ClientID parameter got a new @ProfileID parameter as well, defaulted to NULL. In the event that @ProfileID came in as NULL, we’d set @ProfileID = @ClientID.

Each of the current procedures used in the application (which take @ClientID as the parameter) would then have their logic scooped out and put into these temp procedures. The current procedures would simply call the temp procedures. Each temp procedure would have its @ClientID checks replaced with ISNULL(@ProfileID, @ClientID) checks.

To give a concrete example, let’s say there was a procedure called as dbo.DoSomeWork @ClientID = 20;. We would create a temporary procedure dbo.DoSomeWorkTemp @ClientID, @ProfileID and copy the guts of DoSomeWork into the temp procedure. Then, we would replace the innards of DoSomeWork with a call to DoSomeWorkTemp. The key reason for this indirection is that we couldn’t add a new parameter to the existing stored procedures without breaking a lot of data access layer code.
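That wrapper pattern might look like this, using hypothetical object names as a sketch rather than our actual code:

```sql
-- The temp procedure carries the real logic and coalesces the parameters.
CREATE PROCEDURE dbo.DoSomeWorkTemp
	@ClientID INT = NULL,
	@ProfileID INT = NULL
AS
BEGIN
	SET @ProfileID = ISNULL(@ProfileID, @ClientID);

	-- ...the original body of DoSomeWork goes here, filtering on @ProfileID...
END
GO

-- The original procedure keeps its exact signature, so the data access
-- layer never notices; its body becomes a pass-through call.
ALTER PROCEDURE dbo.DoSomeWork
	@ClientID INT
AS
BEGIN
	EXEC dbo.DoSomeWorkTemp @ClientID = @ClientID;
END
GO
```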

Here’s what it looks like now:

Phase 2a – And so it begins.

Update Views and Functions

All views and functions in use (fortunately, there weren’t too many of either) needed to return ISNULL(ProfileID, ClientID) both AS ProfileID and AS ClientID. We needed to return both during this release window.
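A transition view of that sort might look like the following sketch, with a hypothetical view, table, and columns standing in for our real ones:

```sql
-- Surface the coalesced value under both names so that callers on either
-- side of the migration keep working during the release window.
CREATE VIEW dbo.CustomerOrders
AS
SELECT
	ISNULL(o.ProfileID, o.ClientID) AS ProfileID,
	ISNULL(o.ProfileID, o.ClientID) AS ClientID,
	o.OrderDate
FROM dbo.Orders o;
```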

Now the image looks thusly. Kind of.

Phase 2b – We’re starting to modify some views and procedures.

Update SQL Code

For code that comes from SQL Server, we modified code to coalesce ProfileID and ClientID as ISNULL(ProfileID, ClientID) during the release window and to start passing in @ProfileID instead of @ClientID. This included SQL Agent jobs, triggers, stored procedures not used as part of our application code, SSIS packages, etc. That way, during the release window, queries would still return the same results regardless of where we were in the migration process. We could make these changes directly because we didn’t have a relatively brittle data access layer to worry about; we just needed to be able to handle all circumstances.

You might expect that the image will look something like this, and if so, you are correct:

Phase 2c – Definitely not 2b.

Swap ProfileID and ClientID

For each table, we started a transaction, renamed our empty ProfileID column to ProfileID2, renamed ClientID to ProfileID, and added a new ClientID column as a nullable integer. Finally, we dropped ProfileID2 and committed the transaction.

Here’s an example of how you might do this.

SET NOCOUNT ON;
 
SELECT '
IF EXISTS (
		SELECT 1 
		FROM INFORMATION_SCHEMA.COLUMNS 
		WHERE TABLE_SCHEMA = ''' + c.TABLE_SCHEMA + '''
		AND TABLE_NAME = ''' + c.TABLE_NAME + '''
		AND COLUMN_NAME = ''ClientID''
		AND IS_NULLABLE = ''NO''
		)
AND EXISTS (
		SELECT 1 
		FROM INFORMATION_SCHEMA.COLUMNS 
		WHERE TABLE_SCHEMA = ''' + c.TABLE_SCHEMA + '''
		AND TABLE_NAME = ''' + c.TABLE_NAME + '''
		AND COLUMN_NAME = ''ProfileID''
		AND IS_NULLABLE = ''YES''
		)		
BEGIN
	BEGIN TRANSACTION
	PRINT ''EXEC sp_rename ''''' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + '.ProfileID'''', ''''ProfileID2'''', ''''COLUMN'''';''
	EXEC sp_rename ''' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + '.ProfileID'', ''ProfileID2'', ''COLUMN'';
	PRINT ''EXEC sp_rename ''''' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + '.ClientID'''', ''''ProfileID'''', ''''COLUMN'''';''
	EXEC sp_rename ''' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + '.ClientID'', ''ProfileID'', ''COLUMN'';
	PRINT ''ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME	+ ' ADD [ClientID] [int] NULL''
	ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME	+ ' ADD [ClientID] [int] NULL
	PRINT ''ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME	+ ' DROP COLUMN [ProfileID2]''
	ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME	+ ' DROP COLUMN [ProfileID2]
	COMMIT TRANSACTION	
END
GO'
FROM INFORMATION_SCHEMA.COLUMNS c
INNER JOIN sys.objects o ON o.object_id = OBJECT_ID(c.TABLE_SCHEMA + '.' + c.TABLE_NAME)
WHERE
	c.TABLE_SCHEMA = '{{YOURSCHEMA}}'
	AND c.COLUMN_NAME = 'ClientID'
	AND c.IS_NULLABLE = 'NO'
	AND EXISTS (
		SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS c2
		WHERE c.TABLE_SCHEMA = c2.TABLE_SCHEMA
		AND c.TABLE_NAME = c2.TABLE_NAME
		AND c2.COLUMN_NAME = 'ProfileID'
		AND c2.IS_NULLABLE = 'YES'
		)
	AND o.type = 'U'
	AND c.TABLE_NAME NOT LIKE '%DELETEME%'
GO

And this is what our system looks like after this change:

Phase 3 – Our ClientID remains necessary but not for long. Note the gradient swap; that’s some top-tier visualization effort from me.

Rename Indexes and Constraints

After making all of these changes, we then renamed some indexes and constraints which had ClientID in the name. That’s not absolutely required, but it helps with database explainability down the road.
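You can script those renames out the same way as the earlier examples. This is a sketch of the general technique against the real catalog views, not our exact script; as always with generated DDL, review the output before running it.

```sql
-- Generate sp_rename calls for user-table indexes with ClientID in the name.
-- Constraints rename similarly, but with N'OBJECT' as the object type.
SELECT
	'EXEC sp_rename N''' + SCHEMA_NAME(o.schema_id) + '.' + o.name + '.' + i.name
		+ ''', N''' + REPLACE(i.name, 'ClientID', 'ProfileID') + ''', N''INDEX'';'
FROM sys.indexes i
	INNER JOIN sys.objects o
		ON i.object_id = o.object_id
WHERE
	o.type = 'U'
	AND i.name LIKE '%ClientID%';
```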

All in all, the release took about two hours (as I recall). Our normal database release ran somewhere around an hour at the time, but here we had about an order of magnitude more scripts to run against a few thousand databases on dozens of instances.

Database Post-Release

Here I want to take stock of where we’re at. We renamed ClientID on all of these tables to ProfileID. We still have a vestigial ClientID column on our tables, a bunch of temporary procedures, and a bunch of temporary code in SQL Agent jobs, et al. At this point, we’re probably about halfway done.

Deploy Final Versions of New (née Temp) Procedures

This is where we start the decline phase in our story. Our temporary procedures existed as a bridge between the old procedures, which took @ClientID, and the new procedures, which take @ProfileID. In the final versions of the procedures, we replaced @ClientID with @ProfileID in the input parameters and updated any conditional logic within filters to use ProfileID instead of ClientID.

The only remaining use of ClientID in these procedures is as an output in select statements, as we still need that for the old code; by this time, all of those references are ProfileID AS ClientID. Otherwise, these new procedures are built around ProfileID.
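The final shape of such a procedure might look like this, again with hypothetical object names standing in for our real ones:

```sql
-- @ProfileID is now the real parameter; ClientID survives only as an
-- aliased output column for callers which still expect it.
CREATE PROCEDURE dbo.GetCustomerOrders
	@ProfileID INT
AS
BEGIN
	SELECT
		o.ProfileID,
		o.ProfileID AS ClientID,  -- legacy callers still expect this column
		o.OrderDate
	FROM dbo.Orders o
	WHERE o.ProfileID = @ProfileID;
END
```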

We still have the original procedures that the application code uses and they reference our now-finalized procedures. These procedures are now transition procedures—we need them right now but as we move application code over to the new procedures, we can deprecate these. But I’m getting ahead of myself a little bit.

And here is the graphical version of where we are now:

Phase 4 – New procedures are ready to rumble.

Update Non-Code SQL

Next, we need to perform one final update of any SQL Agent jobs, dynamic SQL, triggers, and non-application sources. We want to remove any ISNULL(ProfileID, ClientID) references and replace them simply with references to ProfileID.

We’re getting bluer and bluer in our diagram and that’s a good thing:

Phase 5 – Non-app code is now aware that we have always been at war with ProfileID.

Update Views and Functions

Views and functions should replace ISNULL(ProfileID, ClientID) checks with ProfileID. Like the stored procedures, they should return both ProfileID and ProfileID AS ClientID so we don’t break any calling code or procedures which still expect a ClientID column in the result set.

The post-release phase took a little bit of time to run. I don’t remember at this point just how long it took, but I do remember people starting to get a little antsy due to the sheer number of operations. Our biggest problem during this window was degraded quality of service due to all of the ISNULL(ProfileID, ClientID) checks in filters, but the site was still running and people could do their work.

Here is our current status:

Phase 6 – The endgame approaches.

Post-Post-Release

Now we get to the second bonus round in this great switcheroo. Only after all of the post-release work was done could we get to this phase, where we start doing the final cleanup.

Drop the Empty ClientID Column

All of our tables now have ProfileID as part of the primary key, but we have this vestigial ClientID column. We can safely drop the column from all tables, as nothing references it anymore.
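Scripting that drop follows the same generation pattern as before. This is a sketch of the technique, not our exact script; scope it to your schema and review the output first.

```sql
-- Generate DROP COLUMN statements for the vestigial ClientID columns.
-- The IS_NULLABLE filter limits this to the new, empty columns added
-- during the swap, not any ClientID which somehow escaped the rename.
SELECT
	'ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + ' DROP COLUMN [ClientID];'
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE
	c.COLUMN_NAME = 'ClientID'
	AND c.IS_NULLABLE = 'YES';
```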

Now our image looks like this:

Phase 7 – ClientID is almost a relic of the past.

Deprecate Obsolete Objects

Remember all of those obsolete tables, procedures, views, and functions I mentioned earlier in the article? This is where we dropped about 150 objects. It was a glorious day.

Update Application Code Over Time

We weren’t going to get all of the application code done in a single sprint, so we left this as a lingering task to complete over time. As developers changed code, they could eliminate references to ClientID and to the old wrapper stored procedures, switching to the new versions which explicitly use ProfileID as the parameter name. Once developers made those changes, we were able to remove ClientID from result sets and drop those wrapper procedures.

This change took years and there are probably a few gnarly parts of the code which still use the old ClientID name, but we handled at least 80% of the work within a few months.

Because it’s a long-running operation, my final picture shows the old procedures fading away and application code slowly making the shift.

Phase 8 – Old procedures don’t die; they just fade away. Unless I kill them with a vengeance. Basically, my databases are like reenactments of No Country for Old Men.

Conclusion

This project was interesting primarily because of the limitations we had. If we could have taken a downtime window of 4-6 hours, it would have been so much easier. But that wasn’t in the cards, so we had to find a workaround. The ultimate plan was our architect’s (who I totally won’t call out by accidentally linking her LinkedIn profile here), so she gets the credit for proving that we could do it. My marginal improvements were around source control, dropping lots of objects, and performing insane amounts of work during a single sprint.
