Course Review: Data Storytelling

This is a review of Troy Kranendonk’s Pluralsight course entitled Data Storytelling: Moving Beyond Static Data Visualizations.

This is a short introduction to some of the concepts behind data storytelling. Considering that the entire course comes in at approximately 45 minutes, you won’t leave this course an expert on any one thing, but I think Troy does a good job laying out the groundwork to get better at data storytelling. Troy’s course covers advocacy, with some discussion of when to use particular storytelling elements such as tooltips, interactions, and eye-catching images.

One of the most compelling things I got out of this course was his work on animated GIFs. Troy’s last two sections cover this topic and he gives us a few good resources on the topic. Animating images is one of my weaker areas in the data visualization space so I was particularly interested in the topic. I think Troy did a good job of providing introductory content in that regard.

All in all, if you have a Pluralsight account, check out this course. It’s the type of course you can knock out on a plane ride and doesn’t drag on like an 8-hour slog.

Upcoming Events: SQL Saturday Cincinnati

Key Details

What: SQL Saturday Cincinnati
When: Saturday, March 30, 2019.
Where: College of Education, Criminal Justice, and Human Services University of Cincinnati, 250H Teachers-Dyer Complex, Cincinnati, Ohio, 45221
Admission is free. Sign up at the SQL Saturday website.

What I’m Presenting

4:00 PM — 5:00 PM — Getting Started with Apache Spark

This is the second year for SQL Saturday Cincinnati. After a successful first year, I will be interested to see what they do to kick it up a notch.

If luck holds, I will also get to pay a visit to Dayton (home of my alma mater) and Wright-Patterson Air Force Base (home of really cool airplanes).

Vote For SPARK-27006

In lieu of a PolyBase Revealed post today, I’m going to ask a favor of you. Yes, you, the person wearing the thing, right.

Vote For Duke

There is a Jira ticket for the Apache Spark project, SPARK-27006. The gist of this ticket is to bring .NET support to Spark, specifically by supporting DataFrames in C# (and hopefully F#). No support for Datasets or RDDs is included here, but giving .NET developers DataFrame access would make it easy for us to write code which interacts with Spark SQL and a good chunk of the SparkSession object.

One of the more interesting parts of this is the support of C# UDFs and UDAFs, allowing us to write potentially complicated functions in .NET and execute them directly against data in Spark.

If you like the idea of being able to write C# (or hopefully F#) code against Spark clusters, I have a request to make of you: vote for this Jira ticket. If you do not already have a Jira account for Apache projects, you will need to create one, but that’s a couple of minutes’ work. Then click the “Vote for this issue” link on the right-hand side. Having a large number of votes for this issue will show the Spark community that there’s value in bringing .NET support in as a first-class citizen. It also tells Microsoft that there’s a community interested in the topic, so they should dedicate time and get it right.

Approaching Zero: Wrapping Up

This is part nine in a series on near-zero downtime deployments.

What Have We Learned?

Let’s summarize what we’ve learned in this series. I’ll do that with a series of headers and way too short text bits. Like this one. And links. Like this one to the introduction.

Is This Something You NEED?

I think this is the first question you have to ask yourself. Are you in an environment where extended downtime isn’t an option? I’ve been in places where the last person touching a server is usually done by 8 or 9 PM local time so I’d have 8-10 hours of time to futz about before anybody noticed.

In that kind of scenario, you can make a reasonable argument that no, you should not follow these practices. These practices extend the amount of time developers spend writing database code to get the same amount of end product done, so from a developer’s perspective it can feel like wasted effort.

Get Your Pieces in Place First

There are a few things you really want to have in place (and a few more that help) before you pare down to near-zero downtime:

  • Code in source control with a repeatable release process and some semblance of continuous integration.
  • Most of your data access going through stored procedures. This doesn’t need to be 100% but it needs to be way higher than 0%.
  • You have acknowledgement from the business side that this is a trade-off: they’re trading developer time on certain business features for product reliability (which is itself a feature). The things developers need to do to reduce downtime windows necessarily means slower delivery of features compared to the alternative.

Think in Terms of Steps

Even if you work in a pure continuous deployment environment, I think it’s still good to think in terms of individual steps: pre-release, database release, code release, database post-release. They don’t need to be cordoned-off blocks of time on a calendar (though it’s fine if they are). They need to be concepts to keep in mind. Pre-release is all about preparation to move from one state to the next. Database release gets you ready for code release by ensuring that old and new versions of code can play nice with the same database. Database post-release gets rid of the cruft you’ve built up.

Procedure Changes are Easy

Working with stored procedures makes near-zero downtime deployments easy. Working with ad hoc code makes it difficult to the point where certain changes may become practically impossible.

Table Changes are Usually Easy, as Are Index Changes

Changing tables is usually pretty easy if you have stored procedures fronting your database code. The downside is that “usually” doesn’t mean “always” and some types of table changes might require extended foresight. For example, changing a column type from NTEXT to NVARCHAR(MAX) is not a one-step operation if you care about blocking.
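
To make that concrete, here is a minimal sketch of the multi-step NTEXT conversion, assuming a hypothetical dbo.Messages table with an NTEXT column named MessageText:

```sql
-- Step 1 (pre-release): add the replacement column.
ALTER TABLE dbo.Messages ADD MessageTextNew NVARCHAR(MAX) NULL;
GO
-- Step 2 (pre-release): backfill in small batches to keep blocking short.
-- Run this in a loop until it affects zero rows.
UPDATE TOP (5000) m
SET m.MessageTextNew = CAST(m.MessageText AS NVARCHAR(MAX))
FROM dbo.Messages m
WHERE
	m.MessageTextNew IS NULL
	AND m.MessageText IS NOT NULL;
GO
-- Step 3 (database release): swap the column names inside a transaction.
BEGIN TRANSACTION;
EXEC sp_rename 'dbo.Messages.MessageText', 'MessageTextOld', 'COLUMN';
EXEC sp_rename 'dbo.Messages.MessageTextNew', 'MessageText', 'COLUMN';
COMMIT TRANSACTION;
GO
-- Step 4 (post-release): drop the old NTEXT column.
ALTER TABLE dbo.Messages DROP COLUMN MessageTextOld;
```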

Index changes independent of constraint changes are also pretty easy, with a minor schema modification lock at the time the index becomes available for us to use.

Constraint Changes are Usually NOT Easy

Most constraint changes are tricky. Creating primary and unique keys works like creating indexes: minor blocking for a very short time frame. The story is not as nice with foreign key or check constraints. For those, we cause blocking throughout the entire operation: for foreign key constraints, we block both the table we create the key on and the table we reference. With those constraints, you’re going to want to create a new table, backfill it, and swap names at the end.
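
As a rough sketch with hypothetical tables (dbo.Orders and dbo.OrderLines are made up for illustration), the difference looks like this:

```sql
-- Quick: behaves like an index build, with only brief blocking.
ALTER TABLE dbo.OrderLines ADD CONSTRAINT PK_OrderLines PRIMARY KEY (OrderLineID);

-- Not quick: WITH CHECK scans every existing row and blocks both
-- dbo.OrderLines and the referenced dbo.Orders for the duration of the scan.
ALTER TABLE dbo.OrderLines WITH CHECK
	ADD CONSTRAINT FK_OrderLines_Orders
	FOREIGN KEY (OrderID) REFERENCES dbo.Orders (OrderID);
```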

Identity Column Changes are Sort of Easy

Some kinds of identity column changes, such as reseeding values, are easy. We also learned about some techniques that make adding an identity value after the fact or changing the increment pretty easy. If you want to add a primary key on top of that, you’ll have more process but can pull it off.

Parting Thoughts

The upside to getting into near-zero downtime is that you really get a stronger understanding of how the database engine you’re using works with respect to locking and changes. Being able to reason through changes with limited customer impact is a key consideration for a great brownfield developer and at the end of the day, most work is brownfield. If you want a specific case study of brownfield development, here you go.

This wraps up the Approaching Zero series of posts. If you liked it, good for you. If you hated it, good for you as well, but I have to wonder why you’d take the time to read tens of thousands of words of something you hated. Still, I appreciate that commitment to hatred.

If you have something to add, I’d love to hear about it in the comments or as a series of passive-aggressive sub-tweets.

Approaching Zero: A Case Study

This is part eight in a series on near-zero downtime deployments.

As we wind down this series, I want to talk through a case study, something of which I’m particularly proud. Here’s the scenario: I joined a product team responsible for approximately 200 tables and 800 stored procedures in 2014. This product team used the name ClientID to represent a customer, whereas the rest of the company used the name ProfileID to represent a customer. Now, this isn’t the end of the world by itself, but things got a lot more confusing because the rest of the company also had a competing definition for the term ClientID which was totally different than my team’s definition. This led to a lot of confusion historically and more than one miscommunication between teams. It’s something that prior database developers had wanted to clean up, but it was going to be too much work…and then more and more tables and procedures were added which followed that old model.

Of course there were a couple of complications to this project. First, ClientID was part of the primary key of 150+ tables. Second, ClientID was a parameter name in 700+ stored procedures. Third, we had to do this without a downtime window.

When I joined the team, I promised our data architect that we’d get this done. She came up with the design and we split the heavy lifting. Here’s what we accomplished:

  • Renamed ClientID to ProfileID on 150+ tables, including changing names on key constraints, indexes, etc. which referenced the term ClientID.
  • Refactored and renamed ClientID to ProfileID on approximately 550 stored procedures across a few instances. Most were centrally located in one schema, but a couple dozen were in miscellaneous other projects.
  • Updated several SSIS packages to use ProfileID instead of ClientID and deprecated several more.
  • Converted ad hoc queries spread around the code base to use stored procedures everywhere, adding about 20 or so more stored procedures with ProfileID.
  • Deprecated approximately 20 tables and 150 stored procedures which were no longer in use.

All of this happened in a single four-week sprint with no extra downtime window. In the end, we messed up one procedure, and out of the hundred or so procedures we dropped, we had to re-add just one because it turned out to be in very occasional use. Let’s walk through what we did in a little more detail.

To make it easier to follow, I created the following helpful graphic which woefully understates the amount of work we needed to do.

Phase 1: antebellum.

Early in the Sprint: Pre-Pre-Release

Very early on in the sprint, we worked on reformatting and refactoring procedures to make them consistent and (in some cases) understandable. Ideally, we would have done this in a prior sprint to reduce cognitive load, but we did not want to run the risk of higher-ups saying “Nope, this change is too big” and cancelling our project.

We also went looking for procedures not in use. Some of the places we looked were:

  • The plan cache and procedure cache. For servers which are online for a while, you get a pretty decent idea of which procedures get called. A procedure or table not showing up in these is not indicative of disuse or irrelevance, but if one does show up, it is a good indicator that we can’t just drop it.
  • Source control. All of our database code is in source control and we use OpenGrok to search it. OpenGrok isn’t the best tool ever but it does have an API so I could automate these checks.
  • SQL Agent jobs, specifically the sysjobsteps table.
  • Metadata tables. This is where my system will differ from yours. We have a few tables strewn about which contain steps for executing procedures in an order. For example, we have one table which an application uses to perform bulk insert operations for different tables.
  • Procedures with broken references. If a procedure is looking for a table or view which isn’t there, and if the error log doesn’t have any record of failures, that’s a pretty good indicator that nobody’s using that procedure.
  • Prior versions of current procedures. People sometimes forget to clean up old versions of procedures, so this was a good opportunity to flush some of these down the drain.
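
For the plan cache check, a query along these lines (a sketch against the standard execution-stats DMV) shows which procedures have executed since the last instance restart:

```sql
SELECT
	OBJECT_SCHEMA_NAME(ps.object_id, ps.database_id) AS SchemaName,
	OBJECT_NAME(ps.object_id, ps.database_id) AS ProcedureName,
	ps.execution_count,
	ps.last_execution_time
FROM sys.dm_exec_procedure_stats ps
WHERE ps.database_id = DB_ID()
ORDER BY ps.last_execution_time DESC;
```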

For an example of the types of queries we built and ran, here’s one to query SQL Agent jobs:

DECLARE
	@SearchString NVARCHAR(3000) = N'dbo';

SET @SearchString = 'ClientID';

SELECT
	sj.name AS JobName,
	sj.enabled AS IsJobEnabled,
	sjs.step_name AS JobStepName,
	sjs.command AS SQLCommand,
	sjs.last_run_date AS JobStepLastRunDate
FROM msdb.dbo.sysjobsteps sjs WITH (NOLOCK)
	INNER JOIN msdb.dbo.sysjobs sj WITH (NOLOCK)
		ON sjs.job_id = sj.job_id
WHERE
	sjs.subsystem = 'TSQL'
	AND sjs.command LIKE '%' + @SearchString + '%';

We dropped about a hundred procedures and twenty tables doing this. That’s nice because it meant a hundred procedures and twenty tables that we didn’t need to modify.

During the Sprint, Still Pre-Pre-Release

During this phase, we created tests for as many procedures as possible to ensure that they still returned the same results after all of our changes. This gave us an important safety net.

We also prepped as much of this work as possible. As you’ll see, there were a lot of changes involved.

Pre-Release Phase

At this point, all of our tables have ClientID as part of the primary key / clustered index. We added a ProfileID as a nullable integer to each table with ClientID on it. Here’s a sample of the type of code you can use to do that:

DECLARE @sql NVARCHAR(MAX) = N'';

SELECT @sql = @sql + N'IF NOT EXISTS
(
	SELECT 1
	FROM INFORMATION_SCHEMA.COLUMNS c2
	WHERE
		c2.TABLE_SCHEMA = ''' + c.TABLE_SCHEMA + '''
		AND c2.TABLE_NAME = ''' + c.TABLE_NAME + '''
		AND c2.COLUMN_NAME = ''ProfileID''
)
BEGIN
	PRINT ''ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + ' ADD [ProfileID] [int] NULL'';
	ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + ' ADD [ProfileID] [int] NULL;
END
'
FROM INFORMATION_SCHEMA.COLUMNS c
	INNER JOIN sys.objects o ON o.object_id = OBJECT_ID(c.TABLE_SCHEMA + '.' + c.TABLE_NAME)
WHERE
	c.TABLE_SCHEMA = 'SearchAdvisor'
	AND c.COLUMN_NAME = 'ClientID'
	AND o.type = 'U';

EXEC sp_executesql @sql;

Our pre-release phase started probably 20-30 minutes before the database release. We didn’t need to create any new indexes, backfill any tables, or do any other long-running pre-release actions, so it was a relatively short period of time.

Database Release Phase

This is where the real action began.

Release Temporary Procedures

First, we developed temporary procedures for the release window. Each procedure which accepts a @ClientID parameter had a new @ProfileID parameter added as well and defaulted this to NULL. In the event that @ProfileID came in as NULL, we’d set @ProfileID = @ClientID.

Each of the current procedures used in the application (which take @ClientID as the parameter) would then have their logic scooped out and put into these temp procedures. The current procedures would call the temp procedures. Each temp procedure would have @ClientID checks replaced with ISNULL(@ProfileID, @ClientID) checks.

To give a concrete example, let’s say there was a procedure called as dbo.DoSomeWork @ClientID = 20;. We would create a temporary procedure dbo.DoSomeWorkTemp @ClientID, @ProfileID and copy the guts of DoSomeWork into the temp procedure. Then, we replace the innards of DoSomeWork with a call to DoSomeWorkTemp. The key reason for this indirection is that we couldn’t add a new parameter to these stored procedures without breaking a lot of data access layer code.
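
A minimal sketch of that wrapper pattern, with the table and filtering logic made up for illustration:

```sql
CREATE OR ALTER PROCEDURE dbo.DoSomeWorkTemp
	@ClientID INT = NULL,
	@ProfileID INT = NULL
AS
BEGIN
	SET @ProfileID = ISNULL(@ProfileID, @ClientID);

	-- The guts scooped out of dbo.DoSomeWork, now able to filter on either column.
	SELECT o.OrderID, o.OrderDate
	FROM dbo.Orders o
	WHERE ISNULL(o.ProfileID, o.ClientID) = @ProfileID;
END
GO
CREATE OR ALTER PROCEDURE dbo.DoSomeWork
	@ClientID INT
AS
BEGIN
	-- The original signature survives, so data access layer code keeps working.
	EXEC dbo.DoSomeWorkTemp @ClientID = @ClientID;
END
```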

Here’s what it looks like now:

Phase 2a – And so it begins.

Update Views and Functions

All views and functions in use (fortunately, there weren’t too many of either) needed to return ISNULL(ProfileID, ClientID) both AS ProfileID and AS ClientID. We needed to return both during this release window.
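
For a hypothetical view (the view and table here are made up for illustration), the release-window version would look something like:

```sql
CREATE OR ALTER VIEW dbo.CustomerOrders
AS
SELECT
	ISNULL(o.ProfileID, o.ClientID) AS ProfileID,
	ISNULL(o.ProfileID, o.ClientID) AS ClientID,
	o.OrderID,
	o.OrderDate
FROM dbo.Orders o;
```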

Now the image looks thusly. Kind of.

Phase 2b – We’re starting to modify some views and procedures.

Update SQL Code

For code that comes from SQL Server, we modified code to coalesce ProfileID and ClientID as ISNULL(ProfileID, ClientID) during the release window and to start passing in @ProfileID instead of @ClientID. This included SQL Agent jobs, triggers, stored procedures not used as part of our application code, SSIS packages, etc. That way, during the release window, queries would still return the same results regardless of where we were in the migration process. We could make these changes directly because this code didn’t have a relatively brittle data access layer to worry about; we just needed to be able to handle all circumstances.

You might expect that the image will look something like this, and if so, you are correct:

Phase 2c – Definitely not 2b.

Swap ProfileID and ClientID

For each table, we started a transaction to rename our empty ProfileID column to ProfileID2. We then renamed ClientID to ProfileID and added a new ClientID column as a nullable integer. Finally, we dropped ProfileID2 and closed the transaction.

Here’s an example of how you might do this.

DECLARE @sql NVARCHAR(MAX) = N'';

SELECT @sql = @sql + N'BEGIN TRANSACTION;
PRINT ''EXEC sp_rename ''''' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + '.ProfileID'''', ''''ProfileID2'''', ''''COLUMN'''';'';
EXEC sp_rename ''' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + '.ProfileID'', ''ProfileID2'', ''COLUMN'';
PRINT ''EXEC sp_rename ''''' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + '.ClientID'''', ''''ProfileID'''', ''''COLUMN'''';'';
EXEC sp_rename ''' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + '.ClientID'', ''ProfileID'', ''COLUMN'';
PRINT ''ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + ' ADD [ClientID] [int] NULL'';
ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + ' ADD [ClientID] [int] NULL;
ALTER TABLE ' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + ' DROP COLUMN [ProfileID2];
COMMIT TRANSACTION;
'
FROM INFORMATION_SCHEMA.COLUMNS c
	INNER JOIN sys.objects o ON o.object_id = OBJECT_ID(c.TABLE_SCHEMA + '.' + c.TABLE_NAME)
WHERE
	c.COLUMN_NAME = 'ClientID'
	AND EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS c2
		WHERE c2.TABLE_SCHEMA = c.TABLE_SCHEMA AND c2.TABLE_NAME = c.TABLE_NAME
			AND c2.COLUMN_NAME = 'ProfileID')
	AND o.type = 'U';

EXEC sp_executesql @sql;

And this is what our system looks like after this change:

Phase 3 – Our ClientID remains necessary but not for long. Note the gradient swap; that’s some top-tier visualization effort from me.

Rename Indexes and Constraints

After making all of these changes, we then renamed some indexes and constraints which had ClientID in the name. That’s not absolutely required, but it helps with database explainability down the road.

All in all, this took about 2 hours to do everything (as I recall). Our normal database release was somewhere around an hour or so at the time, but we had about an order of magnitude more scripts to run against a few thousand databases on dozens of instances.

Database Post-Release

Here I want to take stock of where we’re at. We renamed ClientID on all of these tables to ProfileID. We still have a vestigial ClientID column on our tables, a bunch of temporary procedures, and a bunch of temporary code in SQL Agent jobs, et al. At this point, we’re probably about halfway done.

Deploy Final Versions of New (née Temp) Procedures

This is where we start the decline phase in our story. Our temporary procedures existed as a bridge from the old procedures which took ClientID and new procedures which will take ProfileID. With our final versions of procedures, we replace @ClientID with @ProfileID in the input parameters and update any conditional logic within filters to use ProfileID instead of ClientID.

The only remaining use of ClientID in these procedures is as an output in select statements, as we still need that for the old code; by this time, all of those references are ProfileID AS ClientID. Otherwise, these new procedures are built around ProfileID.
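
Using the dbo.DoSomeWorkTemp example from before (table and columns hypothetical), the finalized procedure might look like:

```sql
CREATE OR ALTER PROCEDURE dbo.DoSomeWorkTemp
	@ProfileID INT
AS
BEGIN
	-- ProfileID is now the real column; ClientID survives only as an output alias.
	SELECT
		o.OrderID,
		o.OrderDate,
		o.ProfileID,
		o.ProfileID AS ClientID
	FROM dbo.Orders o
	WHERE o.ProfileID = @ProfileID;
END
```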

We still have the original procedures that the application code uses and they reference our now-finalized procedures. These procedures are now transition procedures—we need them right now but as we move application code over to the new procedures, we can deprecate these. But I’m getting ahead of myself a little bit.

And here is the graphical version of where we are now:

Phase 4 – New procedures are ready to rumble.

Update Non-Code SQL

Next, we need to perform one final update of any SQL Agent jobs, dynamic SQL, triggers, and non-application sources. We want to remove any ISNULL(ProfileID, ClientID) references and replace them simply with references to ProfileID.

We’re getting bluer and bluer in our diagram and that’s a good thing:

Phase 5 – Non-app code is now aware that we have always been at war with ProfileID.

Update Views and Functions

Views and functions should replace ISNULL(ProfileID, ClientID) checks with ProfileID. Like stored procedures, they should return both ProfileID and ProfileID AS ClientID so we don’t break any calling code or procedures which still expect that ClientID record in the result set.
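
As a sketch with a hypothetical view and table, the final version would become:

```sql
CREATE OR ALTER VIEW dbo.CustomerOrders
AS
SELECT
	o.ProfileID,
	o.ProfileID AS ClientID,
	o.OrderID,
	o.OrderDate
FROM dbo.Orders o;
```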

The post-release phase took a little bit of time to run. I don’t remember at this point just how long it took, but I do remember people starting to get a little antsy due to the sheer number of operations. Our biggest problem during this window was degraded quality of service due to all of the ISNULL(ProfileID, ClientID) checks in filters, but the site was still running and people could do their work.

Here is our current status:

Phase 6 – The endgame approaches.


Now we get to the second bonus round in this great switcharoo. Only after all of the post-release work was done could we get to this phase, where we start doing final cleanup.

Drop the Empty ClientID Column

All of our tables now have ProfileID as part of the primary key, but we have this vestigial ClientID column. We can safely drop the column from all tables, as nothing references it anymore.

Now our image looks like this:

Phase 7 – ClientID is almost a relic of the past.

Deprecate Obsolete Objects

Remember all of those obsolete tables, procedures, views, and functions I mentioned earlier in the article? This is where we dropped about 150 objects. It was a glorious day.

Update Application Code Over Time

We weren’t going to get all of the application code done in a single sprint, so we left this as a lingering task to complete over time. As developers changed code, we could eliminate references to ClientID in the code and eliminate references to the old wrapper stored procedures, having developers use the new versions which explicitly use ProfileID as parameter names. Once developers made those changes, we were able to remove ClientID from result sets and drop those wrapper procedures.

This change took years and there are probably a few gnarly parts of the code which still use the old ClientID name, but we handled at least 80% of the work within a few months.

Because it’s a long-running operation, my final picture shows the old procedures fading away and application code slowly making the shift.

Phase 8 – Old procedures don’t die; they just fade away. Unless I kill them with a vengeance. Basically, my databases are like reenactments of No Country for Old Men.


This project was interesting primarily because of the limitations we had. If we could have taken a downtime window of 4-6 hours, it would have been so much easier. But that wasn’t in the cards, so we had to find a workaround. The ultimate plan was our architect’s (who I totally won’t call out by accidentally linking her LinkedIn profile here), so she gets the credit for proving that we could do it. My marginal improvements were around source control, dropping lots of objects, and performing insane amounts of work during a single sprint.

Upcoming Events: SQL Saturday Spokane

Key Details

What: SQL Saturday Spokane
Where: EWU Computer Science and Engineering Building, 837-979 Washington St., Cheney, Washington, 99004
When: Saturday, March 23rd, all day
Admission is free. Sign up at the SQL Saturday website.

What I’m Presenting

10:15 AM — 11:30 AM — Data Cleaning with SQL and R

Bonus Victory Burrito Content

Cheney, Washington is the home of the best burrito I have ever had. If you want to join me for an amazing burrito, go to Arturo’s Mexican Restaurant. The likelihood of you regretting it is low.

PolyBase Revealed: MRAppMaster Errors

Let me tell you about one of my least favorite things to see in PolyBase:

Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

This error is not limited to PolyBase but is instead an issue when trying to run MapReduce jobs in Hadoop. There are several potential causes, so let’s cover each of them as they relate to PolyBase and hopefully one of these solves your issue.

SQL Server mapred-site.xml Needs Fixed

The first potential cause of this issue is that your mapred-site.xml file in SQL Server needs a little something-something added to it. Specifically, make sure that it has the following property:


If you don’t know what I’m talking about, I have an older post on this topic.

SQL Server yarn-site.xml Needs Fixed

If the first option didn’t work, check out your yarn-site.xml file and ensure that it points to the right location. If you’re using Hortonworks Data Platform, be sure to check the configuration post I just linked to, because the default configuration in HDP’s yarn-site.xml will cause you problems.

Hadoop yarn-site.xml Needs Fixed

There is a chance that your problem isn’t on the SQL Server side—it could be on the Hadoop side. Check your yarn-site.xml file in Hadoop and ensure that the classpath is correct. Here is what I have for my yarn.application.classpath value:


The hadoop_home parameter points to someplace like /usr/hdp/ or whatever your specific version of HDP is. That makes it a bit more stable than the SQL Server side, where we don’t have the parameter.

I found this last cause particularly interesting because I have had success with MapReduce jobs before, but suddenly my old yarn-site.xml settings for HDP 3.0 (the default settings) stopped working and MapReduce jobs wouldn’t work again until I modified yarn-site.xml to correspond with what I use in SQL Server.

Hadoop mapred-site.xml Needs Fixed

One additional option pops up in the error message itself. Here is the full error message:

Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>

It’s looking for three particular properties in the Hadoop-side mapred-site.xml file, so make sure those are there. In my case, they turned out to be there already but considering that it’s in the error message, that makes me think this is a common enough cause that it’s worth mentioning.


There are several potential causes for a missing MRAppMaster class when creating MapReduce jobs and the causes tend to revolve around configuring yarn-site.xml and mapred-site.xml on your SQL Server instance and on your Hadoop cluster.

PASS Summit 2019 Precon Speakers Announced

PASS has announced their precon speakers for PASS Summit 2019. I’ll be there giving a precon on something or another. Because they haven’t announced any details, I won’t divulge any secrets, but it’ll be a great full-day training on a really cool topic.

Stay tuned for more details and register in the meantime while prices are still fairly low. Just looking at the speaker lineup, there’s going to be a great spread in topics. But my topic’s the coolest.

Approaching Zero: Identity Column Changes

This is part seven in a series on near-zero downtime deployments.

Identity columns are common enough in the SQL Server world that I figured I would spend some time on them today. I’m also going to include sequences in here because it’s a pretty short post otherwise. And if I learned nothing else from German philosophers, turgid and obscurantist prose is the best prose.

Reseeding an Existing Identity Column

Database pre-release
Database release: Reseed value
Code release
Database post-release

There are some tables where you create an identity value and expect to cycle through data. An example for this might be a queue table, where the data isn’t expected to live permanently but it is helpful to have a monotonically increasing function to determine order (just watch out for those wrap-arounds and you’re fine). An example of reseeding is below:

-- Table name is illustrative.
DBCC CHECKIDENT('dbo.QueueTable', RESEED, 0);

This operation needs to take an LCK_M_SCH_M lock, otherwise known as a schema modification lock. Any transactions which are writing to the table will block your transaction, but so will any readers unless you have Read Committed Snapshot Isolation turned on or the reader is in the READ UNCOMMITTED or SNAPSHOT transaction isolation level.

If you are using RCSI and don’t have extremely long-running transactions, this is an in-and-out operation, so even though there’s a little bit of blocking, it’s minimal.

Adding an Identity Column

Database pre-release: Create new table
Database release: Partition switch, rename tables
Code release
Database post-release: Drop old table

Kenneth Fisher has a great technique for switching out tables to avoid having to write a complicated backfill process. So let’s try it out with limiting downtime in mind.

First, I want to create a table. I’m going to load this with a pretty fair amount of data, too.

CREATE TABLE [dbo].[PredictionTest]
(
    [PredictedQuantitySold] [decimal](8, 3) NOT NULL,
    [ActualQuantitySold] [decimal](8, 3) NULL
);

INSERT INTO dbo.PredictionTest
(
	PredictedQuantitySold,
	ActualQuantitySold
)
SELECT TOP(1000000)
	0,
	NULL
FROM sys.all_columns ac1
	CROSS JOIN sys.all_columns ac2;

Now we have a million rows and nowhere to call home. Let’s create a new table with an identity column and the same schema otherwise (including indexes, key constraints, and everything else). For more notes on this, check out Kenneth’s other blog post on changing identity column increments.

Here is our create statement for the pre-release phase:

IF (OBJECT_ID('dbo.PredictionTestNew') IS NULL)
BEGIN
	CREATE TABLE [dbo].[PredictionTestNew]
	(
		[PredictedQuantitySold] [DECIMAL](8, 3) NOT NULL,
		[ActualQuantitySold] [DECIMAL](8, 3) NULL,
		[PredictionID] BIGINT IDENTITY(5000000,1) NOT NULL
	);
END

Now, if you’re eagle-eyed, you noticed a little bit of a complication I added: PredictionID did not exist in the old table but it does in the new. If I simply try to perform a partition switch, I’ll get an error:

Msg 4943, Level 16, State 1, Line 42
ALTER TABLE SWITCH statement failed because table ‘tempdb.dbo.PredictionTest’ has 2 columns and table ‘tempdb.dbo.PredictionTestNew’ has 3 columns.

This means that I need to add the PredictionID column to the new table before performing this swap. Because I have to add a column at the end, our identity column needs to be the last one—columns between the two tables must be in the same order. Here is the script to perform the partition swap:

IF NOT EXISTS
(
	SELECT 1
	FROM sys.columns sc
	WHERE
		sc.name = N'PredictionID'
		AND sc.object_id = OBJECT_ID('dbo.PredictionTest')
)
BEGIN
	ALTER TABLE dbo.PredictionTest ADD PredictionID BIGINT NOT NULL DEFAULT(0);
END

ALTER TABLE dbo.PredictionTest SWITCH TO dbo.PredictionTestNew;

EXEC sp_rename 'dbo.PredictionTest', 'PredictionTest_DELETEMENOW';
EXEC sp_rename 'dbo.PredictionTestNew', 'PredictionTest';

We are doing four things here:

  1. Adding PredictionID to the existing table and filling it with 0 values so that we have something.
  2. Switching our data from PredictionTest over to PredictionTestNew.
  3. Renaming PredictionTest to PredictionTest_DELETEMENOW.
  4. Renaming PredictionTestNew to PredictionTest.

All of these are metadata operations and we will need to take a schema modification lock. All told, if there’s no upstream blocking, this is done in milliseconds.

In the post-release process, drop PredictionTest_DELETEMENOW. There won’t be any rows in this table.

Here are a few notes as I wrap up this section:

  • Your existing rows will not get identity values; only new rows will take advantage of the identity attribute. This means that if you don’t want a million values with a prediction ID of 0, you’ll need to fix that yourself with a backfill process.
  • This also works to remove an identity integer, such as if you’re migrating from identity values to sequences.
  • This also also works if you’re changing the increment value, so you might go from jumping 1 each insert to 2 each insert because you really hate odd numbers.
  • If you want to add a primary key constraint as well, you’re probably better off taking the backfill route, which we saw in the post on table modifications.
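As a rough sketch of the backfill mentioned above (my own illustration, not part of the partition-switch script), you could number the zero-valued rows and assign each a distinct PredictionID. This assumes your new identity values start well above the count of existing rows, so backfilled IDs will not collide with freshly inserted ones; on a large table you would want to batch this rather than run one big update:

```sql
-- Hypothetical backfill: give existing rows distinct PredictionID values.
-- Assumes identity values for new rows begin above the existing row count.
WITH RowsToFix AS
(
	SELECT
		PredictionID,
		ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
	FROM dbo.PredictionTest
	WHERE PredictionID = 0
)
UPDATE RowsToFix
SET PredictionID = rn;
```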

Creating a Sequence

  • Database pre-release: Create sequence
  • Database release: Alter tables
  • Code release: (nothing)
  • Database post-release: (nothing)

Creating a sequence is pretty easy:

	IF NOT EXISTS
	(
		SELECT 1 FROM sys.sequences s
		WHERE s.name = N'PredictionTestSequence'
	)
	CREATE SEQUENCE [dbo].[PredictionTestSequence] AS BIGINT START WITH 1;

Once we have a sequence, we can use it in stored procedures to populate something like PredictionID (supposing we don’t use an identity integer). We might add something like this to an insert procedure for the prediction test table:

INSERT INTO dbo.PredictionTest
(SomeVal, SomeOtherVal, PredictionID)  -- column names illustrative
VALUES
(3, 2, NEXT VALUE FOR dbo.PredictionTestSequence);

Application code would not care how PredictionID gets populated, so all of our changes are subsumed into stored procedure changes during the database release.
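A minimal sketch of such an insert procedure might look like the following (the procedure name and non-key columns are hypothetical, standing in for whatever the real table holds):

```sql
CREATE OR ALTER PROCEDURE dbo.PredictionTest_Insert
	@SomeVal INT,
	@SomeOtherVal INT
AS
BEGIN
	SET NOCOUNT ON;

	-- The sequence, not the caller, supplies PredictionID.
	INSERT INTO dbo.PredictionTest
	(SomeVal, SomeOtherVal, PredictionID)
	VALUES
	(@SomeVal, @SomeOtherVal, NEXT VALUE FOR dbo.PredictionTestSequence);
END;
```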

Restarting a Sequence

  • Database pre-release: Alter sequence
  • Database release: Alter procedures to reference sequence
  • Code release: (nothing)
  • Database post-release: (nothing)

Restarting a sequence is as simple as running an ALTER SEQUENCE command. I include it separately because it’s probably the most common change and I want to save “alteration” for changing the data type.

ALTER SEQUENCE [dbo].[PredictionTestSequence]
RESTART;

Any processes trying to get the next value for the sequence will require a schema stability lock on the sequence object itself, so there can be some blocking, but only for a moment until it clears up.

One interesting thing: processes selecting the next value for a sequence will not block your sequence restart. Unlike some of the changes we’ve seen in supporting objects (like many constraints), you can have operations in progress pulling values before the restart occurs.

Also, note that you can change the increment, min value, max value, cycling, and cache options the same way.
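For example, a single ALTER SEQUENCE statement can adjust several of those options at once (the values here are purely illustrative):

```sql
ALTER SEQUENCE [dbo].[PredictionTestSequence]
	INCREMENT BY 2
	MINVALUE 1
	MAXVALUE 1000000
	CYCLE
	CACHE 50;
```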

Altering a Sequence Data Type

  • Database pre-release: (nothing)
  • Database release: Alter procedures (use new data type)
  • Code release: (nothing)
  • Database post-release: Drop constraints, drop sequence, create sequence, create constraints

You cannot use the ALTER SEQUENCE syntax to change the data type for a sequence. Microsoft’s recommendation is to drop and re-create the sequence, so let’s see how that works for us.

If all you have is stored procedures calling the sequence, then the first step would be to update your stored procedures to use the new data type. For example, suppose you have a sequence of type INT but you’re bumping up to BIGINT (probably the most common change by far). Change your variables which read the data type from INT to BIGINT and you’ll be fine. If your tables are defined as INTs, you’ll want to review my notes on changing a column’s data type and possibly reflect upon how that job as a street sweeper would be really nice right now. Street sweepers have zero stress. What’s the worst thing that happens to you when you’re driving a street sweeper? You hit the curb. Big whoop; the curb gets more damage than your vehicle. And if some yokel’s parked in the middle of the street on street sweeping day, you do your best to try to avoid the car but if they can’t follow the signs, what’s it to you?

Now that I’ve gotten the less hardcore types thinking about new positions in street maintenance, let’s suppose we have a table called SomeTbl which uses a sequence for its default like so:

	CREATE TABLE dbo.SomeTbl
	(Id BIGINT DEFAULT(NEXT VALUE FOR dbo.PredictionTestSequence));

What happens when I try to drop and create the sequence as an INT?

Msg 3729, Level 16, State 1, Line 85
Cannot DROP SEQUENCE 'dbo.PredictionTestSequence' because it is being referenced by object 'DF__SomeTbl__Id__68487DD7'.

If you guessed error, you win the prize, which is to fix the problem. The fix is that we have to drop all of those default constraints which directly reference sequences. But if you have a default covering some null data in an otherwise non-nullable column, I leave that as an exercise to you, dear reader.

Otherwise, the steps in a transaction are:

  1. Drop any default constraints referencing the sequence.
  2. Drop the existing sequence.
  3. Define the sequence as a new type.
  4. Re-add default constraints.
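Those steps might look like the following sketch, using the SomeTbl example from above. The auto-generated constraint name will differ in your database, and in practice you would set START WITH to the old sequence's current value plus one rather than 1:

```sql
BEGIN TRANSACTION

-- 1. Drop the default constraint which references the sequence.
ALTER TABLE dbo.SomeTbl DROP CONSTRAINT [DF__SomeTbl__Id__68487DD7];

-- 2. Drop the existing sequence.
DROP SEQUENCE dbo.PredictionTestSequence;

-- 3. Re-create the sequence with the new data type.
CREATE SEQUENCE dbo.PredictionTestSequence AS BIGINT START WITH 1;

-- 4. Re-add the default constraint, this time with an explicit name.
ALTER TABLE dbo.SomeTbl ADD CONSTRAINT [DF_SomeTbl_Id]
	DEFAULT(NEXT VALUE FOR dbo.PredictionTestSequence) FOR Id;

COMMIT TRANSACTION
```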

Let me give you an alternative, though: just create a new sequence and modify any calling code to use it instead. Make life easy.


In this post, we took a look at identity columns and sequences. There is some special behavior here for both which we can exploit for low-downtime environments.

Course Reviews: Getting Started With HDP

This is a review of Thomas Henson’s Pluralsight course entitled Getting Started with Hortonworks Data Platform.

This course felt a little light. Its primary focus is using Ambari to install elements of the Hortonworks Data Platform from scratch. Having done this a few times, though only for local environments, I found it useful to compare Thomas’s work against what I’ve done.

But this section leaves out a lot of pain that I’ve had to endure, in part due to admittedly odd circumstances: I have had to build everything offline so that I can take my Hadoop cluster on the go for presentations without needing Internet access; otherwise, when services start up, they try to phone home to update. There’s probably something that I’ve missed in my research which fixes this, but I haven’t found it yet.

I don’t fault Thomas for not showing local repository installation, but there are some things I would have preferred to see a bit more coverage on. I’ve run into plenty of weird errors and warnings which lead to installation failure, and we see almost none of that in the course. We only get the best-case scenario, but I’m not sure how common that scenario is. We also don’t get any detail on Kerberos or cluster security other than setting passwords. I know Kerberos is a topic in his follow-up course, so it might have been out of scope. Imagining myself as a sysadmin new to Hadoop, that’s something I’d be really interested in knowing more detail about.

On the plus side, there were some things I learned during installation, like how much of a resource hog Accumulo can be.

The second half of the course was around administration topics: rack awareness, rebalancing data, configuration, and alerting within Ambari and at the command line. These are good themes and I liked the coverage of rack awareness in particular. Configuration was a bit on the short side, though: we saw how to alter configurations, but nothing really on what to configure. There’s a huge surface area here, but again, if I’m a junior sysadmin or someone new to Hadoop, one of my first configuration questions is, “What do I need to tune out of the box and how do I understand the consequences of changing settings?” Even picking one or two common services like Hive, HDFS, or Spark and going through some of the important settings would help a lot, especially if the logic applies to other services too.

Overall, this course feels too short. I don’t want some 8-hour biopic that I’ll never finish, but at 2 hours, it gives me just enough to get stuck someplace. Thomas does a good job presenting and the material is clear, but a judicious 30 minutes of extra content would have made this that much better. Perhaps I’ll be more content after seeing his follow-up course.