Winning at Pong via Reinforcement Learning

I finally got around to trying out a reinforcement learning exercise this weekend in an attempt to learn about the technique. One of the most interesting blog posts I've read is Andrej Karpathy's post on using reinforcement learning to play Pong on the Atari 2600. In it, Andrej uses the Gym package in Python to play the game.

This won't be a post diving into the details of how reinforcement learning works; Andrej does that far better than I possibly could, so read his post. Instead, the purpose of this post is to provide a minor update to Andrej's code to switch it from Python 2 to Python 3. In doing this, I went with the most convenient answer over a potentially better solution (e.g., switching xrange() to range() rather than re-working the code), but it does work. I also bumped up the learning rate a little to speed up training.

The code is available as a GitHub Gist, which I’ve reproduced below.

import numpy as np
import pickle
import gym

# hyperparameters
H = 200 # number of hidden layer neurons
batch_size = 10 # after how many episodes do we do a parameter update?
learning_rate = 3e-4
gamma = 0.99 # discount factor for reward
decay_rate = 0.99 # decay factor for RMSProp leaky sum of grad^2
resume = False # resume from prior checkpoint?
render = False

# model initialization
D = 80 * 80 # input dimensionality: 80x80 grid
if resume:
  model = pickle.load(open('save.p', 'rb'))
else:
  model = {}
  model['W1'] = np.random.randn(H,D) / np.sqrt(D) # "Xavier" initialization
  model['W2'] = np.random.randn(H) / np.sqrt(H)

grad_buffer = { k : np.zeros_like(v) for k,v in model.items() } # update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.items() } # rmsprop memory

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x)) # sigmoid "squashing" function to interval [0,1]

def prepro(I):
  """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
  I = I[35:195] # crop
  I = I[::2, ::2, 0] # downsample by a factor of 2
  I[I == 144] = 0 # erase background (background type 1)
  I[I == 109] = 0 # erase background (background type 2)
  I[I != 0] = 1 # everything else (paddles, ball) just set to 1
  return I.astype(np.float64).ravel()

def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(range(0, r.size)):
    if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (specific to Pong!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add
  return discounted_r

def policy_forward(x):
  h =['W1'], x)
  h[h<0] = 0 # ReLU nonlinearity
  logp =['W2'], h)
  p = sigmoid(logp)
  return p, h # return probability of taking action 2, as well as hidden state

def policy_backward(eph, epdlogp):
  """ backward pass. (eph is an array of intermediate hidden states) """
  dW2 =, epdlogp).ravel()
  dh = np.outer(epdlogp, model['W2'])
  dh[eph <= 0] = 0 # backprop ReLU
  dW1 =, epx) # epx is the stacked episode input, set below as a module-level variable
  return {'W1':dW1, 'W2':dW2}

env = gym.make("Pong-v0")
observation = env.reset()
prev_x = None # used in computing the difference frame
xs,hs,dlogps,drs = [],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 0
while True:
  if render: env.render()

  # preprocess the observation, set input to network to be difference image
  cur_x = prepro(observation)
  x = cur_x - prev_x if prev_x is not None else np.zeros(D)
  prev_x = cur_x

  # forward the policy network and sample an action from the returned probability
  aprob, h = policy_forward(x)
  action = 2 if np.random.uniform() < aprob else 3 # roll the dice!

  # record various intermediaries (needed later for backprop)
  xs.append(x) # observation
  hs.append(h) # hidden state
  y = 1 if action == 2 else 0 # a "fake label"
  dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken

  # step the environment and get new measurements
  observation, reward, done, info = env.step(action)
  reward_sum += reward
  drs.append(reward) # record reward (has to be done after we call step() to get the reward for the previous action)

  if done: # an episode finished
    episode_number += 1

    # stack together all inputs, hidden states, action gradients, and rewards for this episode
    epx = np.vstack(xs)
    eph = np.vstack(hs)
    epdlogp = np.vstack(dlogps)
    epr = np.vstack(drs)
    xs,hs,dlogps,drs = [],[],[],[] # reset array memory

    # compute the discounted reward backwards through time
    discounted_epr = discount_rewards(epr)
    # standardize the rewards to be unit normal (helps control the gradient estimator variance)
    discounted_epr -= np.mean(discounted_epr)
    discounted_epr /= np.std(discounted_epr)

    epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.)
    grad = policy_backward(eph, epdlogp)
    for k in model: grad_buffer[k] += grad[k] # accumulate grad over batch

    # perform rmsprop parameter update every batch_size episodes
    if episode_number % batch_size == 0:
      for k,v in model.items():
        g = grad_buffer[k] # gradient
        rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
        model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
        grad_buffer[k] = np.zeros_like(v) # reset batch gradient buffer

    # book-keeping work
    running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
    print('resetting env. episode reward total was %f. running mean: %f' % (reward_sum, running_reward))
    if episode_number % 100 == 0: pickle.dump(model, open('save.p', 'wb'))
    reward_sum = 0
    observation = env.reset() # reset environment
    prev_x = None

  if reward != 0: # Pong has either +1 or -1 reward exactly when the game ends.
    print('ep %d: game finished, reward: %f' % (episode_number, reward) + ('' if reward == -1 else ' !!!!!!'))
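To make the credit-assignment step concrete, here is the discount_rewards logic in isolation, run on a toy reward sequence. This is a pure-Python re-statement of the function above, with made-up rewards:

```python
gamma = 0.99  # discount factor, matching the hyperparameter above

def discount_rewards(r):
    """Walk backwards through a reward list, accumulating discounted returns
    and resetting at each nonzero reward (a game boundary in Pong)."""
    discounted = [0.0] * len(r)
    running_add = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running_add = 0.0  # reset the sum at a game boundary
        running_add = running_add * gamma + r[t]
        discounted[t] = running_add
    return discounted

# Two games: a win (+1) after two idle frames, then a loss (-1) one frame later.
print(discount_rewards([0.0, 0.0, 1.0, 0.0, -1.0]))
# ≈ [0.9801, 0.99, 1.0, -0.99, -1.0]
```

Every frame leading up to a win gets a discounted share of the credit, and every frame leading up to a loss gets a share of the blame; that's the signal which modulates the policy gradient.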

After running the code for a solid weekend, I was able to build an agent which can hold its own against the CPU, though it won't dominate the game. Still, it's nice to see an example of training a computer to perform a reasonably complex task (deflecting a ball into the opponent's goal while preventing the same) when all you provide is a set of possible actions (move the paddle up or down) and an indication of how you did in the prior round.

Space age graphics!

Dataset Drift Monitoring with Azure ML

One of the things I like to say about machine learning models is, “shift happens.” By that, I mean that models lose effectiveness over time due to changes in underlying circumstances. Relationships between variables that used to hold no longer do, and so our model quality degrades. This means that we sometimes need to retrain models.

But there’s a cost to retraining models—that work can be computationally expensive and time-consuming. This concern is particularly salient if you’re in the cloud, as you pay directly for everything there. This means that we don’t want to retrain models unless we need to. But how do we know when we should retrain the model? We can watch for model degradation, but there’s another method: drift detection in your datasets.

The concept of drift detection is simple: we want to figure out if the distribution of the data has changed significantly since we built our model. If it has, that’s a sign that we should potentially look at retraining our model, or at least paying closer attention to model results. Let’s go through an example of drift detection using Azure Machine Learning. Note that as of the time of this blog post (March of 2021), dataset monitors are still in public preview, so things may have changed between now and the time you read this, visitor from the future.

Battlespace Preparation

The first thing we need to do is create a dataset. I’ve created a dataset for my expenses data demo, where I know that the Amount value has changed significantly for some people over the years 2017-2019 versus the starting point of 2011-2016. Here’s my schema:

One perfectly normal dataset.

Note that I needed to set one column to Timestamp. This is necessary for drift detection, as it needs to track results over time. To do that, there are two options available to us: either select a column to be the timestamp like in this case, or if you are using blob storage, you can use a folder structure like yyyy/mm/dd to track that data.

Now it’s time to create our dataset monitor. To do that, select Datasets from the Assets menu, choose Dataset monitors, and then select + Create to create a new dataset monitor.

Monitoring a dataset.

Our next step is to choose a target dataset. I chose the expense-reports dataset that I’ve created. Note that you also get to choose a specific version of the dataset.

Choose the form of your destructor.

After choosing a target dataset, you have to set up a baseline. The baseline defines what normal is for the dataset. That can be either a date range in the target dataset or a separate dataset altogether. In my case, I chose that the date range was part of the target dataset.

Picking a time frame

One thing I do want to complain about here is that in the UI, I don’t have the ability to type in a date. For something where I’m going back to January of 2011, that’s a lot of clicking to do. If I use the code-first approach, I can of course enter a timeframe, but I wanted to try out the UI approach first and it was not ideal.

Anyhow, the next step is to configure monitor settings. Here is where you can name the monitor, select which features you want to track, choose which machines will execute scheduled (or ad hoc) drift checks, and how frequently you want to check data. You can also optionally enter e-mail addresses if you’d like to receive e-mails when drift goes above the set threshold. All in all, this screen is straightforward.

Monitor settings

I decided to experiment with three different dataset monitors. The first one, which you see above, tracks all features weekly. The second, whose screen I did not capture, monitors all features monthly. The third monitors just Amount, but does so monthly.

Monitoring the most important thing: the moolah.

The reason I did this is that I know the dataset well enough to understand that Amount is the volatile variable. I wondered if drift detection would be able to alert me on potential drift for the all-features example, or if I needed to narrow it down to the one thing which does change dramatically.

Backfilling Data

After creating a monitor, navigating to its page shows that it is…kind of bare.

No data = no drift. Problem solved.

We haven’t provided any non-baseline data set, so of course this is empty. Also, the start and end dates run from March 2020 through March 2021, and I know I don’t have any data for that time frame. So let’s backfill some data. To do that, I select Analyze existing data and that shows a fly-out menu. In that menu, I can set the time frame for analysis, as well as my compute target.

Preparing a backfill

Let’s take a moment now and talk about timeframes. When we created the monitors, we set the frequency to one of three values: Daily, Weekly, or Monthly. This has two effects. First, it sets up an automated schedule to run on that period. Second, it assumes that you want that period for backfills as well. So for this weekly expenses monitor, the drift detection process will group data by week and perform an analysis. This becomes important in a moment. But let’s first submit this run.

After submitting a run, we learn that data drift backfills are stored in Azure ML as experiments, so we can collect details on the run there.

It was always Ohio.

Getting Results

After selecting the expenses-monitor-all-weekly-Monitor-Runs experiment, we can select the first run, and that gives us an important hint about how we’re doing it wrong.

Needs more data.

It turns out that we need to have at least 50 non-null data points per group. My group is a week, so I need at least 50 rows for the period 2017-01-01 until 2017-01-08, and then another 50 rows from 2017-01-08 until 2017-01-15, and so on. Well, for my data set, I don’t have 50 rows per week at all. It’s a fairly sparse data set in that regard, and thus the weekly monitor won’t work. It will keep telling me “No Data” because there aren’t enough rows to count any particular week.
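Before burning compute on a backfill, it's cheap to check whether a dataset clears that 50-row bar at a given grain. This is a stdlib-only sketch (weeks_below_threshold is my own hypothetical helper, not part of the Azure ML SDK):

```python
from collections import Counter
from datetime import date, timedelta

def weeks_below_threshold(dates, min_rows=50):
    """Count rows per ISO (year, week) group and flag weeks under the minimum."""
    counts = Counter(d.isocalendar()[:2] for d in dates)
    return {week: n for week, n in counts.items() if n < min_rows}

# Hypothetical sparse data: ten rows, all in the first week of 2017.
sample = [date(2017, 1, 2) + timedelta(days=i % 5) for i in range(10)]
print(weeks_below_threshold(sample))  # {(2017, 1): 10} -- well under 50
```

If this dictionary is non-empty for most weeks, a weekly monitor will just keep reporting “No Data,” and a coarser grain like monthly is the way to go.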

Fortunately, we thought ahead and did this for the month as well. Clever us.

Something seems fishy here.

We can see that the monthly drift does return results and those numbers are quite high, driven mostly by Amount. Note that Amount is responsible for 93% of the total drift, and that our magnitude of drift is way above the arbitrary threshold. We can also see that it was increasing month-over-month for January, February, and March of 2017.

From there, we can see charts on the relative importance of features with respect to drift, as well as the measures Azure Machine Learning uses to track drift.

Yep, that’s your problem right there.

For numeric features, Azure ML uses four components: minimum value, maximum value, mean value, and Wasserstein distance (also called earth-mover’s distance). That is, three point value comparisons and one measure comparing the baseline distribution to the target distribution.
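To build some intuition for the Wasserstein piece: for two equal-sized samples, the 1-D earth-mover's distance reduces to the mean absolute difference between the sorted samples. This is a sketch of the metric itself, not Azure ML's implementation:

```python
def wasserstein_1d(xs, ys):
    """1-D earth-mover's distance for equal-sized samples:
    the mean absolute gap between the sorted values."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

baseline = [10, 12, 11, 13, 12]  # stand-in for 2011-2016 Amount values
target = [40, 42, 41, 43, 42]    # same shape, shifted up by 30
print(wasserstein_1d(baseline, target))  # 30.0
```

Because the whole distribution shifted upward, as Amount did in my data, the distance comes out as exactly the size of the shift.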

For categorical features, Azure ML uses the number of unique categories and the Euclidean distance. They describe it as:

Computed for categorical columns. Euclidean distance is computed on two vectors, generated from empirical distribution of the same categorical column from two datasets. 0 indicates there is no difference in the empirical distributions. The more it deviates from 0, the more this column has drifted. Trends can be observed from a time series plot of this metric and can be helpful in uncovering a drifting feature.
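With the same caveat (a sketch of the idea, not the product's code), that boils down to building an empirical frequency vector per dataset over the union of categories and taking the straight-line distance between the two vectors:

```python
import math
from collections import Counter

def categorical_drift(baseline, target):
    """Euclidean distance between the empirical category distributions."""
    categories = sorted(set(baseline) | set(target))
    p = [Counter(baseline)[c] / len(baseline) for c in categories]
    q = [Counter(target)[c] / len(target) for c in categories]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(categorical_drift(['a', 'a', 'b', 'b'], ['a', 'a', 'b', 'b']))  # 0.0
print(categorical_drift(['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']))  # ≈ 1.414
```

Identical distributions score 0; completely disjoint ones max out at the square root of 2, so a time series of this number trending away from 0 is the tell.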

As we can see here, Amount drives our change and the others stay pretty well the same over time. I’d next like to run a backfill on the rest of my data points, but I don’t want to spend all day click-click-clicking in the UI. Fortunately, there’s a code-first notebook experience.

Code-First Backfills

In order to perform a backfill, we need only a few lines of Python code.

from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector
from datetime import datetime

ws = Workspace.from_config()

monitor = DataDriftDetector.get_by_name(ws, 'expenses-monitor-all-monthly')

backfill1 = monitor.backfill(datetime(2017, 4, 1), datetime(2017, 6, 30))
backfill1 = monitor.backfill(datetime(2017, 7, 1), datetime(2017, 9, 30))
# etc. etc.  Or, you know, create a loop.
backfill1 = monitor.backfill(datetime(2019, 10, 1), datetime(2019, 12, 31))
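For the record, the loop version is only slightly longer. This is a sketch: quarter_ranges is my own helper, and the commented-out lines assume the monitor object retrieved above.

```python
from datetime import datetime

def quarter_ranges(start_year, end_year):
    """Yield (start, end) datetimes for each calendar quarter in the span."""
    for year in range(start_year, end_year + 1):
        for sm, em, ed in [(1, 3, 31), (4, 6, 30), (7, 9, 30), (10, 12, 31)]:
            yield datetime(year, sm, 1), datetime(year, em, ed)

ranges = [r for r in quarter_ranges(2017, 2019) if r[0] >= datetime(2017, 4, 1)]
print(len(ranges))  # 11 quarters, 2017 Q2 through 2019 Q4

# With the DataDriftDetector in hand, queue one backfill run per quarter:
# for start, end in ranges:
#     monitor.backfill(start, end)
```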

Each of the calls to monitor.backfill() will queue up a run of the relevant experiment, so the call will finish within a couple of seconds, but that doesn’t mean your backfill has completed.

Added to queue; mission accomplished.

In my simple scenario, each 3-month period took about 3 minutes to run. Obviously, this will depend greatly on compute power, number of rows of data, and number of features to compare.

Many thumbs were twiddled in the making of this blog post.

Now that everything is complete, we can take a look at drift over the course of our several-year dataset.

Just a little bit of drift here.

In the Future

So far, we’ve only looked at backfills. The real benefit of data drift analysis, however, is that you can use it to monitor data going forward based on the schedule you’ve set. Then, if the monitor catches drift-related issues, it can send you an e-mail and alert you to this change in data.

Can you believe I’ve made it all the way through this blog post and haven’t once made any Fast & Furious references? Yeah, me neither.


Dataset drift monitoring with Azure Machine Learning is really simple, to the point where you can set it up within minutes and have reasonable data within hours. I think it’d be hard to find a simpler method to perform this sort of analysis.

Azure ML: Reading from and Writing to Azure SQL DB

Not too long ago, I worked through an interesting issue with Azure Machine Learning. The question was, what’s the best way to read from Azure SQL Database, perform model processing, and then write results out to Azure SQL Database? Oh, by the way, I want to use a service principal rather than SQL authentication. Here’s what I’ve got.

Set up a Service Principal

I’m going to assume that you already have Azure SQL Database and the Azure Machine Learning workspace set up. In the Azure search bar, search for App Registrations and then select + New registration.

Register a new application.

Name the account whatever you’d like. I called mine AzureMLDatabaseAccess to keep its purpose clear. On the Overview page, you will see some important information:

Your details probably will not be blurred out. If they are, clean your glasses and try again.

You’ll want to copy the Application ID and the Directory ID. And note that each one of these has a name in parentheses. As we all know, two extremely common names are better than one extremely common name.

After you copy those down, navigate to the Certificates & secrets tab on the Manage menu and select + New client secret.

What’s the answer? It’s a secret.

Name your client secret. I happened to name mine DatabaseConnection. This generates a Value which you can use as a password. Copy this down as well, as you’ll need it later and this is your one chance to see it. Unless you delete this secret and create a new one, I guess.

Tie the Service Principal to Azure SQL Database

Now we need to make this app registration mean something. To do that, we need to connect to Azure SQL Database and tie the app registration to our database. We can do that by running a code snippet like the following:


Where naturally, AzureMLDatabaseAccess is the name of your app registration. Then, you can grant whatever SQL Server rights you need, such as adding this app registration to specific roles, all the way up to granting it db_owner because what’s the worst that could happen, amirite?

Oh, by the way, make sure you’re connected to the database using Azure Active Directory. Otherwise, you’ll get an error like this:

Well, that didn’t work out as I wanted.

If you’re like me and you don’t have an Azure Active Directory tenant set up, you can connect to the SQL server (and not the SQL database), navigate to Active Directory admin in the Settings menu, select Set admin, and choose your app registration. Be careful with this, however—that app registration can now do anything on the server, so don’t go sharing details all willy-nilly.

Usually it’s a good idea to avoid doing much of anything all willy-nilly.

Save that and you’re golden. Now let’s go to Azure Machine Learning.

Configure the Azure Machine Learning Datastore

Navigate to Azure Machine Learning Studio and select your workspace. Then, in the Manage menu, select Datastores. Select + New datastore to create a new datastore.

Create a new datastore.

I’m going to call this datastore expense-reports because that’s the name of my database. I’ll choose Azure SQL database as the Datastore type and select my subscription and database. Change the Authentication type to Service principal and you’ll need to enter a Tenant ID, Client ID, and Client secret. Those are the three things I just happened to have you write down. It’s great how we sometimes luck out that way. With all of that done, we can create the datastore and move on.

One filled-out datastore later.

The good news is that we have a datastore. The bad news is that we have no idea if it worked. So let’s figure that out. Navigate to Datasets in the Assets menu and choose + Create dataset and then From datastore.

Create a new dataset.

I’m going to call my dataset expense-reports because that’s what it is. Keep the Dataset type as Tabular and move to the next section.

In this section, I’ll select Previously created datastore and pick the expense_reports datastore. Then, choose Select datastore. As a quick note, I want to make sure that the Skip data validation box is unchecked. We absolutely want to validate the data, as we’re using this test to make sure that our app registration is set up correctly.

I can then enter my SQL query and select Next to continue.

This definitely looks like a query.

If your security setup is correct, you’ll be able to continue. Otherwise, you’ll see an error like the following:

This message is simultaneously verbose and empty.

The real error message is in the More details link. In this case, my error code was 40615. Check out the guide for troubleshooting connectivity errors in Azure SQL Database and Azure SQL Managed Instance and 40615 is a firewall issue. This one’s pretty easy to fix. Return to the SQL server and navigate to Firewalls and virtual networks in the Security menu. The easy button here is to allow Azure services and resources to access the server. The hard button is to figure out the relevant IP addresses that Azure Machine Learning will use and add those. I voted for easy.

Let those Azure services roam free.

After saving and updating the firewall rule, give it a few minutes to propagate and try again. If you paid your Azure bill and included the bribes… er, special payments, you should get your dataset preview.

“Special payments” successful.

Now that we have this in place, let’s test it all out.

Building a Notebook

For this last part, we’ll want to have a table in Azure SQL Database to store predictions. Here’s the table creation statement:

CREATE TABLE [dbo].[ExpenseReportPrediction](
    [ExpenseReportPredictionID] [int] IDENTITY(1,1) NOT NULL,
    [EmployeeID] [int] NOT NULL,
    [EmployeeName] [nvarchar](50) NOT NULL,
    [ExpenseCategoryID] [tinyint] NOT NULL,
    [ExpenseCategory] [nvarchar](50) NOT NULL,
    [ExpenseDate] [date] NOT NULL,
    [ExpenseYear] [int] NOT NULL,
    [Amount] [decimal](5, 2) NOT NULL,
    [PredictedAmount] [decimal](5, 2) NOT NULL
);

ALTER TABLE [dbo].[ExpenseReportPrediction] ADD CONSTRAINT [PK_ExpenseReportPrediction] PRIMARY KEY CLUSTERED
(
    [ExpenseReportPredictionID] ASC
);

The notebook which reads from the dataset, generates predictions, and writes them back to Azure SQL Database is available as a GitHub Gist (sqldb-demo.ipynb).

And here’s a picture of the output:

We have predictions.


In this post, we learned a few things.

  • It is possible to read from Azure SQL Database in Azure Machine Learning. It’s pretty easy, even.
  • It is not easy at all to write to Azure SQL Database from Azure Machine Learning.
  • Service Principals are pretty cool.

I’m Hiring: the Quest for an Intern

Catallaxy Services is looking for a paid intern. By which I mean that I am looking for an intern, but it sounds more official when I take on the corporate guise. The short version of the skills I am looking for are as follows:

  • Video editing
  • Graphic design
  • Slide deck design
  • Fluent or native English language skills
  • Willingness to learn about data platform technologies

Catallaxy Services, LLC provides training and consulting services in the data platform space. 2020 was a busy year and 2021 is shaping up to be even busier. In order to expand and grow, I am looking for an intern with visual design skills to perform approximately 20 hours of work per week. The primary focus of this position will be to assist me with content design and development, editing videos, and improving the quality of existing digital assets.

If this sounds good to you (or someone you know), please reach out to me at and we can talk about the position in more detail. Or you can apply via LinkedIn.

Official Rankings: Date and Time Data Types in SQL Server

Over on Curated SQL today, I linked to Deborah Melkin’s review of the different date and time data types available in SQL Server.

As I was reading that, I decided that it was time to create an Official Ranking. So here it is for the date and time data types:

  1. DATETIME2, especially DATETIME2(0). If you’re using UTC everywhere, this is a no-brainer best data type. You get to choose whether you want your DATETIME2 to be smaller than DATETIME or more precise than DATETIME.
  2. DATE. A large percentage of the time, I just need the date and don’t care about the time. This three-byte beauty gives me that, but with zero risk that I miss a record because instead of 2020-01-01 00:00:00 they enter something that rounds to 2020-01-01 00:00:00.003.
  3. DATETIMEOFFSET. It’s a hefty data type, the largest in the bunch. But if you aren’t dealing with UTC everywhere, it does something no other date or time data type can: it keeps track of your time zone. Therefore, if you for some reason hate storing times in UTC, flip this with DATETIME2 and use it everywhere. Note that you also need to use the AT TIME ZONE syntax to specify what the time zone is or have your user correctly specify the time zone by passing in a pre-created DATETIMEOFFSET. It’s like getting a puppy: you wanted it, now you deal with the extra work taking care of it.
  4. TIME. This is where the utility curve really starts to fall off. I rarely use the TIME data type. However, I do appreciate that it’s just 3 bytes and can be useful for process scheduling.
  5. DATETIME. It’s really not that great of a data type. If you care about size, three of the four have it beat. If you care about precision, three of the four (assuming you bump up the precision on DATETIME2) have it beat. If you care about not messing things up, three of the four have it beat (DATE and TIME if you use one but not both elements, and DATETIMEOFFSET if you don’t want to mess up time zones).
  6. SMALLDATETIME. If you’re really that desperate about two bytes, just use DATETIME2. This was a lot more useful prior to SQL Server 2008, but we can say the same thing about Kurt Warner. That doesn’t make me want to plug him into my fantasy team today.

Apollo 13: Engineering Lessons Under Crisis

This video is a bit different from my normal fare, but by about halfway through the film I knew I wanted to do this.

Links of Note


Hey, everybody. This video is going to be a little different from my normal fare. I recently got the opportunity to re-watch Apollo 13, a movie I enjoyed growing up but have not seen in more than two decades. Even as a kid, I loved the engineering sequences, in which Mission Control needed to solve unexpected and extremely challenging problems to bring back home Jim Lovell, Fred Haise, and Jack Swigert. In this video, I’m going to delve into detail on those challenges and important take-aways for engineers in any discipline, like, say, software development. By the way, when I use the term “engineer” in this video, I’m purposefully including all kinds of roles, from engineers proper to technicians and operators. The exact job title here isn’t important; it’s the skill set and mentality which matters the most. Anyhow, sit back and enjoy:


Before I get started in earnest, I want to make note that this video is based solely on the movie Apollo 13, a dramatization of Jim Lovell’s book, Lost Moon: the Perilous Voyage of Apollo 13. This is not a recounting of actual events as they happened–I understand there were areas where the film wasn’t completely accurate to reality and they made up a couple of bits to ratchet up the suspense, but my plan is to appreciate the film as a film rather than as a strict reflection of actual events. That said, it’s still a lot closer to reality than most such movies.

As another aside, something I really appreciated was just how good the acting was in this movie. It struck me throughout the film that they were doing a great job of the adage “Show, don’t tell.” For example, we learn a bit about the personality of Jim Lovell–played by Tom Hanks–as a former test pilot in the opening sequence, where we see him zooming along the highway in Houston, starting with a forward-facing shot of a car zipping through traffic and switching to a view of Lovell in his shiny, red Corvette. (By the way, I said stick to the film, but in real life, Lovell’s Corvette was blue. How dare you, Hollywood.) Anyway, he’s a straight-laced guy who lives fast, and we get that information simply from this first scene, comparing Lovell’s bearing and appearance with his need for speed. This fits the early astronaut profile to a tee.

Another instance of this is later on in the film, when you see a television interview–taped sometime before the mission–of Lovell recounting a story as a combat pilot in which he is, at night, with no homing signal or radar, trying to find an aircraft carrier running dark. Then the electronics short out on his plane, leaving him completely in the dark, but instead of panicking he keeps his head straight, maintains control, and ends up finding the carrier safely due to a bit of luck and a trail of phosphorescent algae. This is playing on the television for us to get a measure of the man and I appreciate that they do it this way rather than having Lovell tell his fellow astronauts “Hey, I survived finding an aircraft carrier with no running or interior lights in the middle of a dark ocean, so we can do this!” He acts calm and collected by showing his training, and he doesn’t need to tell us any of this.

Speaking of the other astronauts, in early sequences in the film, we see Lovell, Ken Mattingly–played by Gary Sinise–and Fred Haise–played by Bill Paxton–become very close in training, to the point where they can interpret each other’s moves and have implicit trust. When Jack Swigert, played by Kevin Bacon, has to swap in for Mattingly at the 11th hour, there is a bit of conflict. We can see an example of that conflict in the way the trio address each other–Lovell and Mattingly call Haise Freddo, but Swigert, being the outsider, refers to him as Fred. But by the climax of the film, we see that conflict resolved in a natural way, and we see the camaraderie between the men as Swigert too refers to Haise as Freddo. It’s these little character touches, as well as exactly the right facial expression or the right tone of a word, which tell us so much more than what the actors actually say. I wanted to get that in early because the acting and characterization really hold up and they’re worth discussing even in a video about Mission Control.

So now let’s get to the main event.

My first point is:


Conflict happens, but interpersonal conflict shouldn’t drive animosities. In a healthy team, people should have the attitude that we’re on the same side here, trying to solve the same problems. But we do see conflict in ideas, in part because people weight their preferences differently and have different sets of information available to them. In the film, one of the first Mission Control crisis scenes has a pair of engineers arguing over whether to turn the Apollo 13 command module around or try to slingshot around the moon. Both men make good points: the engineer wanting to turn the ship around notes that it would be the fastest way to get the crew back to Earth, and that’s critical because they’ve already lost a considerable amount of oxygen. The opposing side points out that the service module engine is damaged and might explode if they try to use it, so it’s effectively dead. They’re hurtling toward the moon, so to turn around, they would need to cancel all of that velocity and then build up velocity in the opposite direction, and the vessel doesn’t have enough known-good engine power to do this. But if they continue on their path, they can redirect their existing momentum and use the moon’s gravitational pull to slingshot them back toward Earth. It will take longer to get back, but is much more likely to succeed.

Although tensions run high during the scene, the engineers all understand that this is an important part of decision-making under crisis: you need to understand the options available and be willing to speak up if you see a mistake in progress. Ed Harris’s Gene Krantz also does a good job here by allowing the squabbling but shutting it down when it stops being productive. He makes sure that the good engineers are able to voice their concerns and lay out all of the necessary information–even if they don’t do a perfect job of remaining calm and thoughtful. As a group, they are able to think through the problems, and as the flight director, Krantz’s job is to make the call.

This works because:

Expertise without The Expert

Something I appreciated in the film is that we didn’t have a singular, brilliant mind answering all of the problems. Because that’s not how it works in good organizations. You have different groups of people with specializations in different areas and the capability to solve different types of problems. Mission Control had different groups of engineers solving problems like CO2 scrubbing, turning back on the flight computer, and coming up with a flight path that will let the command module land in the ocean without burning up in atmosphere or skipping past the Earth like a rock on a pond. This wasn’t Gene Kranz sitting in his lair, meticulously plotting and having scribes and underlings carry things out; this was a big effort from a lot of people quite capable of solving problems.

What’s interesting in the film is that Ken Mattingly is kind of set up in the “supergenius” role due to his tremendous work ethic and deep knowledge of the command module. But even in the sequences where Mattingly works to get the flight computer power-on sequence correct, he’s a member of a team. He’s not the one who realized that there would be power problems in the first place, and although he’s the one who we see come up with the final process, he’s not doing it alone. Furthermore, they don’t have Ken fix the flight computer, figure out burn time, and scrub CO2–he’s a smart man performing a heroic effort in doing his part to bring three men home alive. That he’s not a sole supergenius takes nothing away from that, but it does give us a critical insight that engineering under crisis is a team sport.

And while I’m talking about Mattingly, I do want to point out that although he’s an astronaut, he’s tied in with the engineers too. At that point in its history, NASA selected primarily for a few critical traits: intelligence, composure, and creativity under stress. This is part of why they loved choosing test pilots, as that’s a great demonstration of all three. The training program was rigorous and included intimate knowledge of the critical systems, as once you’re out in space, you’re on your own. Mission Control may be available most of the time–assuming no radio outages–but you have to know how to operate the different systems, how to troubleshoot problems, and how to make things work in stressful conditions. The film shows this during the sequences where Mattingly and Swigert dock with and retrieve the Lunar Module, or LM (pronounced LEM). Mission Control simulates different failures and forces the command module pilot to think on his feet, diagnosing the problem, assessing the situation, and performing the right response. To operate at that level of proficiency, an astronaut needs almost the same understanding of the systems that the engineers have. You may not have asked Ken Mattingly to design and fabricate a command module, but he has to know the hows and whys behind its design because it’s literally his life on the line.

Let me give you a more prosaic example of how all of this fits into the real world. IT teams are made up of people with a range of specialties: front-end developers, back-end developers, database developers, database administrators, network administrators, systems administrators, architects, and a whole lot more. If you have a Sev-1 crisis and your company’s applications are down, it’s natural to expect these groups of people to work within their specialties to solve problems. Ideally, you don’t have a single The Expert whose galaxy brain is responsible for solving all of the problems. This is because The Expert can be wrong and, no matter how long The Expert has been around in an organization, there are always going to be gaps in that person’s knowledge. In a well-functioning engineering team under crisis, each person understands their piece of the puzzle and knows how to get information. App developers are looking through their logs while database administrators try to figure out if the issue is in their domain, network administrators review packet and firewall logs, and systems administrators check server logs. They synthesize the information they’re processing to help shed light on the problem and develop viable solutions. When there are people with demonstrated expertise–as opposed to being The Expert–they can help narrow down places to look, compare the current issue to prior issues, and make connections that specialists alone might have missed. But even so, that person is still one member of a group.

This is also bleeding over into my next point:

Focus on the Most Important Problem

In one interesting scene, Swigert mentions to Lovell that they’re coming in too shallow and, at this trajectory, will skip right out of the atmosphere. Lovell responds by saying, “There are a thousand things that have to happen in order. We are on number 8; you’re talking about number 692.” In a tough situation, it’s easy to focus on a specific problem to the detriment of coming up with a solution. In a few lines of dialogue, we get the crux of an important point: document that there is an issue, follow up on the issue when it’s the appropriate time, and continue working on the most important thing. Under normal circumstances, it’s easy to prioritize issues based on who’s complaining loudest or most recently. To give you a data platform example, I might notice some slow reports in a log and decide to spend the day speeding those up. It may not have been the single most important thing I could have done at that point in time, but it was a reasonable thing to do as it was causing people pain right then. But in a crisis situation, we have to avoid that temptation and focus on the single most important thing. Yes, those reports are slow. And they might cause us to lose the contract if we don’t fix them. But right now, the database is refusing connections and we need to fix that before we can think about reports.

Tying back to the previous idea of expertise without The Expert, this also allows engineering teams to focus on solving a problem. We have navigation specialists focusing on burn calculations, a subset of engineers trying to figure out how to put a square peg into a round hole, and a group of engineers working with Mattingly to power on a flight computer using fewer than 12 amps (“You can’t run a vacuum cleaner on 12 amps, John”). Kranz and Glynn Lunney, as the two shifts’ flight directors we get to see in the film, need to worry about all of these problems, ensure that teams are working on the most critical things, and coordinate between teams. But they also need to be able to delegate tasks to team leads–they have to trust that those team leads will do the right thing, because the directors don’t have the time or energy to coordinate every effort and solve every problem themselves.

Taking this back to the real world, part of my company’s production emergency process is to designate a coordinator. Typically, that person is someone separate from the engineers working on the problem, as the coordinator needs to communicate and process information effectively and it’s hard to do that while you’re typing furiously and scrambling to remember where that stupid log file gets written to and where you saw error code 3400. That coordinator role is only one portion of what Kranz and Lunney needed to do, but it’s a vital part, especially when a lot of people are working in small groups on different parts of the problem.

The film did an excellent job of portraying my next theme:

Creativity and Pragmatism

Apollo 13 really captured the spirit of great engineers under pressure: creativity in solving problems mixed with a down-to-earth thought process which forces you to grapple with the problem as it is rather than as you’d like it to be.

I think the line which most exemplified the pragmatism side was when Ken Mattingly prepared to enter the command module simulator, demanding that conditions be set the same as what the crew on board were experiencing. “I need a flashlight,” he says. When a tech offers up the flashlight in his hands, Mattingly responds, “That’s not what they have up there. Don’t give me anything they don’t have onboard.” Mattingly limits himself to the conditions and circumstances of the actual command module to ensure that what he does on the ground can be replicated exactly by the crew onboard the Apollo 13 command module as it currently is, not as they would ideally wish it to be.

Meanwhile, the first major scene which really brought the combination of skills together was when Gene Kranz brought together all of his engineers and made a show of throwing away the existing flight plan. All of their effort and planning is out the window because circumstances changed to the point where that plan is untenable. And now it’s time to improvise a new mission. We are in a crisis, so our expectations of normal behavior have gone in the trash along with that flight plan.

But we see another example of the brilliance that can come from creative, pragmatic people under pressure. In one of the most iconic sequences in the film, NASA engineers need to fit a square peg into a round hole. The carbon dioxide scrubbers in the lunar module are cylindrical and designed for a cabin with two people living in it for a fairly short amount of time. The command module CO2 scrubbers are larger, boxier, and completely useless as-is. Because nobody planned on using the lunar module as a lifeboat, there weren’t any spare CO2 scrubbers for the lunar module and without one, the crew would die of carbon dioxide poisoning long before they reached Earth.

Cue the engineers, dumping out everything available to them on a table and one technician saying, “Okay, people, listen up. The people upstairs handed us this one and we gotta come through.” No grousing, no complaints about how this isn’t normal. Instead, it’s a challenge. The other technicians start with the most logical step: let’s get it organized. Oh, and they get to brewing coffee too. I guess that’s the second logical step.

What they come up with is a bit of a mess, but it solves the problem. It is a hack in the most positive sense of the term: a clever and unexpected solution to a problem. You don’t want to make a living building these, but it’s good enough to get three men home.

Taking this a step further, use of the LM itself was a great example of creativity and pragmatism. The lunar module was designed to land on the moon, collect some rocks, and get them back to the command module. Using it as a lifeboat was, as the Grumman rep points out, outside of expected parameters. Even in this crisis, the rep is thinking as a bureaucrat: his team received specifications and requirements, his team met those specifications and requirements, and anything outside of those specifications and requirements he can’t guarantee. They did exactly what they needed to do in order to fulfill the contract and that, to them, was all they could do. This bureaucratic mindset is not a bad thing in normal circumstances–it reduces risk of legal liability by setting clear boundaries on expected behavior and promotes success by ensuring that people are willing to agree on the objectives to reach. But in the crisis, the Grumman rep gets shunted off to the side and the engineers take over. Because that’s what engineers do to bureaucrats when the going gets tough.

Another example of creativity marrying pragmatism involves the LM burn sequence. Lovell and Haise needed to perform a precise, controlled burn, but because the flight computer was turned off, they needed a way to ensure they stayed on track and did not stray too far. Their answer was to find a single, fixed point: Earth. As Lovell states, “If we can keep the earth in the window, fly manually, the co-ax crosshairs right on its terminator, all I have to know is how long do we need to burn the engine.” That sequence displays knowledge, creativity, and a way to make the best out of a given situation.

My third LM example takes us back to Ken Mattingly and the LM power supply. Mattingly is struggling to power on the flight computer with fewer than 12 amps and failing. He hits on a wild idea: the LM batteries still have power remaining, and there is an umbilical which provides power from the command module to the LM as a backup system. If they reverse the flow of power, they can draw from the LM batteries into the command module, getting enough power to get over the hump. Their unexpected lifeboat gives them one final miracle.

Creativity and pragmatism are invaluable assets for engineers in a crisis situation, letting them take stock of how things are and make best use of the circumstances to solve challenging problems. But they also need:

Perseverance

Being creative in a simulated situation or in day-to-day work is different from being able to get the same results under pressure. Perseverance is one of those key traits which can make the difference between success and failure in a crisis.

We get a glimpse of engineers persevering–and another example of show, don’t tell–when we see engineers sleeping in cots in a back room. “Is it A.M. or P.M.?” is a question you don’t generally want to hear from your employees, but sometimes a crisis requires this. Now, if your engineers have to do this regularly, you have a serious organizational problem. But when you are in the middle of a crisis, it’s good to know that you have people willing to push their limits like this. In other words, you want your employees to be willing to do this, but your employees shouldn’t be doing this except in true emergencies. Like three guys you’re trying to bring home from outer space.

Pressure doesn’t have to be huge and external, either. Think of the everyday, little things. Like, say, a projector bulb bursting. By the way, I had one of those happen to me during a training once. They do make a pretty loud bang when they go. Anyhow, what does Gene Kranz do after the first tool he uses fails on him? Move to a backup: the chalkboard. Have backups to your backups and be willing to move on. If they spent twenty minutes waiting for a staff member to find a spare projector bulb, that’s twenty minutes they’re wasting during a crisis, at a point when they need every minute they can get. Fail fast and be ready to abandon tools which are impeding a solution. By the way, I liked Kranz’s response to this: he’s frustrated. Frustration is okay–it’s part of the human condition, especially under stress. But don’t let it prevent you from doing what you need to do, and don’t get tunnel vision on this frustration, as you’ll only compound it without helping the situation any.

Underlying most of my video so far is an implicit cheerful, go-get-’em, “We can do this!” attitude. But you don’t need that attitude all the time as an engineer in a state of crisis. Doubt, like frustration, is part of the human condition, and experiencing doubt in a stressful situation is normal. But again, just like frustration, focusing on the doubt gets nothing accomplished. Even if we doubt we can come up with something, we keep searching for a solution. A great example of this is when a technician doubts Mattingly will be able to turn on the flight computer without exceeding 12 amps, but even so, he keeps at it because he refuses to let Mattingly or the crew of Apollo 13 down.

While we’re talking about topics like frustration and doubt, let’s make something clear:

Engineers Make Mistakes

Just because you’re an engineer–maybe even a smart engineer, maybe even one of the best in the world at what you do–it doesn’t mean you live mistake-free. And it’s surprisingly easy to make big mistakes simply by assuming that what you know is correct and failing to take into consideration changing circumstances.

Let me give you an example of that. During one attempted maneuver, Lovell reports, “We’re all out of whack. I’m trying to pitch down, but we’re yawing to the left. Why can’t I null this out?” Fred responds that she wasn’t designed to fly like this, not with the command module still attached and shifting their center of gravity. Lovell trained in the simulators and knew exactly how the LM is supposed to handle, but none of the simulations assumed that the LM would maneuver while still attached to the command module. As a result, all of those in-built expectations of behavior go right out the window and the crew are forced to compensate and learn anew.

Similarly, when Mission Control provided the crew with burn times to correct a trajectory problem, they made a crucial mistake: one of the flight path specialists tells Kranz, “We’re still shallowing up a bit in the re-entry corridor. It’s almost like they’re underweight.” They quickly realize the problem: Apollo 13 was intended to land on the moon, collect approximately 200 pounds of moon rocks, and bring them back to Earth. Therefore, they calculated the return trajectory based on that weight, but when the mission profile changed drastically, that small discrepancy made a difference and forced the crew to transfer items from the LM to the command module to keep things properly balanced.

A third example involves Haise, who is confused about why the CO2 numbers are rising so quickly, as “I went over those numbers three times.” But what he went over were the expectations for a two-man crew. Jack Swigert’s breathing threw off Fred Haise’s numbers, leading to the crew having less time than expected to solve this problem.

We generally do not need to be perfect in a crisis, and that’s a good thing–even at the best of times, I’m pretty sure nobody is perfect. What is important in these cases is that the engineers and crew recognize that there is a mistake and correct things before they cause too much damage. And one way to limit the damage of mistakes is to have cross-checkers. I really liked the scene in which Lovell is filling in calculations for gimbal conversions and needs a double check of the arithmetic. Instead of trusting himself to get the numbers right, Lovell radios Mission Control. And instead of one person doing the math, you see half a dozen engineers independently working on the problem. Having people who can check that what you are doing is correct is huge, and even more so if they’re able to perform independent evaluations.

As we near the close of this video, I want to talk about one last thing:

Managing Engineers in a Crisis

I really liked the way we saw Gene Kranz and Glynn Lunney manage engineers during the film, especially Kranz. He does an admirable job in performing the role of a manager. Early on, when the different teams are in a back room trying to figure out what to do, Kranz opens the floor and lets engineers confer and share ideas. This is an important part of managing engineers: they are idea people, but to get ideas out of them, they need to know you’re willing to listen and not shut them down immediately. But it’s also the job of a manager to make a decision. Engineers can go around and around on an issue until the cows come home or get knee-deep into the minutiae of a situation, and sometimes, it’s the manager’s job to refocus people. Let your specialists figure out the options, but it’s your job to understand when it’s time to move.

Also, don’t solve a problem you don’t need to. We can spend a lot of time thinking about hypothetical situations and how we might proceed, but that’s a waste of brainpower and time, two critical resources during a crisis. Focus on the things you need to solve. Your team might need to solve more than one problem at a time, but make sure they’re actual problems and that they are the most important problems to solve right then.

And make sure your engineers understand exactly what it is they have to do. If you expect them to sleep in cots at the office for the duration of a crisis, you don’t want them twiddling their thumbs or having to look for work. If you’re a manager of managers, make sure that your leads have a grasp on the problem, understand the current state of affairs, and know what to work on. I’ve been in engineering crises–thankfully none of which were life-or-death situations–and the most frustrating part is sitting there with nothing to do. You can’t go home (and it’s usually pretty late, so you want to go home), but there’s nothing you can actually do at the moment. That’s a failing of management we want to avoid.

To that end, maximize communication. Bias toward over-sharing of problems and solutions rather than under-sharing. If a team sees a problem, it may turn out that it doesn’t actually need a solution and someone can tell them that. Or perhaps a member of another team sees an aspect of the problem which is worse than first anticipated, causing you to re-evaluate the problem altogether. Or maybe it turns out that a third team actually has the solution, but didn’t realize it was needed because they weren’t experiencing the problem.

Let me give you a real-life example of this, though I’m changing a few of the details to maintain appropriate confidences. Customers are calling because their data is not showing up in our system and it’s a critical time of the year where their data absolutely needs to be correct and up-to-date. Application developers are looking through their logs, trying to figure out what’s going on (and sometimes, trying to remember exactly how things work). I, as the database specialist, am looking through log entries in tables, database error logs, and whatever monitoring I can get my hands on to figure out if the database is rejecting data, if there are application errors when trying to write data, or what the deal is. We’re working together on this, but without sysadmins or network engineers to help us out. After some struggle, we engage the sysadmin, who tells us that things look fine on the servers, but noticed a major drop in network utilization over the past few hours, and the network engineer tells us that, oh yeah, there was an issue with a network link at a separate site and our co-lo’s technicians were working on the problem, but it only affected a few servers…specifically, our servers. This piece of information ended up being crucial, but there was a communication gap between the development teams (application and database) and the administration teams (systems and network). Had it been made clear to those administrative teams at the beginning, we might have saved hours of time trying to diagnose something that turned out not even to be something we caused.

That sharing also needs to be bi-directional. As a manager, getting regular reports from engineers is nice, but you have to be able to share syntheses of reports with everyone involved. One team might know the problem but lack the solution, and another team might have the solution but not know that there’s a problem; ensuring that people know what they need to know without a flood of unnecessary information is tricky, but that’s what a great manager has to do.

One thing great managers don’t do is management by hovering. Engineers know this all too well: we’re in a crisis, so you have levels of management standing over your desk as you’re trying to solve the problem. You can tell how big of a crisis it is by how many levels of management are all standing at your desk. They’re standing there, observing, often without providing direct insight or meaningful guidance. Oh, they may provide you guidance, but it’s rare that hoverers give you anything helpful. Hovering is a natural human instinct when you feel like you don’t have anything you can do but need to get a problem solved. To the extent that it signals the importance of the issue, it’s not the worst thing ever, but it does stress out engineers working on the problem. And hovering managers aren’t coordinating, which means all of those people who need information to help solve the problem aren’t getting it because the manager is standing over one person and watching that person type and click stuff.

Instead of hovering, be available. Frankly, even if you have nothing to do at that point in time, don’t hover. Gene Kranz? He doesn’t hover. He’s at his desk, he’s leading meetings, he’s talking to people. He’s available, but he doesn’t need to stand over the engineers working on the CO2 scrubber solution or the flight computer power solution or any other solution–he knows he has capable people who understand the problem and are motivated to solve it, so he’s in his role as collector and disseminator of information and as the person responsible for ensuring that people are doing what they need to do.

Finally, if you are a manager, communicating in a positive but realistic manner is the way to go. Both traits are necessary here: if you are overly negative, people will wonder why they’re even there. If you’re sending the signal that you expect failure, that demotivates the people working on problems–after all, if we’re going to fail anyhow, why should I even try? But being overly cheerful or optimistic sounds Pollyannaish and runs the risk of losing the trust of engineers. Watch Gene Kranz throughout the film and you’ll see he does a great job of straddling that line, including at two critical points in the film:

“Failure is not an option!”


“With all due respect, sir, I believe this is going to be our finest hour.”

These are not the types of phrases you deploy on a day-to-day basis, but if you’re in a crisis, you want to know that the people in charge understand the situation and will move heaven and Earth to make it right. And from the manager’s side, that means making sure you give your engineers the best possible chance of success.

I hope you enjoyed this video. As I mentioned at the beginning, it’s a bit outside the norm for me, but it was a lot of fun to put together. And if you haven’t seen Apollo 13 in a while (or at all!), if you sat through this video, you’re definitely going to enjoy the movie. So until next time, take care.

Q&A: The Curated Data Platform

On Thursday, I presented a session at PASS Summit entitled The Curated Data Platform. You can grab slides and links to additional information on my website. Thank you to everyone who attended the session.

During and after the session, I had a few questions come in from the audience, and I wanted to cover them here.

Cross-Platform Data Strategies

The first question was, “What handles the translation between the relational truth system and the Document system?” The context of the question comes from a discussion about product catalogs, and specifically this slide.

Document databases are great for things like product catalogs, where we meet the following properties:

  • Data has a very high read-to-write ratio.
  • You generally look at one item per page—in this example, one product.
  • The set of data to appear on a page is complex and typically has nested items: a product has attributes (title, price, brand, description) but also collections of sub-items (product image links, current stock in different stores, top reviews, etc.).
  • The data is not mission-critical: if updates are delayed or even occasionally lost, that is acceptable.

But I do like keeping a “golden record” version of the data and my biases push me toward storing that golden record in a relational database. I mentioned two processes in my talk: a regular push on updates and an occasional full sync to true up the document database(s).

And that leads to the question of, how do we do that? There are products from companies like CData and Oracle which can handle this, or you can write your own. If your source is SQL Server, I’d push for a two-phase process:

  1. Enable Change Data Capture on the source database and tables, and have a scheduled task query the latest changes and write them to your document database(s). You can use constructs like FOR JSON PATH in SQL Server to shape the documents directly, or pull in the source data and shape it in your application code.
  2. Periodically (e.g., once an hour, once a day), grab all of the data, shape the documents, and perform a comparison with what’s out there. This will confirm that nothing slips through the cracks for longer than one time period and will keep disparate clusters of document data from drifting apart.
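
To make step 2 a little more concrete, here’s a minimal Python sketch of the true-up comparison. Everything in it is hypothetical–the row and field names, and the assumption that you’ve already pulled the golden records out of SQL Server and the current documents out of your document database (in practice you’d page through both rather than loading everything at once):

```python
def shape_document(row):
    """Shape a flat golden-record row into the nested document form.
    Column and field names here are made up for illustration."""
    return {
        "id": row["product_id"],
        "title": row["title"],
        "price": row["price"],
    }

def true_up(golden_rows, current_docs):
    """Compare shaped golden records against what the document database
    currently holds; return the upserts and deletes needed to reconcile.
    current_docs is a dict of document id -> document."""
    desired = {doc["id"]: doc for doc in map(shape_document, golden_rows)}
    # Anything missing or different in the document database gets upserted.
    upserts = [doc for doc_id, doc in desired.items()
               if current_docs.get(doc_id) != doc]
    # Anything no longer in the golden record gets removed.
    deletes = [doc_id for doc_id in current_docs if doc_id not in desired]
    return upserts, deletes
```

You would then issue the upserts and deletes through your document database’s client library; the comparison logic itself stays the same regardless of which platform sits on the other end.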

Of course, this golden record doesn’t need to be in a relational database—you could store it in a document database and use replication there to push data to different clusters. If you use Cosmos DB, for example, you can replicate to other regions easily.

Document Databases: Scale-Out vs Replication

Another attendee asked about “Document databases and scale-out vs replication.” In retrospect, I think I misinterpreted the question as asked, as I mentioned that scale-out and replication are one and the same: you replicate data between nodes in a cluster to achieve scale-out.

But this time around, I’m going to answer the question, “How do I choose between making my current cluster bigger and replicating out to a new cluster?”

Here are some key considerations:

  • If the issue you are trying to solve is geographical in nature, replicate out to a new cluster closer to your users. In other words, suppose you have a cluster hosted in Virginia. Many of your users are in Japan, so they have to deal with the network latency of pulling data from a Virginia-based data center. If this is the problem, create another document database cluster in Japan and replicate to it from Virginia.
  • If your cluster is in Virginia and is getting hammered hard by relatively local users, scaling out is generally a good option. That is, adding more servers to the existing cluster. Depending on your technology, there will be a maximum number of nodes or a maximum scale-out size available to you, so you’d have to check out those details.
  • If you’re getting close to that maximum scale-out size, it may make sense to replicate to another cluster in the same region and use a load balancer to shift load between the two. I have to be vague here because different technologies have different limits and I’m definitely not an expert on any document database technology’s systems architecture.

Cosmos DB and MongoDB

Another attendee asked, “I have heard that Azure Cosmos DB is built upon an older version of MongoDB – do you know if this is true?”

The answer is no, it’s not. The two platforms are different. I believe where the question comes from is around the MongoDB API for Cosmos DB. For a while, Cosmos DB supported an older version of the MongoDB API, specifically 3.2. That API version was released in December of 2015. New Cosmos DB accounts support the MongoDB 3.6 API, which is still active.

But something I want to point out is that the API is an interface, not an implementation. That Cosmos DB supports a specific MongoDB API version doesn’t mean that the code bases are similar; it only means that you can safely (well, presumably safely) interact with both and expect to get the same results when you perform the same set of API steps with the same inputs.
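
Here’s a small Python sketch of that interface-versus-implementation distinction. The method names loosely echo a MongoDB-style driver, but both backends are invented: two completely different implementations satisfy the same API, so a caller performing the same steps gets the same results from either.

```python
class ListBackend:
    """One implementation: documents stored in a list."""
    def __init__(self):
        self._items = []
    def insert_one(self, doc):
        self._items.append(doc)
    def count_documents(self):
        return len(self._items)

class DictBackend:
    """A totally different implementation: documents keyed by a generated id."""
    def __init__(self):
        self._items = {}
        self._next_id = 0
    def insert_one(self, doc):
        self._items[self._next_id] = doc
        self._next_id += 1
    def count_documents(self):
        return len(self._items)

def exercise(backend):
    """Run the same 'API steps' against any backend that honors the interface."""
    backend.insert_one({"name": "widget"})
    backend.insert_one({"name": "gadget"})
    return backend.count_documents()
```

`exercise(ListBackend())` and `exercise(DictBackend())` return the same answer, even though the code behind the interface shares nothing–which is the situation with Cosmos DB and MongoDB behind the MongoDB API.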

Graph Languages

My last question came from an attendee who mentioned, “I thought GraphQL was the common standard for graph querying.”

The context for this is in my discussion of graph databases, particularly the slide in which I talk about the key issues I have with graph databases. For a bit more background than what I had time to get into during the session, Mala Mahadevan and I have talked about graph databases in a little bit of detail on a couple of occasions, once on the SQL Data Partners Podcast and once on Shop Talk.

As for the question, the comment I had made was that there is no common graph language. We have SQL for relational databases (and other mature data platform technologies) but historically haven’t had a common language for graph platforms, meaning that you have to learn a new language each time you move to a different platform. The Gremlin language is an attempt at creating a single language for graph databases and it’s making enough strides that it may indeed become the standard. But it’s not there yet.

Meanwhile, GraphQL, despite the name, is not a language for graph databases. It’s actually a language for querying APIs. The key idea is that you ask for data from an API and you get back just the data you want. But behind the API, your data can be stored in any sort of data source—or even multiple data sources. In other words, I might expose a product catalog API which hits Cosmos DB, a finance API which hits SQL Server, and a product associations API which hits Neo4j. Those three APIs could all be queried using GraphQL, as it’s up to the API to interpret inputs and return the appropriate outputs.
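
As a sketch of that idea, here’s a bit of Python standing in for the API layer. The SKU, field names, and data stores are all invented: the point is that a GraphQL request names exactly the fields it wants, and the API resolves them from whichever backing stores it likes.

```python
# Hypothetical data; each dict stands in for a different backing store.
product_catalog = {"sku-42": {"title": "Widget", "price": 19.99, "brand": "Acme"}}  # document DB
finance = {"sku-42": {"revenue_ytd": 125000.0}}                                     # relational DB

def resolve_product(sku, fields):
    """Merge data from both 'stores', then return only the requested fields,
    the way a GraphQL API returns exactly what the query asked for."""
    merged = {**product_catalog[sku], **finance[sku]}
    return {field: merged[field] for field in fields}

# A GraphQL query like
#   { product(sku: "sku-42") { title price revenue_ytd } }
# would come back with just those three fields:
result = resolve_product("sku-42", ["title", "price", "revenue_ytd"])
# result == {"title": "Widget", "price": 19.99, "revenue_ytd": 125000.0}
```

Note that `brand` never appears in the result because the query didn’t ask for it–and the caller has no idea (and no need to know) that the fields came from two different data sources.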

Query Store: QDS Toolbox

I wanted to announce the first open source project officially released by ChannelAdvisor: the QDS Toolbox. This is an effort which Pablo Lozano and Efraim Sharon pushed hard internally, and to which several database administrators and database engineers contributed (though I wasn’t one of them).

From the summary page:

This is a collection of tools (comprised of a combination of views, procedures, functions…) developed using the Query Store functionality as a base to facilitate its usage and reports’ generation. These include but are not limited to:

– Implementations of SSMS’ GUI reports that can be invoked using T-SQL code, with added functionalities (parameterization, saving results to tables) so they can be programmatically executed and used to send out mails.
– Quick analysis of a server’s overall activity to identify bottlenecks and points of high pressure on the SQL instance at any given time, both in real time and in the past.
– Cleanup of QDS’ cache with a smaller footprint than the internal one generates, with customization parameters to enable a customizable cleanup (such as removing information regarding dropped objects, or cleaning details of ad-hoc or internal queries executed on the server, such as index maintenance operations).

The biggest of these is the third item. In our environment, Query Store could be a beast when trying to delete old data, and would often be the biggest performance problem on a given server.

In addition, several procedures exist as a way of aggregating data across databases. We have a sharded multi-tenant environment, where we might have 5-15 replicas of a database schema and assign customers to those databases. QDS Toolbox helps aggregate information across these databases so that you don’t need to look at each individually to understand performance problems. The database team has then created reports off of this to improve their understanding of what’s going on.

Check out the QDS Toolbox as a way to clean up data better than the built-in cleanup process and get additional information aggregated in a smart way.

Transaction Modes in SQL Server

In the following video, I take a look at the three most important transaction modes in SQL Server: autocommit, explicit transactions, and implicit transactions. Sorry, batch-scoped transactions, but nobody loves you.

If you’d prefer a textual rendering of the video, here are the pre-demo and post-demo sections, lightly edited for narrative flow.

Setting the Stage: Transactions and Modes

What I want to do in today’s post is to cover the different sorts of transaction modes and get into the debate about whether you should use explicit transactions or rely on auto-committed transactions for data modification in SQL Server. This came from an interesting discussion at work, where some of the more recent database engineers were curious about our company policy around transaction modes and understanding the whys behind it. I didn’t come up with the policy, but my thinking isn’t too far off from the people who did.

But before I get too far off course, let’s briefly lay out some of the basics around transactions.

When you modify data (that is, run a command like INSERT, UPDATE, MERGE, or TRUNCATE) or tables (CREATE, ALTER, DROP, etc.), that operation takes place inside a transaction.  A transaction is, according to Microsoft Docs, a single unit of work.  Everything in a transaction will either succeed as a whole or fail as a whole–you won’t end up with some operations succeeding and others not–it’s really all or nothing.  The importance of this goes back to relational databases having ACID properties, but because that’s a little far afield of where I want to go, I’ll give you a link if you’d like to learn more about the topic, as it helps explain why transactions are useful for relational database developers.
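As a quick sketch of that all-or-nothing behavior, using a hypothetical dbo.Account table:

```sql
BEGIN TRANSACTION;

-- A debit and a credit form a single unit of work.
UPDATE dbo.Account SET Balance = Balance - 50 WHERE AccountID = 1;
UPDATE dbo.Account SET Balance = Balance + 50 WHERE AccountID = 2;

-- Either both updates become permanent here, or a rollback (explicit
-- or from a failure) undoes both; you never keep just one of them.
COMMIT TRANSACTION;
```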

What I do want to get to is that there are three kinds of transactions:  autocommit transactions, explicit transactions, and implicit transactions.  There’s actually a fourth kind, batch-scoped transactions, but that only applies to Multiple Active Result Sets transactions and if you find yourself there, you’ve got bigger problems than deciding how you want to deal with transactions.

In the demo for the video, I show off each of the three transaction modes, including how you enable them, how you work with them, and any important considerations around them.


The easy recommendation is, don’t use implicit transactions.  For SQL Server developers and database administrators, this is unexpected behavior–the default is to use autocommit, so that if you run an INSERT statement by itself, the transaction automatically commits at the end.  If you set implicit transactions on, there is no UI indication that this is on and it becomes really easy to forget to commit a transaction.  I understand that if you come from an Oracle background, where implicit transactions are the norm, it might feel comfortable to enable this, but it becomes really easy to start a transaction, forget to commit or rollback, and leave for lunch, blocking access to a table for a considerable amount of time. And if you’re using Azure Data Studio, it appears that implicit transactions might not even work, so you’d be in a world of hurt if you were relying upon them.  So let’s throw this one away as a recommendation.
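To make that failure mode concrete, here is a minimal sketch (dbo.SomeTable is a hypothetical table):

```sql
-- Enable implicit transactions (again: not recommended).
SET IMPLICIT_TRANSACTIONS ON;

-- This INSERT silently opens a transaction...
INSERT INTO dbo.SomeTable (Val) VALUES (1);

-- ...which stays open, holding locks, until you remember to commit.
SELECT @@TRANCOUNT AS OpenTransactionCount;

COMMIT TRANSACTION;
SET IMPLICIT_TRANSACTIONS OFF;
```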

My recommendation, whenever you have data modification on non-temporary tables, is to use explicit transactions over autocommit.  I have a few reasons for this.

First, consistency.  Sometimes you will need explicit transactions.  For example, if I need to ensure that I delete from table A only if an insert into table B and an update of table C are successful, I want to link those together with an explicit transaction.  That way, either all three operations succeed or none of them succeed.  Given that I need explicit transactions some of the time, I’d rather be in the habit of using them; so to build that habit, I’d prefer to use them for all data modification queries.
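A sketch of that pattern with hypothetical tables, where all three modifications succeed or fail together:

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.TableB (OrderID) VALUES (42);
    UPDATE dbo.TableC SET Processed = 1 WHERE OrderID = 42;
    DELETE FROM dbo.TableA WHERE OrderID = 42;

    -- All three operations succeeded; make them permanent together.
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Any failure undoes all three operations.
    IF ( @@TRANCOUNT > 0 )
        ROLLBACK TRANSACTION;
    THROW;
END CATCH
```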

Second, explicit transactions give you clarity around what is actually necessary in a transaction.  Suppose you query a table and load the results into a temporary table.  From there, you make some modifications, join to other tables, and reshape the data a bit.  So far, nothing I’ve mentioned requires an explicit transaction because you’re only working with temp tables here.  When you take the final results and update a real table, now we want to open a transaction.  By using an explicit transaction, I make it clear exactly what I intend to have in the transaction:  the update of a real table, but not the temp table shenanigans.

Third, as an implication of the second point, explicit transactions can help you reduce the amount of time you’re locking tables.  You can do all of your heavy lifting in temp tables before opening the transaction, and that means you don’t have to do that work while locking the real table. In the best case, autocommit will behave the same, but saying “Here is where I want my transaction to be” also lets you think about whether you really want to do everything in one statement or break it up into smaller chunks.
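A sketch of the second and third points together (table names are hypothetical):

```sql
-- Heavy lifting happens before any transaction: only temp tables are involved.
SELECT s.CustomerID, SUM(s.Amount) AS TotalAmount
INTO #Staging
FROM dbo.Sales s
GROUP BY s.CustomerID;

-- ...further reshaping and joins against #Staging go here...

-- Only the final write to a real table needs the transaction.
BEGIN TRANSACTION;

UPDATE ct
SET ct.TotalAmount = st.TotalAmount
FROM dbo.CustomerTotals ct
    INNER JOIN #Staging st ON st.CustomerID = ct.CustomerID;

COMMIT TRANSACTION;
```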

Finally, if you use a loop, whether that be a cursor or WHILE statement, you can control whether you want one transaction per loop iteration or one transaction in total, and that’s entirely to do with whether you begin and commit the transaction outside of the loop or inside.  Having one transaction in total can be considerably faster in some circumstances, but if you have an expensive action in the loop, you can commit after each loop iteration.  This will minimize the amount of time you block any single operation waiting to access this table.  It will increase the total runtime of your query, but minimize the pain to other users, and that’s a trade-off  you can only make if you use explicit transactions.

Rules of Thumb

First, if you have a stored procedure which is simply running a SELECT statement, use autocommit.  There’s no real advantage to putting this into an explicit transaction and there is the downside that you might forget to commit.

Second, if you have a stored procedure which performs data modification on non-temporary tables, use an explicit transaction only over the area which modifies data.  Don’t begin the transaction until you’re ready to start modifying tables; this will minimize the amount of time you need to keep the transaction open and resources locked.

As a corollary of the second point, note that you can use explicit transactions to control parent-child relationships with stored procedures, where the parent begins a transaction, calls each child, and rolls back or commits at the end depending upon the results. That’s something you can’t do with autocommit, as each data modification statement would run in its own auto-committed transaction.
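A sketch of that parent-child pattern, with hypothetical child procedures which modify data but leave transaction control to the caller:

```sql
CREATE OR ALTER PROCEDURE dbo.ParentProcess
AS
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Each child performs its own data modification but does not
        -- commit on its own, so the parent controls the outcome.
        EXEC dbo.ChildStep1;
        EXEC dbo.ChildStep2;

        -- Commit only if every child succeeded.
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        -- Any child's failure rolls back all of the children's work.
        IF ( @@TRANCOUNT > 0 )
            ROLLBACK TRANSACTION;
        THROW;
    END CATCH
END
```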

Third, if you are working with non-global temporary tables beforehand, don’t include any modification of those inside the explicit transaction.  If you are working with global temporary tables, I suppose you should treat them like non-temporary tables here if you expect other sessions to use them and care about blocking, though there’s a pretty small number of cases where it makes sense to have global temporary tables with multiple users, so I’d call that an edge case.

Fourth, in a loop, choose whether you want to put the explicit transaction around the loop or inside it.  In most cases, I prefer to put the transaction inside the loop to minimize the amount of time that I’m blocking others. This is probably the smarter move to make in busy transactional environments, where you want to prevent blocking as much as possible.  Also, if one loop iteration fails, you’ll have less to roll back, so you can fix the issue and pick back up where you left off. Note that at that point, you trade atomicity on the entire set of data for atomicity on a block of data, so if that’s a big enough concern, bite the bullet and put an explicit transaction around the loop. Or see if you can make it faster without a loop.
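A sketch of the transaction-inside-the-loop option (dbo.EventLog and its BatchID column are hypothetical):

```sql
DECLARE @BatchID INT = 1;

WHILE ( @BatchID <= 100 )
BEGIN
    -- One transaction per iteration: each batch commits on its own,
    -- keeping blocking short at the cost of set-wide atomicity.
    BEGIN TRANSACTION;

    DELETE FROM dbo.EventLog
    WHERE BatchID = @BatchID;

    COMMIT TRANSACTION;

    SET @BatchID = @BatchID + 1;
END
```

Moving the BEGIN TRANSACTION and COMMIT outside the WHILE loop gives you the one-transaction-in-total behavior instead.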

Fifth, outside of a stored procedure—that is, when I’m just writing ad hoc statements in a client tool—use explicit transactions if you’re doing something potentially risky. I know this brings up the question of “Why are you doing risky things in a client tool to begin with?” But that’s a story for a different day.

Sixth, watch out for nested transactions.  In SQL Server, there’s very little utility in them and their behavior is weird. Paul Randal explains in great detail just how broken they are, and I’d rather the product never have had them.  Anyhow, check to see if you’re in a transaction before opening one. The pattern I like to use comes from my Working Effectively with Legacy SQL talk (which, ironically enough, needs some changes to be brought up to date) and originally from smart people in the office who put it together before I got there. Here’s a simplified version of it for a sample stored procedure:

CREATE OR ALTER PROCEDURE dbo.GetFraction
    @Divisor INT = 5
AS
BEGIN
    DECLARE @AlreadyInTransaction BIT;

    IF ( @@TRANCOUNT > 0 )
        SET @AlreadyInTransaction = 1;
    ELSE
        SET @AlreadyInTransaction = 0;

    IF ( @AlreadyInTransaction = 0 )
        BEGIN TRANSACTION;

    BEGIN TRY
        -- Note:  this is where you'd put your data modification statements.
        SELECT
            1.0 / @Divisor AS Quotient;

        IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
            COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
            ROLLBACK TRANSACTION;
        THROW;
    END CATCH
END
GO


--Test the procedure on its own; it opens and commits its own transaction
EXEC dbo.GetFraction @Divisor = 5;

--Start an explicit transaction and call the procedure inside it; the
--procedure detects the open transaction and leaves control to the caller
BEGIN TRANSACTION
EXEC dbo.GetFraction @Divisor = 5;
ROLLBACK TRANSACTION

Finally, make sure you roll back the transaction on failure.  If you write code using try-catch blocks, commit at the end of the TRY block or rollback at the beginning of the CATCH. Explicit transactions offer you more power, but come with the responsibility of handling transactions appropriately.


What are your thoughts on explicit transactions versus autocommit? Do you prefer the ease of autocommit or the power of explicit transactions? Or where do you draw the line between the two? Leave your thoughts in the comments section below—either here or on the video—and let me know.

With ML Services, Watch Those Resource Groups

I wanted to cover something which has bitten me in two separate ways regarding SQL Server Machine Learning Services and Resource Governor.

Resource Governor and Default Memory

If you install a brand new copy of SQL Server and enable SQL Server Machine Learning Services, you’ll want to look at sys.resource_governor_external_resource_pools:

That’s a mighty fine cap you’re wearing.
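A query along these lines shows the caps:

```sql
SELECT
    name,
    max_cpu_percent,
    max_memory_percent,
    max_processes
FROM sys.resource_governor_external_resource_pools;
```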

By default, SQL Server will grant 20% of available memory to any R or Python scripts running. The purpose of this limit is to prevent you from hurting server performance with expensive external scripts (like, say, training large neural networks on a SQL Server).

Here’s the kicker: this affects you even if you don’t have Resource Governor enabled. If you see out-of-memory exceptions in Python or error messages about memory allocation in R, I’d recommend bumping this max memory percent up above 20, and I have scripts to help you with the job. Of course, making this change assumes that your server isn’t stressed to the breaking point; if it is, you might simply want to offload that work somewhere else.
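If you do decide to raise the cap, the change itself is small; 50 below is just an example value, not a recommendation:

```sql
-- Raise the external pool's memory cap from the default 20% to 50%.
ALTER EXTERNAL RESOURCE POOL [default] WITH (MAX_MEMORY_PERCENT = 50);
-- The change does not take effect until you reconfigure.
ALTER RESOURCE GOVERNOR RECONFIGURE;
```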

Resource Governor and CPU

Notice that by default, the max CPU percent for external pools is 100, meaning that we get to push the server to its limits with respect to CPU.

Well, what happens if you accidentally change that? I found out the answer the hard way!

In my case, our servers were accidentally scaled down to 1% max CPU utilization. The end result was that even something as simple as print("Hello") in either R or Python would fail after 30 seconds. I thought it had to do with the Launchpad service causing problems, but after investigation, this was the culprit.
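Once we found the culprit, the fix amounted to restoring the default cap:

```sql
-- Restore the external pool's CPU cap to its default of 100%.
ALTER EXTERNAL RESOURCE POOL [default] WITH (MAX_CPU_PERCENT = 100);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```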

Identities blurred to protect the innocent.

The trickiest part about diagnosing this was that the error messages gave no indication of what the problem was—the client received a vague “could not connect to Launchpad” error, and the Launchpad error logs didn’t have any messages about the failed queries. So that’s one more thing to keep in mind when troubleshooting Machine Learning Services failures.