New Training: the Curated Data Platform (Free for a Limited Time)

I am pleased to announce a new course: The Curated Data Platform.

Given two hours of footage, you’d think I would find a spot where I smiled.

A Brief Summary

The Curated Data Platform is a 2-hour video training aimed at providing you a 30,000 foot overview of the data platform space. In this course, I take you through a variety of data platform technologies—such as relational databases, document databases, caching technologies, data lakes, and graph databases. I show you use cases in which these technologies can be great fits, as well as which companies and products are most relevant in that space today. This includes on-premises technologies as well as major services in Amazon Web Services and Azure.

Pour one out for Riak here.

Get This Course for Free! (Limited Time Offer)

Through Sunday, July 25, 2021, you can register for this course for free using the coupon code FIRSTMOVER when you check out. My one request with this is, if you use the coupon code, please be sure to leave feedback on the course—things which you liked, as well as things you wanted to see but didn’t. I intend to update this course over time to make it better based in part on learner feedback.

Winning at Pong via Reinforcement Learning

I finally got around to trying out a reinforcement learning exercise this weekend in an attempt to learn about the technique. One of the most interesting blog posts I read is Andrej Karpathy’s post on using reinforcement learning to play Pong on the Atari 2600. In it, Andrej uses the Gym package in Python to play the game.

This won’t be a post diving into the details of how reinforcement learning works; Andrej does that far better than I possibly could, so read the post. Instead, the purpose of this post is to provide a minor update to Andrej’s code to switch it from Python 2 to Python 3. In doing this, I went with the most convenient answer over a potentially better solution (e.g., switching xrange() to range() rather then re-working the code), but it does work. I also bumped up the learning rate a little bit to pick up the pace a bit.

The code is available as a GitHub Gist, which I’ve reproduced below.

import numpy as np
import pickle
import gym
# hyperparameters
H = 200 # number of hidden layer neurons
batch_size = 10 # after how many episodes do we do a parameter update?
learning_rate = 3e-4
gamma = 0.99 # discount factor for reward
decay_rate = 0.99 # decay factor for RMSProp leaky sum of grad^2
resume = False # resume from prior checkpoint?
render = False
# model initialization
D = 80 * 80 # input dimensionality: 80×80 grid
if resume:
model = pickle.load(open('save.p', 'rb'))
model = {}
model['W1'] = np.random.randn(H,D) / np.sqrt(D) # "Xavier" initialization
model['W2'] = np.random.randn(H) / np.sqrt(H)
grad_buffer = { k : np.zeros_like(v) for k,v in model.items() } # update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.items() } # rmsprop memory
def sigmoid(x):
return 1.0 / (1.0 + np.exp(x)) # sigmoid "squashing" function to interval [0,1]
def prepro(I):
""" prepro 210x160x3 uint8 frame into 6400 (80×80) 1D float vector """
I = I[35:195] # crop
I = I[::2, ::2, 0] # downsample by a factor of 2
I[I == 144] = 0 # erase background (background type 1)
I[I == 109] = 0 # erase background (background type 2)
I[I != 0] = 1 # everything else (paddles, ball) just set to 1
return I.astype(np.float64).ravel()
def discount_rewards(r):
""" take 1D float array of rewards and compute discounted reward """
discounted_r = np.zeros_like(r)
running_add = 0
for t in reversed(range(0, r.size)):
if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (specific to Pong!)
running_add = running_add * gamma + r[t]
discounted_r[t] = running_add
return discounted_r
def policy_forward(x):
h =['W1'], x)
h[h<0] = 0 # ReLU nonlinearity
logp =['W2'], h)
p = sigmoid(logp)
return p,h # return probability of taking action 2, as well as hidden state
def policy_backward (eph, epdlogp):
""" backward pass. (eph is an array of intermediate hidden states) """
dW2 =, epdlogp).ravel()
dh = np.outer(epdlogp, model['W2'])
dh[eph <= 0] = 0 # backpro prelu
dW1 =, epx)
return {'W1':dW1, 'W2':dW2}
env = gym.make("Pong-v0")
observation = env.reset()
prev_x = None # used in computing the difference frame
xs,hs,dlogps,drs = [],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 0
while True:
if render: env.render()
# preprocess the observation, set input to network to be difference image
cur_x = prepro(observation)
x = cur_x prev_x if prev_x is not None else np.zeros(D)
prev_x = cur_x
# forward the policy network and sample an action from the returned probability
aprob, h = policy_forward(x)
action = 2 if np.random.uniform() < aprob else 3 # roll the dice!
# record various intermediaries (needed later for backprop)
xs.append(x) # observation
hs.append(h) # hidden state
y = 1 if action == 2 else 0 # a "fake label"
dlogps.append(y aprob) # grad that encourages the action that was taken to be taken
# step the environment and get new measurements
observation, reward, done, info = env.step(action)
reward_sum += reward
drs.append(reward) # record reward (has to be done after we call step() to get the reward for the previous action)
if done: # an episode finished
episode_number += 1
# stack together all inputs, hidden states, action gradients, and rewards for this episode
epx = np.vstack(xs)
eph = np.vstack(hs)
epdlogp = np.vstack(dlogps)
epr = np.vstack(drs)
xs,hs,dlogps,drs = [],[],[],[] # reset array memory
# compute the discounted reward backwards through time
discounted_epr = discount_rewards(epr)
# standardize the rewards to be unit normal (helps control the gradient estimator variance)
discounted_epr -= np.mean(discounted_epr)
discounted_epr /= np.std(discounted_epr)
epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.)
grad = policy_backward(eph, epdlogp)
for k in model: grad_buffer[k] += grad[k] # accumulate grad over batch
# perform rmsprop parameter update every batch_size episodes
if episode_number % batch_size == 0:
for k,v in model.items():
g = grad_buffer[k] # gradient
rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 decay_rate) * g**2
model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
grad_buffer[k] = np.zeros_like(v) # reset batch gradient buffer
# book-keeping work
running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
print('resetting env. episode reward total was %f. running mean: %f' % (reward_sum, running_reward))
if episode_number % 100 == 0: pickle.dump(model, open('save.p', 'wb'))
reward_sum = 0
observation = env.reset() # reset environment
prev_x = None
if reward != 0: # Pong has either +1 or -1 reward exactly when the game ends.
print('ep %d: game finished, reward: %f' % (episode_number, reward) + ('' if reward == 1 else ' !!!!!!' ))
view raw hosted with ❤ by GitHub

After running the code for a solid weekend, I was able to build an agent which can hold its own against the CPU, though won’t dominate the game. Still, it’s nice to see an example of training a computer to perform a reasonably complex task (deflecting a ball into the opponent’s goal while preventing the same) when all you provide is a set of possible instructions on how to act (move the paddle up or down) and an indication of how you did in the prior round.

Space age graphics!

New DataCamp Course: Data Modeling in Power BI

I wanted to announce a brand new DataCamp course, entitled Data Modeling in Power BI. This course provides an introduction to techniques which you can use to simplify and speed up Power BI data models, with an emphasis on dimensional modeling and the Kimball technique of creating and working with star schemas.

This is a little bit different from my last DataCamp course, in that I was a collaborator on this one, so all of my work was behind the scenes. The final course is a product of my vision and Sara & Maarten’s excellent job of implementation. So go check out the course and share your thoughts.

Upcoming Events: Techorama Virtual Edition

Key Details

What: Techorama 2021.
Where: Internet Belgium, UTC+2.
When: Monday, May 17th through Wednesday, May 19th.
Tickets are available for sale on the Techorama website.

What I’m Presenting

8:45 AM — 9:45 AM EDT — Of Types and Measures

It’s a little rare that I get to give an F#-focused talk, so I’m glad they selected this one. Also, Techorama is a talk I really would like to get to do in person one of these years.

SQL Day Poland

Key Details

What: SQL Day Poland.
Where: Internet Poland.
When: Monday, May 10th through Wednesday, May 12th.
Admission is paid, 500 Polish zloty for the conference (roughly $130 USD). RSVP on the SQL Day website.

What I’m Presenting

9:00 AM — 10:00 AM EDT — Does this look weird to you? An introduction to Anomaly Detection

This is a nearly new talk: I’ve given it a couple of times at user groups to warm it up, so I’m starting to get into the groove with this talk.

Upcoming Events: St Louis SQL Server and BI User Group

Key Details

What: St. Louis SQL Server and Business Intelligence User Group.
Where: On the Internet, UTC-5.
When: Tuesday, May 11th.
Admission is free. RSVP on Meetup.

What I’m Presenting

1:00 PM — 2:30 PM EDT — The Curated Data Platform

I enjoy giving this talk. It’s a whirlwind tour of data platform products and straddling the line between “Come check out all of these technologies!” and “Maybe you don’t need all of these technologies…”

Dataset Drift Monitoring with Azure ML

One of the things I like to say about machine learning model is, “shift happens.” By that, I mean that models lose effectiveness over time due to changes in underlying circumstances. Relationships between variables that used to hold no longer do, and so our model quality degrades. This means that we sometimes need to retrain models.

But there’s a cost to retraining models—that work can be computationally expensive and time-consuming. This concern is particularly salient if you’re in a cloud, as you pay directly for everything there. This means that we don’t want to retrain models unless we need to. But when do we know if we should retrain the model? We can watch for model degradation, but there’s another method: drift detection in your datasets.

The concept of drift detection is simple: we want to figure out if the distribution of the data has changed significantly since we built our model. If it has, that’s a sign that we should potentially look at retraining our model, or at least paying closer attention to model results. Let’s go through an example of drift detection using Azure Machine Learning. Note that as of the time of this blog post (March of 2021), dataset monitors are still in public preview, so things may have changed between now and the time you read this, visitor from the future.

Battlespace Preparation

The first thing we need to do is create a dataset. I’ve created a dataset for my expenses data demo, where I know that the Amount value has changed significantly for some people over the years 2017-2019 versus the starting point of 2011-2016. Here’s my schema:

One perfectly normal dataset.

Note that I needed to set one column to Timestamp. This is necessary for drift detection, as it needs to track results over time. To do that, there are two options available to us: either select a column to be the timestamp like in this case, or if you are using blob storage, you can use a folder structure like yyyy/mm/dd to track that data.

Now it’s time to create our dataset monitor. To do that, select Datasets from the Assets menu, choose Dataset monitors, and then select + Create to create a new dataset monitor.

Monitoring a dataset.

Our next step is to choose a target dataset. I chose the expense-reports dataset that I’ve created. Note that you also get to choose a specific version of the dataset.

Choose the form of your destructor.

After choosing a target dataset, you have to set up a baseline. The baseline defines what normal is for the dataset. That can be either a date range in the target dataset or a separate dataset altogether. In my case, I chose that the date range was part of the target dataset.

Picking a time frame

One thing I do want to complain about here is that in the UI, I don’t have the ability to type in a date. For something where I’m going back to January of 2011, that’s a lot of clicking to do. If I use the code-first approach, I can of course enter a timeframe, but I wanted to try out the UI approach first and it was not ideal.

Anyhow, the next step is to configure monitor settings. Here is where you can name the monitor, select which features you want to track, choose which machines will execute scheduled (or ad hoc) drift checks, and how frequently you want to check data. You can also optionally enter e-mail addresses if you’d like to receive e-mails when drift goes above the set threshold. All in all, this screen is straightforward.

Monitor settings

I decided to experiment with three different dataset monitors. The first one, that you see above, tracks all features weekly. The second, whose screen I did not capture, monitors all features monthly. The third monitors just the amount, but does so monthly.

Monitoring the most important thing: the moolah.

The reason I did this is that I know the dataset well enough to understand that Amount is the volatile variable. I wondered if drift detection would be able to alert me on potential drift for the all-features example, or if I needed to narrow it down to the one thing which does change dramatically.

Backfilling Data

After creating a monitor, navigating to its page shows that it is…kind of bare.

No data = no drift. Problem solved.

We haven’t provided any non-baseline data set, so of course this is empty. Also, the start and end dates run from March 2020 through March 2021, and I know I don’t have any data for that time frame. So let’s backfill some data. To do that, I select Analyze existing data and that shows a fly-out menu. In that menu, I can set the time frame for analysis, as well as my compute target.

Preparing a backfill

Let’s take a moment now and talk about timeframes. When we created the monitors, we set the frequency to one of three values: Daily, Weekly, or Monthly. This has two effects. First, it sets up an automated schedule to run on that period. Second, it assumes that you want that period for backfills as well. So for this weekly expenses monitor, the drift detection process will group data by week and perform an analysis. This becomes important in a moment. But let’s first submit this run.

After submitting a run, we learn that data drift backfills are stored in Azure ML as experiments, so we can collect details on the run there.

It was always Ohio.

Getting Results

After selecting the expenses-monitor-all-weekly-Monitor-Runs experiment, we can select the first run, and that gives us an important hint about how we’re doing it wrong.

Needs more data.

It turns out that we need to have at least 50 non-null data points per group. My group is a week, so I need at least 50 rows for the period 2017-01-01 until 2017-01-08, and then another 50 rows from 2017-01-08 until 2017-01-15, and so on. Well, for my data set, I don’t have 50 rows per week at all. It’s a fairly sparse data set in that regard, and thus the weekly monitor won’t work. It will keep telling me “No Data” because there aren’t enough rows to count any particular week.

Fortunately, we thought ahead and did this for the month as well. Clever us.

Something seems fishy here.

We can see that the monthly drift does return results and those numbers are quite high, driven mostly by Amount. Note that Amount is responsible for 93% of the total drift, and that our magnitude of drift is way above the arbitrary threshold. We can also see that it was increasing month-over-month for January, February, and March of 2017.

From there, we can see charts on the relative importance of features with respect to drift, as well as the measures Azure Machine Learning uses to track drift.

Yep, that’s your problem right there.

For numeric features, Azure ML uses four components: minimum value, maximum value, mean value, and Wasserstein distance (also called earth-mover’s distance). That is, three point value comparisons and one measure comparing the baseline distribution to the target distribution.

For categorical features, Azure ML uses the number of unique categories and the Euclidian distance. They describe it as:

Computed for categorical columns. Euclidean distance is computed on two vectors, generated from empirical distribution of the same categorical column from two datasets. 0 indicates there is no difference in the empirical distributions. The more it deviates from 0, the more this column has drifted. Trends can be observed from a time series plot of this metric and can be helpful in uncovering a drifting feature.

As we can see here, Amount drives our change and the others stay pretty well the same over time. I’d next like to run a backfill on the rest of my data points, but I don’t want to spend all day click-click-clicking in the UI. Fortunately, there’s a code-first notebook experience.

Code-First Backfills

In order to perform a backfill, we need only a few lines of Python code.

from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector
from datetime import datetime

ws = Workspace.from_config()

monitor = DataDriftDetector.get_by_name(ws, 'expenses-monitor-all-monthly')

backfill1 = monitor.backfill(datetime(2017, 4, 1), datetime(2017, 6, 30))
backfill1 = monitor.backfill(datetime(2017, 7, 1), datetime(2017, 9, 30))
# etc. etc.  Or, you know, create a loop.
backfill1 = monitor.backfill(datetime(2019, 10, 1), datetime(2019, 12, 31))

Each of the calls to monitor.backfill() will queue up a run of the relevant experiment, so the call will finish within a couple of seconds, but that doesn’t mean your backfill has completed.

Added to queue; mission accomplished.

In my simple scenario, each 3-month period took about 3 minutes to run. Obviously, this will depend greatly on compute power, number of rows of data, and number of features to compare.

Many thumbs were twiddled in the making of this blog post.

Now that everything is complete, we can take a look at drift over the course of our several-year dataset.

Just a little bit of drift here.

In the Future

So far, we’ve only looked at backfills. The real benefit of data drift analysis, however, is that you can use it to monitor data going forward based on the schedule you’ve set. Then, if the monitor catches drift-related issues, it can send you an e-mail and alert you to this change in data.

Can you believe I’ve made it all the way through this blog post and haven’t once made any Fast & Furious references? Yeah, me neither.


Dataset drift monitoring with Azure Machine Learning is really simple, to the point where you can set it up within minutes and have reasonable data within hours. I think it’d be hard to find a simpler method to perform this sort of analysis.

Azure ML: Reading from and Writing to Azure SQL DB

Not too long ago, I worked through an interesting issue with Azure Machine Learning. The question was, what’s the best way to read from Azure SQL Database, perform model processing, and then write results out to Azure SQL Database? Oh, by the way, I want to use a service principal rather than SQL authentication. Here’s what I’ve got.

Set up a Service Principal

I’m going to assume that you already have Azure SQL Database and the Azure Machine Learning workspace set up. In the Azure search bar, search for App Registrations and then select + New registration.

Register a new application.

Name the account whatever you’d like. I called mine AzureMLDatabaseAccess to keep its purpose clear. On the Overview page, you will see some important information:

Your details probably will not be blurred out. If they are, clean your glasses and try again.

You’ll want to copy the Application ID and the Directory ID. And note that each one of these has a name in parentheses. As we all know, two extremely common names is better than one extremely common name.

After you copy those down, navigate to the Certificates & secrets tab on the Manage menu and select + New client secret.

What’s the answer? It’s a secret.

Name your client secret. I happened to name mine DatabaseConnection. This generates a Value which you can use as a password. Copy this down as well, as you’ll need it later and this is your one chance to see it. Unless you delete this secret and create a new one, I guess.

Tie the Service Principal to Azure SQL Database

Now we need to make this app registration mean something. To do that, we need to connect to Azure SQL Database and tie the app registration to our database. We can do that by running a code snippet like the following:


Where naturally, AzureMLDatabaseAccess is the name of your app registration. Then, you can grant whatever SQL Server rights you need, such as adding this app registration to specific roles, all the way up to granting it db_owner because what’s the worst that could happen, amirite?

Oh, by the way, make sure you’re connected to the database using Azure Active Directory. Otherwise, you’ll get an error like this:

Well, that didn’t work out as I wanted.

If you’re like me and you don’t have an Azure Active Directory tenant set up, you can connect to the SQL server (and not the SQL database), navigate to Active Directory admin in the Settings menu, select Set admin, and choose your app registration. Be careful with this, however—that app registration can now do anything on the server, so don’t go sharing details all willy-nilly.

Usually it’s a good idea to avoid doing much of anything all willy-nilly.

Save that and you’re golden. Now let’s go to Azure Machine Learning.

Configure the Azure Machine Learning Datastore

Navigate to Azure Machine Learning Studio and select your workspace. Then, in the Manage menu, select Datastores. Select + New datastore to create a new datastore.

Create a new datastore.

I’m going to call this datastore expense-reports because that’s the name of my database. I’ll choose Azure SQL database as the Datastore type and select my subscription and database. Change the Authentication type to Service principal and you’ll need to enter a Tenant ID, Client ID, and Client secret. Those are the three things I just happened to have you write down. It’s great how we sometimes luck out that way. With all of that done, we can create the datastore and move on.

One filled-out datastore later.

The good news is that we have a datastore. The bad news is that we have no idea if it worked. So let’s figure that out. Navigate to Datasets in the Assets menu and choose + Create dataset and then From datastore.

Create a new dataset.

I’m going to call my dataset expense-reports because that’s what it is. Keep the Dataset type as Tabular and move to the next section.

In this section, I’ll select Previously created datastore and pick the expense_reports datastore. Then, choose Select datastore. As a quick note, I want to make sure that the Skip data validation box is unchecked. We absolutely want to validate the data, as we’re using this test to make sure that our app registration is set up correctly.

I can then enter my SQL query and select Next to continue.

This definitely looks like a query.

If your security setup is correct, you’ll be able to continue. Otherwise, you’ll see an error like the following:

This message is simultaneously verbose and empty.

The real error message is in the More details link. In this case, my error code was 40615. Check out the guide for troubleshooting connectivity errors in Azure SQL Database and Azure SQL Managed Instance and 40615 is a firewall issue. This one’s pretty easy to fix. Return to the SQL server and navigate to Firewalls and virtual networks in the Security menu. The easy button here is to allow Azure services and resources to access the server. The hard button is to figure out the relevant IP addresses that Azure Machine Learning will use and add those. I voted for easy.

Let those Azure services roam free.

After saving and updating the firewall rule, give it a few minutes to propagate and try again. If you paid your Azure bill and included the bribes special payments, you should get your dataset preview.

“Special payments” successful.

Now that we have this in place, let’s test it all out.

Building a Notebook

For this last part, we’ll want to have a table in Azure SQL Database to store predictions. Here’s the table creation statement:

CREATE TABLE [dbo].[ExpenseReportPrediction](
	[ExpenseReportPredictionID] [int] IDENTITY(1,1) NOT NULL,
	[EmployeeID] [int] NOT NULL,
    [EmployeeName] [nvarchar](50) NOT NULL,
	[ExpenseCategoryID] [tinyint] NOT NULL,
    [ExpenseCategory] [nvarchar](50) NOT NULL,
	[ExpenseDate] [date] NOT NULL,
    [ExpenseYear] [int] NOT NULL,
    [Amount] [decimal](5, 2) NOT NULL,
	[PredictedAmount] [decimal](5, 2) NOT NULL
ALTER TABLE [dbo].[ExpenseReportPrediction] ADD  CONSTRAINT [PK_ExpenseReportPrediction] PRIMARY KEY CLUSTERED 
	[ExpenseReportPredictionID] ASC

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw sqldb-demo.ipynb hosted with ❤ by GitHub

And here’s a picture of the output:

We have predictions.


In this post, we learned a few things.

  • It is possible to read from Azure SQL Database in Azure Machine Learning. It’s pretty easy, even.
  • It is not easy at all to write to Azure SQL Database from Azure Machine Learning.
  • Service Principals are pretty cool.