I wanted to announce a brand new DataCamp course, entitled Data Modeling in Power BI. This course provides an introduction to techniques which you can use to simplify and speed up Power BI data models, with an emphasis on dimensional modeling and the Kimball technique of creating and working with star schemas.
This is a little bit different from my last DataCamp course, in that I was a collaborator on this one, so all of my work was behind the scenes. The final course is a product of my vision and Sara & Maarten’s excellent job of implementation. So go check out the course and share your thoughts.
What: SQL Day Poland. Where: Internet Poland. When: Monday, May 10th through Wednesday, May 12th. Admission is paid, 500 Polish zloty for the conference (roughly $130 USD). RSVP on the SQL Day website.
I enjoy giving this talk. It’s a whirlwind tour of data platform products and straddling the line between “Come check out all of these technologies!” and “Maybe you don’t need all of these technologies…”
One of the things I like to say about machine learning model is, “shift happens.” By that, I mean that models lose effectiveness over time due to changes in underlying circumstances. Relationships between variables that used to hold no longer do, and so our model quality degrades. This means that we sometimes need to retrain models.
But there’s a cost to retraining models—that work can be computationally expensive and time-consuming. This concern is particularly salient if you’re in a cloud, as you pay directly for everything there. This means that we don’t want to retrain models unless we need to. But when do we know if we should retrain the model? We can watch for model degradation, but there’s another method: drift detection in your datasets.
The concept of drift detection is simple: we want to figure out if the distribution of the data has changed significantly since we built our model. If it has, that’s a sign that we should potentially look at retraining our model, or at least paying closer attention to model results. Let’s go through an example of drift detection using Azure Machine Learning. Note that as of the time of this blog post (March of 2021), dataset monitors are still in public preview, so things may have changed between now and the time you read this, visitor from the future.
The first thing we need to do is create a dataset. I’ve created a dataset for my expenses data demo, where I know that the Amount value has changed significantly for some people over the years 2017-2019 versus the starting point of 2011-2016. Here’s my schema:
Note that I needed to set one column to Timestamp. This is necessary for drift detection, as it needs to track results over time. To do that, there are two options available to us: either select a column to be the timestamp like in this case, or if you are using blob storage, you can use a folder structure like yyyy/mm/dd to track that data.
Now it’s time to create our dataset monitor. To do that, select Datasets from the Assets menu, choose Dataset monitors, and then select + Create to create a new dataset monitor.
Our next step is to choose a target dataset. I chose the expense-reports dataset that I’ve created. Note that you also get to choose a specific version of the dataset.
After choosing a target dataset, you have to set up a baseline. The baseline defines what normal is for the dataset. That can be either a date range in the target dataset or a separate dataset altogether. In my case, I chose that the date range was part of the target dataset.
One thing I do want to complain about here is that in the UI, I don’t have the ability to type in a date. For something where I’m going back to January of 2011, that’s a lot of clicking to do. If I use the code-first approach, I can of course enter a timeframe, but I wanted to try out the UI approach first and it was not ideal.
Anyhow, the next step is to configure monitor settings. Here is where you can name the monitor, select which features you want to track, choose which machines will execute scheduled (or ad hoc) drift checks, and how frequently you want to check data. You can also optionally enter e-mail addresses if you’d like to receive e-mails when drift goes above the set threshold. All in all, this screen is straightforward.
I decided to experiment with three different dataset monitors. The first one, that you see above, tracks all features weekly. The second, whose screen I did not capture, monitors all features monthly. The third monitors just the amount, but does so monthly.
The reason I did this is that I know the dataset well enough to understand that Amount is the volatile variable. I wondered if drift detection would be able to alert me on potential drift for the all-features example, or if I needed to narrow it down to the one thing which does change dramatically.
After creating a monitor, navigating to its page shows that it is…kind of bare.
We haven’t provided any non-baseline data set, so of course this is empty. Also, the start and end dates run from March 2020 through March 2021, and I know I don’t have any data for that time frame. So let’s backfill some data. To do that, I select Analyze existing data and that shows a fly-out menu. In that menu, I can set the time frame for analysis, as well as my compute target.
Let’s take a moment now and talk about timeframes. When we created the monitors, we set the frequency to one of three values: Daily, Weekly, or Monthly. This has two effects. First, it sets up an automated schedule to run on that period. Second, it assumes that you want that period for backfills as well. So for this weekly expenses monitor, the drift detection process will group data by week and perform an analysis. This becomes important in a moment. But let’s first submit this run.
After submitting a run, we learn that data drift backfills are stored in Azure ML as experiments, so we can collect details on the run there.
After selecting the expenses-monitor-all-weekly-Monitor-Runs experiment, we can select the first run, and that gives us an important hint about how we’re doing it wrong.
It turns out that we need to have at least 50 non-null data points per group. My group is a week, so I need at least 50 rows for the period 2017-01-01 until 2017-01-08, and then another 50 rows from 2017-01-08 until 2017-01-15, and so on. Well, for my data set, I don’t have 50 rows per week at all. It’s a fairly sparse data set in that regard, and thus the weekly monitor won’t work. It will keep telling me “No Data” because there aren’t enough rows to count any particular week.
Fortunately, we thought ahead and did this for the month as well. Clever us.
We can see that the monthly drift does return results and those numbers are quite high, driven mostly by Amount. Note that Amount is responsible for 93% of the total drift, and that our magnitude of drift is way above the arbitrary threshold. We can also see that it was increasing month-over-month for January, February, and March of 2017.
From there, we can see charts on the relative importance of features with respect to drift, as well as the measures Azure Machine Learning uses to track drift.
For numeric features, Azure ML uses four components: minimum value, maximum value, mean value, and Wasserstein distance (also called earth-mover’s distance). That is, three point value comparisons and one measure comparing the baseline distribution to the target distribution.
For categorical features, Azure ML uses the number of unique categories and the Euclidian distance. They describe it as:
Computed for categorical columns. Euclidean distance is computed on two vectors, generated from empirical distribution of the same categorical column from two datasets. 0 indicates there is no difference in the empirical distributions. The more it deviates from 0, the more this column has drifted. Trends can be observed from a time series plot of this metric and can be helpful in uncovering a drifting feature.
As we can see here, Amount drives our change and the others stay pretty well the same over time. I’d next like to run a backfill on the rest of my data points, but I don’t want to spend all day click-click-clicking in the UI. Fortunately, there’s a code-first notebook experience.
In order to perform a backfill, we need only a few lines of Python code.
from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector
from datetime import datetime
ws = Workspace.from_config()
monitor = DataDriftDetector.get_by_name(ws, 'expenses-monitor-all-monthly')
backfill1 = monitor.backfill(datetime(2017, 4, 1), datetime(2017, 6, 30))
backfill1 = monitor.backfill(datetime(2017, 7, 1), datetime(2017, 9, 30))
# etc. etc. Or, you know, create a loop.
backfill1 = monitor.backfill(datetime(2019, 10, 1), datetime(2019, 12, 31))
Each of the calls to monitor.backfill() will queue up a run of the relevant experiment, so the call will finish within a couple of seconds, but that doesn’t mean your backfill has completed.
In my simple scenario, each 3-month period took about 3 minutes to run. Obviously, this will depend greatly on compute power, number of rows of data, and number of features to compare.
Now that everything is complete, we can take a look at drift over the course of our several-year dataset.
In the Future
So far, we’ve only looked at backfills. The real benefit of data drift analysis, however, is that you can use it to monitor data going forward based on the schedule you’ve set. Then, if the monitor catches drift-related issues, it can send you an e-mail and alert you to this change in data.
Dataset drift monitoring with Azure Machine Learning is really simple, to the point where you can set it up within minutes and have reasonable data within hours. I think it’d be hard to find a simpler method to perform this sort of analysis.
Not too long ago, I worked through an interesting issue with Azure Machine Learning. The question was, what’s the best way to read from Azure SQL Database, perform model processing, and then write results out to Azure SQL Database? Oh, by the way, I want to use a service principal rather than SQL authentication. Here’s what I’ve got.
Set up a Service Principal
I’m going to assume that you already have Azure SQL Database and the Azure Machine Learning workspace set up. In the Azure search bar, search for App Registrations and then select + New registration.
Name the account whatever you’d like. I called mine AzureMLDatabaseAccess to keep its purpose clear. On the Overview page, you will see some important information:
You’ll want to copy the Application ID and the Directory ID. And note that each one of these has a name in parentheses. As we all know, two extremely common names is better than one extremely common name.
After you copy those down, navigate to the Certificates & secrets tab on the Manage menu and select + New client secret.
Name your client secret. I happened to name mine DatabaseConnection. This generates a Value which you can use as a password. Copy this down as well, as you’ll need it later and this is your one chance to see it. Unless you delete this secret and create a new one, I guess.
Tie the Service Principal to Azure SQL Database
Now we need to make this app registration mean something. To do that, we need to connect to Azure SQL Database and tie the app registration to our database. We can do that by running a code snippet like the following:
CREATE USER [AzureMLDatabaseAccess] FROM EXTERNAL PROVIDER
Where naturally, AzureMLDatabaseAccess is the name of your app registration. Then, you can grant whatever SQL Server rights you need, such as adding this app registration to specific roles, all the way up to granting it db_owner because what’s the worst that could happen, amirite?
Oh, by the way, make sure you’re connected to the database using Azure Active Directory. Otherwise, you’ll get an error like this:
If you’re like me and you don’t have an Azure Active Directory tenant set up, you can connect to the SQL server (and not the SQL database), navigate to Active Directory admin in the Settings menu, select Set admin, and choose your app registration. Be careful with this, however—that app registration can now do anything on the server, so don’t go sharing details all willy-nilly.
Save that and you’re golden. Now let’s go to Azure Machine Learning.
Configure the Azure Machine Learning Datastore
Navigate to Azure Machine Learning Studio and select your workspace. Then, in the Manage menu, select Datastores. Select + New datastore to create a new datastore.
I’m going to call this datastore expense-reports because that’s the name of my database. I’ll choose Azure SQL database as the Datastore type and select my subscription and database. Change the Authentication type to Service principal and you’ll need to enter a Tenant ID, Client ID, and Client secret. Those are the three things I just happened to have you write down. It’s great how we sometimes luck out that way. With all of that done, we can create the datastore and move on.
The good news is that we have a datastore. The bad news is that we have no idea if it worked. So let’s figure that out. Navigate to Datasets in the Assets menu and choose + Create dataset and then From datastore.
I’m going to call my datasetexpense-reports because that’s what it is. Keep the Dataset type as Tabular and move to the next section.
In this section, I’ll select Previously created datastore and pick the expense_reports datastore. Then, choose Select datastore. As a quick note, I want to make sure that the Skip data validation box is unchecked. We absolutely want to validate the data, as we’re using this test to make sure that our app registration is set up correctly.
I can then enter my SQL query and select Next to continue.
If your security setup is correct, you’ll be able to continue. Otherwise, you’ll see an error like the following:
The real error message is in the More details link. In this case, my error code was 40615. Check out the guide for troubleshooting connectivity errors in Azure SQL Database and Azure SQL Managed Instance and 40615 is a firewall issue. This one’s pretty easy to fix. Return to the SQL server and navigate to Firewalls and virtual networks in the Security menu. The easy button here is to allow Azure services and resources to access the server. The hard button is to figure out the relevant IP addresses that Azure Machine Learning will use and add those. I voted for easy.
After saving and updating the firewall rule, give it a few minutes to propagate and try again. If you paid your Azure bill and included the bribes special payments, you should get your dataset preview.
Now that we have this in place, let’s test it all out.
Building a Notebook
For this last part, we’ll want to have a table in Azure SQL Database to store predictions. Here’s the table creation statement:
SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
CREATE TABLE [dbo].[ExpenseReportPrediction](
[ExpenseReportPredictionID] [int] IDENTITY(1,1) NOT NULL,
[EmployeeID] [int] NOT NULL,
[EmployeeName] [nvarchar](50) NOT NULL,
[ExpenseCategoryID] [tinyint] NOT NULL,
[ExpenseCategory] [nvarchar](50) NOT NULL,
[ExpenseDate] [date] NOT NULL,
[ExpenseYear] [int] NOT NULL,
[Amount] [decimal](5, 2) NOT NULL,
[PredictedAmount] [decimal](5, 2) NOT NULL
) ON [PRIMARY]
ALTER TABLE [dbo].[ExpenseReportPrediction] ADD CONSTRAINT [PK_ExpenseReportPrediction] PRIMARY KEY CLUSTERED
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
And here’s a picture of the output:
In this post, we learned a few things.
It is possible to read from Azure SQL Database in Azure Machine Learning. It’s pretty easy, even.
It is not easy at all to write to Azure SQL Database from Azure Machine Learning.
Catallaxy Services is looking for a paid intern. By which I mean that I am looking for an intern, but it sounds more official when I take on the corporate guise. The short version of the skills I am looking for are as follows:
Slide deck design
Fluent or native English language skills
Willingness to learn about data platform technologies
Catallaxy Services, LLC provides training and consulting services in the data platform space. 2020 was a busy year and 2021 is shaping up to be even busier. In order to expand and grow, I am looking for an intern with visual design skills to perform approximately 20 hours of work per week. The primary focus of this position will be to assist me with content design and development, editing videos, and improving the quality of existing digital assets.
If this sounds good to you (or someone you know), please reach out to me at email@example.com and we can talk about the position in more detail. Or you can apply via LinkedIn.
As I was reading that, I decided that it was time to create an Official Ranking. So here it is for the date and time data types:
DATETIME2, especially DATETIME2(0). If you’re using UTC everywhere, this is a no-brainer best data type. You get to choose whether you want your DATETIME2 to be smaller than DATETIME or more precise than DATETIME.
DATE. A large percentage of the time, I just need the date and don’t care about the time. This three-byte beauty gives me that, but with zero risk that I miss a record because instead of 2020-01-01 00:00:00 they enter something that rounds to 2020-01-01 00:00:00.003.
DATETIMEOFFSET. It’s a hefty data type, the largest in the bunch. But if you aren’t dealing with UTC everywhere, it does something no other date or time data type can: it keeps track of your time zone. Therefore, if you for some reason hate storing times in UTC, flip this with DATETIME2 and use it everywhere. Note that you also need to use the AT TIME ZONE syntax to specify what the time zone is or have your user correctly specify the time zone by passing in a pre-created DATETIMEOFFSET. It’s like getting a puppy: you wanted it, now you deal with the extra work taking care of it.
TIME. This is where the utility curve really starts to fall off. I rarely use the TIME data type. However, I do appreciate that it’s just 3 bytes and can be useful for process scheduling.
DATETIME. It’s really not that great of a data type. If you care about size, three of the four have it beat. If you care about precision, three of the four (assuming you bump up the precision on DATETIME2) have it beat. If you care about not messing things up, three of the four have it beat (DATE and TIME if you use one but not both elements, and DATETIMEOFFSET if you don’t want to mess up time zones).
SMALLDATETIME. If you’re really that desperate about two bytes, just use DATETIME2. This was a lot more useful prior to SQL Server 2008, but we can say the same thing about Kurt Warner. That doesn’t make me want to plug him into my fantasy team today.