One of the things I like to say about machine learning model is, “shift happens.” By that, I mean that models lose effectiveness over time due to changes in underlying circumstances. Relationships between variables that used to hold no longer do, and so our model quality degrades. This means that we sometimes need to retrain models.
But there’s a cost to retraining models—that work can be computationally expensive and time-consuming. This concern is particularly salient if you’re in a cloud, as you pay directly for everything there. This means that we don’t want to retrain models unless we need to. But when do we know if we should retrain the model? We can watch for model degradation, but there’s another method: drift detection in your datasets.
The concept of drift detection is simple: we want to figure out if the distribution of the data has changed significantly since we built our model. If it has, that’s a sign that we should potentially look at retraining our model, or at least paying closer attention to model results. Let’s go through an example of drift detection using Azure Machine Learning. Note that as of the time of this blog post (March of 2021), dataset monitors are still in public preview, so things may have changed between now and the time you read this, visitor from the future.
The first thing we need to do is create a dataset. I’ve created a dataset for my expenses data demo, where I know that the
Amount value has changed significantly for some people over the years 2017-2019 versus the starting point of 2011-2016. Here’s my schema:
Note that I needed to set one column to
Timestamp. This is necessary for drift detection, as it needs to track results over time. To do that, there are two options available to us: either select a column to be the timestamp like in this case, or if you are using blob storage, you can use a folder structure like
yyyy/mm/dd to track that data.
Now it’s time to create our dataset monitor. To do that, select Datasets from the Assets menu, choose Dataset monitors, and then select + Create to create a new dataset monitor.
Our next step is to choose a target dataset. I chose the
expense-reports dataset that I’ve created. Note that you also get to choose a specific version of the dataset.
After choosing a target dataset, you have to set up a baseline. The baseline defines what normal is for the dataset. That can be either a date range in the target dataset or a separate dataset altogether. In my case, I chose that the date range was part of the target dataset.
One thing I do want to complain about here is that in the UI, I don’t have the ability to type in a date. For something where I’m going back to January of 2011, that’s a lot of clicking to do. If I use the code-first approach, I can of course enter a timeframe, but I wanted to try out the UI approach first and it was not ideal.
Anyhow, the next step is to configure monitor settings. Here is where you can name the monitor, select which features you want to track, choose which machines will execute scheduled (or ad hoc) drift checks, and how frequently you want to check data. You can also optionally enter e-mail addresses if you’d like to receive e-mails when drift goes above the set threshold. All in all, this screen is straightforward.
I decided to experiment with three different dataset monitors. The first one, that you see above, tracks all features weekly. The second, whose screen I did not capture, monitors all features monthly. The third monitors just the amount, but does so monthly.
The reason I did this is that I know the dataset well enough to understand that
Amount is the volatile variable. I wondered if drift detection would be able to alert me on potential drift for the all-features example, or if I needed to narrow it down to the one thing which does change dramatically.
After creating a monitor, navigating to its page shows that it is…kind of bare.
We haven’t provided any non-baseline data set, so of course this is empty. Also, the start and end dates run from March 2020 through March 2021, and I know I don’t have any data for that time frame. So let’s backfill some data. To do that, I select Analyze existing data and that shows a fly-out menu. In that menu, I can set the time frame for analysis, as well as my compute target.
Let’s take a moment now and talk about timeframes. When we created the monitors, we set the frequency to one of three values: Daily, Weekly, or Monthly. This has two effects. First, it sets up an automated schedule to run on that period. Second, it assumes that you want that period for backfills as well. So for this weekly expenses monitor, the drift detection process will group data by week and perform an analysis. This becomes important in a moment. But let’s first submit this run.
After submitting a run, we learn that data drift backfills are stored in Azure ML as experiments, so we can collect details on the run there.
After selecting the
expenses-monitor-all-weekly-Monitor-Runs experiment, we can select the first run, and that gives us an important hint about how we’re doing it wrong.
It turns out that we need to have at least 50 non-null data points per group. My group is a week, so I need at least 50 rows for the period 2017-01-01 until 2017-01-08, and then another 50 rows from 2017-01-08 until 2017-01-15, and so on. Well, for my data set, I don’t have 50 rows per week at all. It’s a fairly sparse data set in that regard, and thus the weekly monitor won’t work. It will keep telling me “No Data” because there aren’t enough rows to count any particular week.
Fortunately, we thought ahead and did this for the month as well. Clever us.
We can see that the monthly drift does return results and those numbers are quite high, driven mostly by
Amount. Note that
Amount is responsible for 93% of the total drift, and that our magnitude of drift is way above the arbitrary threshold. We can also see that it was increasing month-over-month for January, February, and March of 2017.
From there, we can see charts on the relative importance of features with respect to drift, as well as the measures Azure Machine Learning uses to track drift.
For numeric features, Azure ML uses four components: minimum value, maximum value, mean value, and Wasserstein distance (also called earth-mover’s distance). That is, three point value comparisons and one measure comparing the baseline distribution to the target distribution.
For categorical features, Azure ML uses the number of unique categories and the Euclidian distance. They describe it as:
Computed for categorical columns. Euclidean distance is computed on two vectors, generated from empirical distribution of the same categorical column from two datasets. 0 indicates there is no difference in the empirical distributions. The more it deviates from 0, the more this column has drifted. Trends can be observed from a time series plot of this metric and can be helpful in uncovering a drifting feature.
As we can see here,
Amount drives our change and the others stay pretty well the same over time. I’d next like to run a backfill on the rest of my data points, but I don’t want to spend all day click-click-clicking in the UI. Fortunately, there’s a code-first notebook experience.
In order to perform a backfill, we need only a few lines of Python code.
from azureml.core import Workspace, Dataset from azureml.datadrift import DataDriftDetector from datetime import datetime ws = Workspace.from_config() monitor = DataDriftDetector.get_by_name(ws, 'expenses-monitor-all-monthly') backfill1 = monitor.backfill(datetime(2017, 4, 1), datetime(2017, 6, 30)) backfill1 = monitor.backfill(datetime(2017, 7, 1), datetime(2017, 9, 30)) # etc. etc. Or, you know, create a loop. backfill1 = monitor.backfill(datetime(2019, 10, 1), datetime(2019, 12, 31))
Each of the calls to
monitor.backfill() will queue up a run of the relevant experiment, so the call will finish within a couple of seconds, but that doesn’t mean your backfill has completed.
In my simple scenario, each 3-month period took about 3 minutes to run. Obviously, this will depend greatly on compute power, number of rows of data, and number of features to compare.
Now that everything is complete, we can take a look at drift over the course of our several-year dataset.
In the Future
So far, we’ve only looked at backfills. The real benefit of data drift analysis, however, is that you can use it to monitor data going forward based on the schedule you’ve set. Then, if the monitor catches drift-related issues, it can send you an e-mail and alert you to this change in data.
Dataset drift monitoring with Azure Machine Learning is really simple, to the point where you can set it up within minutes and have reasonable data within hours. I think it’d be hard to find a simpler method to perform this sort of analysis.