This is part three in a series on low-code machine learning with Azure ML.
The Place of AutoML
Automated Machine Learning (AutoML) provides two distinct benefits. The first benefit is the one that AutoML providers tend to tout: you don’t need (much) machine learning experience to use these tools. According to the marketing, AutoML does all of the work and you sit back and enjoy the fruits of its labor.
I am nowhere near sold on this use case for AutoML. Yes, you can get answers in a few clicks, but to get good answers, you need a lot more knowledge of data processing and statistics than they let on. Feeding in garbage data will get you mediocre results.
The other use case for AutoML is one I’m much happier with on principle: forming a minimum baseline for model quality. For example, suppose I’m talking to a business executive about a classification engine, and as we get to talking about requirements, that executive notes that our predictions need to be right at least 95% of the time. Well, I can take the base data, perform some amount of feature engineering, and throw it at AutoML. If the measures for accuracy come close to 95%, then I know this is potentially viable: with proper time spent on feature engineering, algorithm research, and hyperparameter tuning, I can beat what AutoML gives me and will likely break that 95% threshold. If, however, the measures for accuracy are in the 30s or 40s, then I know it’s a much more difficult challenge with a high risk of failure, and the business executive and I can move forward with that understanding in mind, knowing that we can pull the plug quickly if we don’t see good results.
You might note that regardless of the reason, the process is still the same, so let’s get into it.
Automating Machine Learning
Inside Azure ML, we have an Automated ML menu option in the Author menu. Select that and then choose New Automated ML run.

The first thing we’ll need to do is choose a dataset. AutoML, as of the time of writing, only supports tabular datasets, so file-based datasets like collections of images are not going to work. I have quite a few datasets from which to choose, though if you’re following along at home, you probably don’t have most of these. I’m going to choose the penguin dataset, which you can get from the following link. It’s a simple classification problem in which we wish to determine which type of penguin we have, based on a few characteristics.

Next up, I need to configure my run. I’ll create a new experiment and call it penguin-classifier. I need to select a target column, which is the thing I want to predict. Note that we may only have one target column. I can also select my compute type; AutoML can use a compute cluster or a compute instance. I’ll choose a compute cluster and use a fairly small cluster which costs $0.15 per hour.
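If you’d rather script this step than click through the studio, here’s a minimal sketch using the Azure ML Python SDK (v1). The dataset name penguins and the cluster name cpu-cluster are my assumptions; substitute whatever you’ve registered in your own workspace.

```python
from azureml.core import Workspace, Experiment, Dataset

ws = Workspace.from_config()                        # reads config.json for your workspace
penguins = Dataset.get_by_name(ws, "penguins")      # assumed name of the registered tabular dataset
experiment = Experiment(ws, "penguin-classifier")   # same experiment name as in the studio
compute_target = ws.compute_targets["cpu-cluster"]  # assumed name of an existing compute cluster
```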

On the Task selection page, we may choose from one of three tasks: classification, regression, and time series forecasting. In case these are new terms to you, here’s a really quick primer:
- Classification means that our target column (or label) is discrete—that is, there is a fixed set of possible values, and those values represent different things. Here, the species indicator (which includes the values 0, 1, and 2) represents three different species of penguin.
- Regression means that our label is continuous—that is, it can take (approximately) any value within a range. Predicting the height of a penguin based on various characteristics would be a regression problem.
- Time series forecasting is typically a regression problem with a time element. The idea here is that the current value depends in part on the prior value, and possibly on a series of prior values (a property known as autocorrelation or serial correlation). With time series data, we rarely see major breaks; movement instead tends to be relatively smooth, and so we need to incorporate that into the model. Time also introduces the idea of periodic patterns, like temperatures rising as we get to summer and falling as we get to winter.
We have a classification problem here, so we’ll choose that option. There is also a checkbox to enable deep learning, a topic which is beyond the scope of this series. Deep learning can lead to considerably more accurate models, but it takes a lot longer to process, and this dataset is too small to make good use of those techniques, so I’ll leave the box unchecked.
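If you’re ever unsure which task you have, a quick look at the target column usually settles it: a handful of repeated values points to classification, while a broad numeric spread points to regression. Here’s a rough sketch with pandas; the file name and column names are assumptions based on the penguin dataset.

```python
import pandas as pd

df = pd.read_csv("penguins.csv")  # assumed local copy of the dataset

# Discrete target with a few repeated values -> classification
print(df["species"].nunique())       # e.g., 3 distinct classes
print(df["species"].value_counts())

# Continuous target spread across a range -> regression
print(df["body_mass_g"].describe())  # a numeric column like body mass
```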

Additional Configuration Settings
From here, the View additional configuration settings menu allows you to pick from several options.

For classification problems, the Primary metric drop-down lets you choose what you want to use as the measurement of model quality.
- Weighted Area Under the Curve (AUC) measures the trade-off between the true positive rate and the false positive rate, weighted by how often each class appears. The more true positives and the fewer false positives a model produces, the closer AUC gets to a perfect score of 1.0.
- Accuracy is the number of correct guesses divided by the total number of guesses. Accuracy is sometimes a good measure, but it has a couple of problems, especially with highly imbalanced class data, such as if we had 9,998 examples of penguin type 0 and one each of types 1 and 2. In that case, an algorithm which guessed 0 for all penguins would be right 9,998 out of 10,000 times, for a score of 99.98%. But in practice, it would be useless at determining the actual type of a penguin.
- Norm macro recall is a normalized version of the macro recall score, that is, the arithmetic mean of the recall for each class, rescaled so that random guessing scores 0. Recall here means the percentage of penguins of a given class that we correctly identify. Norm macro recall would catch our prior problem quite well: we’d be 100% right on class 0 and 0% right on classes 1 and 2, giving us a macro recall of about 0.33 and a norm_macro_recall of 0 (see the sketch below).
- Precision score weighted looks at the ability of our model to avoid labeling a penguin as class 0 if it’s really a 1 or a 2.
- Average precision score weighted summarizes the precision-recall curve, averaging the precision achieved at each recall threshold and weighting by how often each class appears.
Microsoft Docs has a more detailed explanation of what all of these mean, including the specific calculations performed.
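To make the imbalanced-class example above concrete, here’s a quick sketch using scikit-learn as a stand-in for AutoML’s internal calculations. The normalization step at the end follows the norm_macro_recall formula in the Azure ML documentation (rescaling so random guessing scores 0).

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 9,998 penguins of type 0 and one each of types 1 and 2
y_true = np.array([0] * 9998 + [1, 2])
# A lazy model that always guesses type 0
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))               # 0.9998 -- accuracy looks outstanding

macro_recall = recall_score(y_true, y_pred, average="macro")
print(macro_recall)                                 # ~0.33 -- 100% on class 0, 0% on classes 1 and 2

R = 1 / 3  # expected macro recall from random guessing across 3 classes
print((macro_recall - R) / (1 - R))                 # 0.0 -- norm macro recall exposes the lazy model
```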
The Blocked algorithms option lets you prevent AutoML from using specific algorithms. You might know that certain algorithms perform poorly on your data, or you might want to focus on just a few to cut down on the time and effort for training. Regardless of your reasoning, you can use this to simplify the task.
We don’t need a Positive class label here—this is used in binary classification, and helps us understand positive versus negative, whether that’s True, 1, or whatever. The reason we might want to specify the positive class is that False might actually be the “positive” class—suppose we are looking for an enzyme which naturally occurs in people, and we build a test to determine if you do not have the enzyme. In that case, a False result is actually the thing we’re testing for, which makes False the positive outcome and True the negative outcome. It’s a little weird to think about it in those terms, so I’d recommend trying to keep things aligned as much as possible.
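As a quick illustration of why the positive class matters, here’s the enzyme scenario with scikit-learn’s recall score. The labels are hypothetical; pos_label simply tells the metric which outcome counts as “positive.”

```python
from sklearn.metrics import recall_score

# 1 = person has the enzyme, 0 = person lacks it; the test is looking for the *lack*
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

print(recall_score(y_true, y_pred))               # default pos_label=1: how well we find the enzyme
print(recall_score(y_true, y_pred, pos_label=0))  # pos_label=0: how well we find its absence
```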
The Exit criterion menu lets us cap the amount of time we spend on AutoML. By default, it’s set to run for up to 6 hours. And by the way, when it says 6 hours, expect it to take more like 8-9 hours, especially if you have a lot of data to process. AutoML takes a while to get started, and it might take 30-60 minutes for it to get off its keister and churn through the first model. So don’t expect this to be the type of thing you kick off 10 minutes before an important meeting. In addition, there is a metric score threshold. I’m going to set the value at 0.93, meaning that if AutoML finds a model with a weighted AUC of at least 0.93, it will stop processing and declare victory. This is a good way of ensuring that the process won’t go on forever.
Finally, Concurrency will let us run several tasks at once. By default, that number is 4, though you can change it.
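For reference, all of these settings have counterparts in the SDK. Here’s a hedged sketch of how the primary metric, blocked algorithms, exit criterion, metric threshold, and concurrency from above might map to an AutoMLConfig; the dataset, compute target, and label column carry over from the earlier sketch and are assumptions.

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="AUC_weighted",    # the metric AutoML optimizes and reports on
    training_data=penguins,           # the registered tabular dataset from earlier
    label_column_name="species",      # assumed name of the target column
    compute_target=compute_target,
    blocked_models=["KNN"],           # skip any algorithms you don't want to try
    experiment_timeout_hours=6,       # exit criterion: cap the total training time
    experiment_exit_score=0.93,       # stop early once AUC_weighted reaches 0.93
    max_concurrent_iterations=4,      # how many models to train at once
)
```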
Featurization Settings
The Featurization settings menu allows you to work with the individual columns.

We can choose whether or not to use specific features. If, for example, we have a text column with the type of penguin, that’s something we would not want to include in our AutoML run, as the whole idea is that we don’t actually know the type and want to have the system make its best guess. I normally just leave these all as Auto, although there are some interesting options in the Impute with drop-down. Imputation happens when a particular value is missing; there’s a quick sketch of these strategies after the list. The options are:
- Auto
- Mean
- Most frequent (Mode)
- Median
- Fill with constant
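To show what those strategies actually do to a column, here’s a small sketch using scikit-learn’s SimpleImputer as a stand-in for whatever AutoML does internally; the bill length values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A feature column with one missing value
bill_length = np.array([[39.1], [40.3], [np.nan], [36.7]])

mean_imputer = SimpleImputer(strategy="mean")                        # fill with the column mean
median_imputer = SimpleImputer(strategy="median")                    # fill with the column median
mode_imputer = SimpleImputer(strategy="most_frequent")               # fill with the most common value
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)  # fill with a fixed constant

print(mean_imputer.fit_transform(bill_length).ravel())  # the NaN becomes ~38.7, the mean of the rest
```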
Validate and Test
Finally, we can perform model validation and testing. It’s important to note that these are two different tasks and we’ll want to use different data for each:
- Training data is what we use to build a model. This is usually 50-80% of the total data, depending on how much we have and how much variety there is. More data allows us to push this number down.
- Validation data is what we use to compare quality between several models. This is usually 10-20% of the total data. AutoML has a few different options for validation, but I’d leave it at Auto unless you know what you’re doing. If you already have your own separate validation dataset, you can use it, but normally, we use a technique like k-fold cross-validation to split the dataset up and still make the fullest use of the data (there’s a sketch of this after the list).
- Test data lets us determine how good the best-performing model is. Test data is something the model has never before seen, and we normally reserve 10-30% of total data for it. Ideally, our test measures are almost as good as our training measures; if there’s a major discrepancy, we have a problem.
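Outside of AutoML, the same idea looks something like the sketch below with scikit-learn. The synthetic data, the 70/30 split, the 10 folds, and the logistic regression stand-in are all illustrative choices, not what AutoML necessarily does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=42)

# Hold out test data the model never sees during training or validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# k-fold cross-validation on the remaining data plays the role of a separate
# validation set: each fold takes a turn as the validation slice
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X_trainval, y_trainval, cv=10).mean())

# Only after settling on a model do we touch the test set
model.fit(X_trainval, y_trainval)
print(model.score(X_test, y_test))
```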
Finish and Wait
Once you get through all of the settings, it’s time to kick back and wait a while. It can take a while for AutoML even to begin, and from there, you might end up running it for hours or days to come up with your answer.
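In SDK terms, the kick-off-and-wait step is just submitting the configuration and blocking until it finishes. This sketch builds on the assumed names from the earlier snippets; get_output returns the best child run and its fitted model once everything is done.

```python
# Submit the AutoML run; this is the part that can take hours
# depending on your data and exit criteria
run = experiment.submit(automl_config)
run.wait_for_completion(show_output=True)  # stream progress to the console

# Retrieve the best-performing child run and its fitted model
best_run, fitted_model = run.get_output()
print(best_run.get_metrics())  # includes AUC, accuracy, and the other metrics discussed above
```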
Of course, as I say that, it only took about 6 minutes to finish and we ended up with an AUC of 0.9985, which is outstanding. That score also came from the very first model, which suggests we were probably destined for a great score regardless.

The Data guardrails tab tells us about the choices AutoML made in featurization. It turns out that for validation, it chose 10-fold cross-validation, and it decided to use the mean for any missing values. Granted, there were only two penguins with missing values, so not a big number there.
The Models tab lets us see all of the models AutoML created. We can drill into the models for more details on how they work, why they do what they do, and which algorithms were used. In our case, we see the combination of MaxAbsScaler and LightGBM. All MaxAbsScaler does is divide each feature by its maximum absolute value, so every value ends up between -1 and 1. Then, AutoML used the LightGBM classification algorithm to do the dirty work.
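That winning combination is easy to reproduce outside of AutoML as a scikit-learn pipeline. This is a sketch with synthetic data and default hyperparameters; AutoML’s actual tuned parameters will differ.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=42)

# MaxAbsScaler divides each feature by its maximum absolute value, squeezing
# everything into [-1, 1]; LightGBM then handles the actual classification
pipeline = make_pipeline(MaxAbsScaler(), LGBMClassifier())
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))
```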
There’s a good bit more that we could dive into but I’ll save that for the presentation version of this.
Conclusion
In this post, we saw how AutoML works. Because we used a clean, simple dataset, AutoML was able to generate an effective answer in just a few minutes. In practice, I’d definitely allocate more time to the task than that. AutoML should not replace human intelligence, but it can speed up the process of finding a good model.