This is part two in a series on low-code machine learning with Azure ML.
Welcome to Machine Learning Studio
Azure Machine Learning Services is the full-on product for machine learning in Azure. In addition to this, there is Azure ML Studio, a graphical interface for using Azure ML. You’ll definitely spend a lot of time in here as you get used to Azure ML, though in the follow-up series we’ll end up spending less and less time in-Studio. But let’s not get too far ahead of ourselves here.
This is Azure ML Studio:
There are a lot of options and settings available to us, but let's take our time and walk through them in a sensible order. The first thing we're going to need is some data, and to get data into Azure ML, we need a data source.
Creating a Datastore
In Azure ML, data sources are called Datastores, and creating one of these is pretty easy:
Select Datastores from the Manage menu and then choose New datastore. This will bring up a flyout panel allowing you to choose a datastore. All of the datastore options are Azure data sources, so if your data is in S3, on-premises, or somewhere else, you’ll want to have a separate process move that data into Azure.
As far as ease of use goes, all of these data sources are pretty easy to read from, but unless you're using a file-based datastore (Blob Storage, file share, Data Lake Storage), writing data will be a royal pain, as I describe in a post from last year. Because we'll primarily be reading data from these sources in this series, you can use whatever datastore you'd like. Note as well that Azure ML automatically creates two datastores for you: one for blob storage and one for file storage. Those are tied to the storage account that we set up in the prior post in the series.
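If you'd rather script this step than click through the flyout panel, here's a sketch using the azureml-core SDK (v1) to register a blob container as a datastore. The workspace config, datastore name, container, and account details are all placeholders for your own resources; this assumes you've downloaded a `config.json` for your workspace.

```python
# Sketch: registering a blob container as a datastore, equivalent to the
# "New datastore" flyout in Studio. Names and keys below are placeholders.
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()  # reads config.json for your workspace

datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="training_data",
    container_name="ml-data",
    account_name="mystorageaccount",
    account_key="<storage-account-key>",
)
```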
Creating a Dataset
Once you have a datastore, you’re going to want to create at least one dataset. Datasets are versioned collections of data in some datastore. The Azure ML model is quite file-centric, and this concept makes the most sense with something like a data lake, where we have different extracts of data over different timeframes. Perhaps we get an extract of customer behavior up to the year 2018, and then the next year we get customer behavior up to 2019, and so on. The idea here is that you can use the latest training data for your models, but if you want to see how current models would have stacked up against older data, the opportunity is there.
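To make the versioning idea concrete, here's a sketch of registering a new extract as a fresh version of an existing dataset using the azureml-core SDK (v1). The datastore name, file paths, and dataset name are illustrative; the key piece is `create_new_version=True`, which keeps the older extract available under its original version number.

```python
# Sketch: registering a new version of a dataset as new extracts arrive.
# Paths and names below are hypothetical.
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, "training_data")

ds = Dataset.Tabular.from_delimited_files(
    path=(datastore, "customer_behavior/extract_2019.csv")
)
# create_new_version=True keeps the 2018 extract around as version 1
ds = ds.register(workspace=ws, name="customer-behavior", create_new_version=True)

# Later, retrieve the older version to see how a model stacks up against it
old = Dataset.get_by_name(ws, name="customer-behavior", version=1)
```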
Creating a new dataset is pretty easy, and we see a few options for how we might load in our data.
If you do happen to have files on your machine that you want Azure ML to use, the From local files option will let you upload them into an Azure Blob Storage (or Data Lake Storage Gen2) account. You also have the choice between two dataset types: Tabular and File.
Tabular datasets are the most common format, and they're what you'll need for the majority of ML tasks. A tabular dataset can be the result set of a SQL query, a CSV file, a Parquet file, or anything else you could turn into a DataFrame. By contrast, File datasets hold binary data like audio, video, and image files. Some ML techniques can accept these types of data for tasks such as image classification, but most of the time you'll be working with tabular data.
You can also retrieve data from websites, so if you have a direct link to a file, you can use that instead of downloading the file to your local machine and uploading it to Azure Blob Storage. A dataset based on a web file doesn't require a copy in Blob Storage, but that convenience comes with a risk: if the file disappears sometime in the future, your dataset will stop working. For that reason, you might want to make a copy anyway if you don't directly control the data.
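The web-file route is a one-liner in the SDK. This is a sketch with a made-up URL; point it at any direct link to a delimited file you control or trust to stay put.

```python
# Sketch: creating a tabular dataset directly from a public URL, with no
# copy landing in Blob Storage. The URL below is illustrative.
from azureml.core import Dataset

ds = Dataset.Tabular.from_delimited_files(
    path="https://example.com/data/churn.csv"
)
df = ds.to_pandas_dataframe()  # materialize locally for exploration
```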
The final option is to pull data from Open Datasets, which is public domain data readily available for you to use. This is great for experimentation and you get to see some examples of just how messy real-world data is.
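Open Datasets are also reachable from code via the azureml-opendatasets package (a separate pip install). As a sketch, this pulls a week of the NYC green taxi dataset into a DataFrame; the class and date range here follow Microsoft's published examples, but treat the specifics as an assumption to verify against the current package.

```python
# Sketch: pulling a curated Open Dataset into a pandas DataFrame.
# Requires: pip install azureml-opendatasets
from datetime import datetime
from azureml.opendatasets import NycTlcGreen

start = datetime(2018, 5, 1)
end = datetime(2018, 5, 7)
taxi = NycTlcGreen(start_date=start, end_date=end)
df = taxi.to_pandas_dataframe()  # one week of NYC green taxi trips
```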
Now that we have data, we need some compute resources to process the data.
Compute Resources

There are four different types of compute available to you in Azure Machine Learning. Let's go over each in turn.
A compute instance is a hosted virtual machine which contains a number of data science and machine learning libraries pre-installed. These are great for easy remote development, and you’ll need a compute instance for some of the work we will do in this series. When you run any code in a notebook, it runs against a compute instance you specify.
The single biggest problem I have with compute instances is that there is no auto-stop functionality to them. This is really frustrating because you’re paying for that virtual machine like you would any other, so if you forget to turn it off when you go home for the weekend, it’ll cost you. I wish there were a built-in option to shut off a compute instance after a certain amount of inactivity. Instead, you’ll need to start and stop them manually.
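One crude mitigation is to script the shutoff yourself. This is a sketch using the Azure CLI's ml extension, which you could drop into a scheduled job as an end-of-day stop; the resource names are placeholders for your own.

```shell
# Sketch: stopping a compute instance from the Azure CLI (ml extension).
# Names below are placeholders; schedule this however you schedule things.
az ml compute stop --name my-compute-instance \
    --resource-group my-rg \
    --workspace-name my-ml-workspace
```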
Typically, I'd recommend one compute instance per developer, treated like a disposable developer VM. Don't be afraid to trash one of these compute instances and build a new one from scratch; it's probably the easiest way to update packages, especially if you maintain notebooks which capture any custom packages or configuration you want on the box.
Compute clusters are going to give you a more powerful training experience, as you can have multiple machines working together on certain types of problems. Azure ML controls these automatically, spinning them up and down on an as-needed basis. When you run training jobs in the Designer (or in code), you pick a compute cluster for this work.
Unlike compute instances, compute clusters will spin down when they aren’t active, so you don’t have to start and stop these machines and the likelihood of a nasty surprise in your bill is a lot lower. Also, unlike compute instances, you probably don’t want one of these per developer; instead, developers can share compute clusters.
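Here's a sketch of provisioning a small autoscaling cluster with the azureml-core SDK (v1). The VM size, node counts, and cluster name are illustrative; the important bit is `min_nodes=0`, which is what lets the cluster scale down to nothing when idle.

```python
# Sketch: provisioning a small autoscaling compute cluster. With
# min_nodes=0, the cluster scales to zero when idle, so you pay only
# while jobs run. VM size and names below are illustrative.
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

config = AmlCompute.provisioning_configuration(
    vm_size="Standard_DS3_v2",
    min_nodes=0,
    max_nodes=4,
    idle_seconds_before_scaledown=1800,  # spin down after 30 idle minutes
)
cluster = ComputeTarget.create(ws, "cpu-cluster", config)
cluster.wait_for_completion(show_output=True)
```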
Inference clusters are machines (and sets of machines) which you have available to perform inference—that is, generating predictions. There are five different options for inference:
- You can perform inference on a local machine. We won’t cover that in this series, but it is useful for troubleshooting issues.
- Azure Container Instances are a simple way of making real-time inference available. It’s not going to be the most robust solution, but it is easy, as there is no cluster management necessary. Microsoft recommends it only for dev/test scenarios, though I’d extend that to small production jobs as well. This is intended for real-time inference, meaning you call a REST API, pass in one or a few records, and get back a small set of predictions. It’s not intended for batch scenarios, in which the system processes whole files at a time.
- Azure Kubernetes Service is the Microsoft-recommended solution for production workloads. Azure ML creates and deploys a web service, and you can use AKS to scale out how many nodes will be available. The downside is that Kubernetes has a pretty steep learning curve, so managing one of these can be a lot more difficult than ACI.
- Arc-enabled Kubernetes lets you run jobs not just in AKS, but also on-premises and in Amazon's EKS. It also lets you perform batch inference, where you point the service at a directory and let it read through all of the files and write out predictions someplace.
- Azure ML compute clusters let you run batch inference without needing to manage Kubernetes. It won’t let you do real-time scoring, but for low-priority tasks, this and ACI are good choices.
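Since ACI is the path this series will focus on, here's a sketch of what that deployment looks like in the azureml-core SDK (v1). The model name, entry script, curated environment name, and service name are all hypothetical placeholders; `score.py` would be your scoring script with the usual `init()` and `run()` functions.

```python
# Sketch: deploying a registered model as a real-time endpoint on Azure
# Container Instances. Model, script, environment, and service names are
# placeholders for your own.
from azureml.core import Workspace, Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="churn-model")

inference_config = InferenceConfig(
    entry_script="score.py",  # your init()/run() scoring script
    environment=Environment.get(ws, name="my-inference-env"),
)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "churn-aci", [model], inference_config, aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)  # the REST endpoint for real-time predictions
```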
In this series, we’ll focus on Azure Container Instances, though we may also hit on Azure ML compute clusters.
Attached compute is a fairly new option in Azure ML. It allows you to make use of compute resources in another service rather than creating them in Azure ML.
The image above shows which options are currently available. One quick note is that the “Virtual machine” option must be an Azure Data Science VM, not just any virtual machine you’ve created.
For the most part, these are custom operators for specific operations. For example, I have an attached Azure Data Factory compute because I’ve run Azure ML tasks via Data Factory. You can also attach a Databricks runtime and execute Databricks notebooks as part of an Azure ML task.
In this post, we looked at different ways we can make data available to Azure ML, as well as the different methods we have to process data. In the next post, we’ll begin training a model with Automated Machine Learning.