Machine Learning Setup
Go to your old Azure portal. Then, go to New –> Data Services –> Machine Learning –> Quick Create. You’ll need an Azure account and an Azure storage account.
Once we have our workspace, we can connect to Azure ML Studio to continue.
Build An Experiment
We want to build the following Azure ML experiment:
Let’s build it one step at a time. First, click on the [+] NEW link and select Blank Experiment.
This will give you a blank canvas. We want to look for Flight Delay Data, which is a sample data set Microsoft provides. We’re going to take this data set and drag it onto the canvas.
Visualize The Data
Before we build a model, we need to understand the data set to get an idea of what kind of information is available to us. In this case, we get a set of data dealing with flight delays, but to learn a little bit more about this data set, we can right-click on the set and go to Dataset –> Visualize. That will let us get some basic details:
In this data set, we have a few attributes like year of flight, day of flight, carrier, origin airport ID, and destination airport ID. We also have a few metrics, including departure delay, arrival delay, and whether the flight was cancelled. I clicked ArrDelay (arrival delay) and got some basic statistics and a bar chart. This result set shows that the average flight (mean) is 6.6 minutes late, but the median flight is 3 minutes early.
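A mean well above the median usually points to a skewed distribution: a handful of very late flights drag the average up while most flights land close to on time. Here is a minimal Python sketch of that effect. The delay values are made up for illustration and are not taken from the Flight Delay Data set.

```python
import statistics

# Hypothetical arrival delays in minutes (negative = early); illustrative only.
arr_delays = [-12, -8, -5, -3, -3, -1, 0, 4, 35, 120]

mean_delay = statistics.mean(arr_delays)      # pulled up by the two very late flights
median_delay = statistics.median(arr_delays)  # the "typical" flight

print(f"mean: {mean_delay:.1f} minutes, median: {median_delay:.1f} minutes")
```

Just as in the real data set, the typical flight here is slightly early even though the average is late.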
Determine the Goal
My goal here is to see if I can predict how late a flight will be based on this data set. If I can make a good prediction, I might be able to use it for my personal goals—figuring out my flight schedule a little bit better.
In other words, what I want to do is predict the arrival delay (ArrDelay). The rest of my experiment will go toward that goal.
Add More Components
Ignore Cancelled Flights
The first thing I want to do is throw out cancelled flights. I expect that those flights won’t be marked as delayed (because they were cancelled), and so I don’t want them to throw my predictions off.
To do this, I’ll drag a Split Data component onto the canvas and select “Relative Expression” as the splitting mode.
I also need to set a Relational expression, specifying that I want to split data based on whether Cancelled is greater than 0: \"Cancelled" > 0. As a quick note, my first attempt with this Split Data component had me set \"Cancelled" = 0, but that caused an error when I tried to run the ML experiment, so instead of using an equality operator, I needed to use an inequality operator.
The Split Data component gives you two outputs, but I only want to use the second, non-cancelled flights output.
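For anyone who thinks better in code, the same idea can be sketched in plain Python: one predicate, two outputs. The helper name split_by_predicate and the sample rows are my own inventions for illustration, not anything Azure ML exposes.

```python
# Hypothetical rows; the column names mirror the data set, the values are made up.
flights = [
    {"Carrier": "DL", "ArrDelay": 12.0, "Cancelled": 0},
    {"Carrier": "AA", "ArrDelay": 0.0, "Cancelled": 1},
    {"Carrier": "WN", "ArrDelay": -4.0, "Cancelled": 0},
]

def split_by_predicate(rows, predicate):
    """Return (matching, non_matching), mimicking a split with two outputs."""
    matching = [row for row in rows if predicate(row)]
    non_matching = [row for row in rows if not predicate(row)]
    return matching, non_matching

# Equivalent of the relational expression "Cancelled is greater than 0".
cancelled, not_cancelled = split_by_predicate(
    flights, lambda row: row["Cancelled"] > 0
)

print(len(cancelled), "cancelled;", len(not_cancelled), "kept")
```

Only not_cancelled, the second output, feeds the rest of the pipeline.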
I’ve decided what my goal is, and now I need to throw out any unnecessary columns so that the model I choose doesn’t try to use them. In my case, I want to throw out the following columns:
- CRSDepTime and CRSArrTime: These are the scheduled departure and arrival times, respectively. I could make use of them, but I figured they’d have little predictive power.
- DepDelay: The number of minutes late the flight is at departure. This probably correlates pretty well with arrival lateness, but I want to see if the other columns predict late arrival times, as I won’t know how late the flight departs until I’m on the plane.
- DepDel15 and ArrDel15: These tell us whether the flight was at least 15 minutes late for departure or arrival. They’re just a grouping mechanism for DepDelay and ArrDelay, so I don’t want my analysis to include them.
- Cancelled: Was the flight cancelled? Well, I just filtered out those flights, so no cancelled flights remain. Therefore, we don’t need this column.
I drop these with a Project Columns component, which is nice in that it gives me a drop-down list with the column names in my data set.
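Outside the designer, projecting columns is just a matter of dropping keys from each row. Below is a rough Python sketch. The project_columns helper is hypothetical, the sample values are made up, and the spellings of the kept columns (Year, Carrier, OriginAirportID, DestAirportID) are my guesses at the data set's names.

```python
# The columns I decided to exclude in the list above.
EXCLUDED = {"CRSDepTime", "CRSArrTime", "DepDelay", "DepDel15", "ArrDel15", "Cancelled"}

def project_columns(rows, excluded):
    """Return copies of the rows without the excluded columns."""
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]

# One hypothetical row; values are illustrative only.
rows = [{
    "Year": 2013, "Carrier": "DL",
    "OriginAirportID": 11433, "DestAirportID": 13303,
    "CRSDepTime": 837, "DepDelay": -3.0, "DepDel15": 0,
    "CRSArrTime": 1138, "ArrDelay": 1.0, "ArrDel15": 0, "Cancelled": 0,
}]

projected = project_columns(rows, EXCLUDED)
print(sorted(projected[0]))
```

The label column, ArrDelay, stays in: the model needs it for training even though it is the thing we want to predict.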
Separate Training From Test Data
The next thing that I want to do is add another Split Data component. This time, I want to split a percentage of rows off into a training set (for the model) and a test set (to test the efficacy of the model). This Split Data component has a different splitting mode: Split rows.
In this case, 70% of my rows will go to the training set, with the remaining 30% held back for the test set. I’m specifying a “random” seed because I want others to be able to replicate my work.
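The reason a fixed seed makes the split repeatable is easy to see in a small Python sketch. The split_rows function and the seed value 12345 are placeholders of my own, not what Azure ML uses internally.

```python
import random

def split_rows(rows, fraction, seed):
    """Shuffle a copy of the rows with a fixed seed, then cut at the fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-ins for flight records

train, test = split_rows(rows, fraction=0.7, seed=12345)
train_again, test_again = split_rows(rows, fraction=0.7, seed=12345)

print(len(train), len(test), train == train_again)
```

Anyone who runs the split with the same seed gets the same 70 training rows and the same 30 test rows.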
Adding A Model
Now that I have the data set laid out how I want, I need to hook up a model. In this case, I’ll use a linear regression model. In practice, we’ll want to use a model which makes sense based on our expectations, but I’m starting with a very simple model.
Select the Linear Regression component and drag it onto the canvas. We won’t need to set any options here, although there are a couple of parameters we can tweak.
One thing to note is that our model has zero inputs and one output. This means we don’t directly move a data set to a model. Instead, we need to use the model as an input for something else.
Train A Model
That “something else” is a Train Model component. We drag one of those onto the canvas and hook up the Linear Regression and Split Data as the inputs for this component.
On the Train Model component, I’ve selected arrival delay as the variable we’d like to explain, and Azure ML takes care of the rest.
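What “the rest” looks like depends on the solver, but the core idea of fitting a linear regression is to find the coefficients that minimise squared error on the training rows. Here is a single-feature ordinary least squares sketch in plain Python. It illustrates the technique and makes no claim about Azure ML's implementation; the feature and its values are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training data: one numeric feature and the arrival delay label.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
arr_delay = [3.0, 5.0, 7.0, 9.0, 11.0]  # exactly 1 + 2x, so the fit is easy to check

intercept, slope = fit_line(feature, arr_delay)
print(intercept, slope)
```

The untrained algorithm (fit_line) and the training data are separate inputs, and the trained model (the intercept and slope) is the output. That is the same shape as wiring Linear Regression and Split Data into Train Model.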
Score the Model
We have a trained model at this point, but now it’s time to see how useful that model actually is. To do this, we’re going to drag on a Score Model component and set its inputs. The Score Model component requires two inputs: a model and a data set. We’ll use the Train Model component as our model, and the test data for our data set. We want to use the test data because we’ve already used training data to build our model, and testing against data we’ve already used will give us a false image of how good our model really is.
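Scoring amounts to running the trained model over held-out rows and attaching a prediction to each one. A small Python sketch follows, assuming a model that is just an (intercept, slope) pair. The score function, the "x" feature, and the "predicted" column name are all hypothetical.

```python
def score(model, rows, feature_key, output_key="predicted"):
    """Return copies of the rows with the model's prediction added."""
    intercept, slope = model
    return [
        {**row, output_key: intercept + slope * row[feature_key]}
        for row in rows
    ]

model = (1.0, 2.0)  # hypothetical (intercept, slope) from a training step

# Held-out test rows the model never saw during training; values are made up.
test_rows = [
    {"x": 6.0, "ArrDelay": 14.0},
    {"x": 7.0, "ArrDelay": 12.0},
]

scored = score(model, test_rows, feature_key="x")
for row in scored:
    print("actual:", row["ArrDelay"], "predicted:", row["predicted"])
```

Keeping the actual value next to the prediction is what makes the evaluation step possible.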
Evaluate the Model
The very last component on this experiment is an Evaluate Model component. This lets us see how well the model predicted test data. After putting this component on the canvas, right-click on it and go to Evaluation results –> Visualize. You’ll get a bar graph as well as a summary data set which looks like this:
For a linear regression model, the Coefficient of Determination is better known as the R-squared value, and it ranges from 0 (the inputs explain none of the variation in the dependent variable) up to 1 (the inputs explain all of it). In this case, our R-squared is a pitiful 0.009792, meaning the model explains less than 1% of the variation in arrival delay. We cannot meaningfully predict how late a flight will be based solely on the inputs collected.
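To make that number less abstract, here is a short Python sketch of R-squared, assuming the usual definition of 1 minus the ratio of residual to total sum of squares. The delay values are made up.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

actual = [10.0, -5.0, 30.0, 0.0]  # hypothetical arrival delays in minutes

perfect = list(actual)                                 # predicts every row exactly
mean_only = [sum(actual) / len(actual)] * len(actual)  # always guesses the average

print(r_squared(actual, perfect), r_squared(actual, mean_only))
```

A score near 0, like ours, means the model does barely better than always guessing the average delay.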
Although our first model was an abysmal failure at predicting flight delay time, it did show just how easy putting together an experiment in Azure ML can be. We were able to cleanse a data set, project specific columns, split out training and test data, train a model, score that model, and evaluate the model using little more than a canvas. Anyone familiar with SQL Server Integration Services or Informatica will be right at home with Azure ML.
The problem that database people will run into, however, is that although creating a model is easy, creating a good model is hard. We can slap together a few components, but it takes more than that to generate a useful model.