Azure ML: Custom Data Sets

Last time around, I built a simple Azure ML experiment based on flight delay data made available to us as a Microsoft sample.  Today, I want to load a custom data set and build an experiment against it.

The Data Set

I’m using SQL Saturday data for this demo.  I put together a comma-delimited data file with one row for each 2015 SQL Saturday.  Each row includes:

  1. City name
  2. Region.  This is something I personally coded, breaking each SQL Saturday down into the following regions:  Southeast, Mid-Atlantic, Northeast, Midwest, South, West, Canada, Latin America, Europe, Africa, and Asia.
  3. Date
  4. Month
  5. International.  1 if this event took place outside of the United States.
  6. Personal Connection.  1 if I personally knew at least one of the speaker committee members.
  7. Spoke.  1 if I spoke at this event.

My plan is to regress Spoke against Region, Month, International, and Personal Connection to see if there is a meaningful pattern.  (Hint:  in reality, there is.)
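
To make the file layout concrete, here is a minimal Python sketch of reading a file shaped like the one described above.  The header names for City and Date, the numeric Month, and both sample rows are my assumptions for illustration; they are not taken from the real data set.

```python
import csv
import io

# Hypothetical sample rows: the cities and values are made up for
# illustration and do not come from the real SQL Saturday file.
SAMPLE_CSV = """City,Region,Date,Month,International,PersonalConnection,Spoke
Exampleville,Midwest,2015-02-07,2,0,1,1
Sampleton,Europe,2015-06-13,6,1,0,0
"""

# Columns assumed to hold integers (Month as a number, the rest as 0/1 flags).
INT_COLUMNS = ("Month", "International", "PersonalConnection", "Spoke")


def load_events(text):
    """Parse comma-delimited text into a list of dicts, casting the integer columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column in INT_COLUMNS:
            row[column] = int(row[column])
        rows.append(row)
    return rows


events = load_events(SAMPLE_CSV)
for event in events:
    print(event["City"], event["Region"], event["Spoke"])
```

Azure ML does this parsing for you on upload, so a script like this is only useful for sanity-checking the file beforehand.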

Uploading A Data Set

Uploading a data set to Azure ML is easy.  In Azure ML Studio, go to My Experiments.  Then, on the left-hand side of the screen, click the “Datasets” option.  To add a new dataset, click the [+] NEW button at the bottom and select Dataset -> From Local File.

[Image: NewDataset]

Doing this brings up a modal dialog:

[Image: UploadDataset]

Uploading my hundred-row dataset took a few seconds, and now it’s available in the datasets section.

[Image: MyDatasets]

Now let’s experiment.

Create The Experiment

Our experiment is going to look similar to the one we did yesterday:  a simple linear regression.  Because the structure is fundamentally the same, I won’t walk through it in as much detail as I did yesterday.

Build the Data

I dropped my 2015 SQL Saturdays custom dataset onto the canvas.  After that, I dragged on a Project Columns component and selected the columns I want to use in my regression model:  Region, Month, International, PersonalConnection, and Spoke.

[Image: SQLSaturdayColumns]

From there, I dragged on a Split Data component and split the rows 70-30 using random seed 21468.
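
For anyone curious what a seeded 70-30 split amounts to, here is a rough Python sketch.  It is not how the Split Data component is implemented, and it will not reproduce the same partition Azure ML produces (the random number generators differ); the function name and the stand-in row numbers are mine.

```python
import random


def split_rows(rows, fraction=0.7, seed=21468):
    """Shuffle a copy of the rows with a fixed seed, then cut at the given fraction."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split repeatable
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the roughly hundred rows in the data set.
train, test = split_rows(range(100))
print(len(train), len(test))
```

The point of fixing the seed is that rerunning the experiment gives the same training and test rows every time, so changes in the results come from the model and not from a different shuffle.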

Build the Model

My model is a Linear Regression, so I dragged that onto the canvas.  I also dragged on a Train Model component, a Score Model component, and an Evaluate Model component.  The only thing that needed tweaking was the Train Model component, whose selected column is “Spoke” because I want to try to determine whether I’m likely to speak at a particular event.  The final model looks like this:

[Image: SQLSaturdayModel]

Check the Model

When I ran this model, I got back my results in approximately 30 seconds.  My model results show an R^2 of 0.48, which is pretty good for a social sciences result.
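
As a reminder of what that number means, R^2 is one minus the ratio of the squared prediction errors to the squared deviations from the mean.  Here is a small Python sketch of the calculation; the four actual and predicted values are made up for illustration and are not from my experiment.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values: four events, with 1 meaning "spoke" and 0 meaning "did not".
actual = [1, 0, 1, 0]
predicted = [0.8, 0.2, 0.6, 0.4]
print(r_squared(actual, predicted))  # about 0.6
```

An R^2 of 0.48 therefore says the model's predictions remove roughly half of the squared error you would get by guessing the average of Spoke for every event in the test set.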

[Image: SQLSaturdayModelResults]

An interesting question is what would improve the R^2 here.  Maybe variables around how many other events there were, the size of the event, and whether I had just gone to a SQL Saturday the week before would help, but I’m satisfied with these results.  So now I’m going to take the next step:  setting up a web service.

Setting Up A Service

Setting up a web service is pretty easy:  click the “Set Up Web Service” button at the bottom of the dashboard.

[Image: SetUpWebService]

Doing this changes our canvas and gives us a new tab:  Predictive experiment.

[Image: PredictiveExperiment]

After running the predictive experiment, you can then click “Deploy Web Service” at the bottom of the screen to deploy it.  From there, you’ll get a web service screen which explains how to connect to the web service and how to test it.
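
If you would rather call the service from code, the request is an HTTP POST with a JSON body and an API key.  The Python sketch below only builds the request.  The URL and key are placeholders, and the payload shape (an "Inputs" object with "ColumnNames" and "Values", plus a Bearer token header) is my assumption based on the sample code the web service screen typically shows, so check it against the sample on your own service page.

```python
import json
import urllib.request

# Placeholders: copy the real values from your own web service screen.
SERVICE_URL = "https://example.invalid/your-service/execute"
API_KEY = "your-api-key"

# Column names as selected in the experiment. Including the Spoke label
# column in the input is an assumption about the service's input schema.
COLUMNS = ["Region", "Month", "International", "PersonalConnection", "Spoke"]


def build_request(values, url=SERVICE_URL, api_key=API_KEY):
    """Build (but do not send) a POST request scoring one row of values."""
    payload = {
        "Inputs": {"input1": {"ColumnNames": COLUMNS, "Values": [values]}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


# A Midwest event in February where I know the committee; Spoke is a dummy value.
request = build_request(["Midwest", "2", "0", "1", "0"])

# Sending it requires a deployed service, so the call is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before pointing it at a live endpoint.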

Testing the Web Service

I want to do seven tests for the web service:

  1. Cleveland in February, and I know the committee.  I expect “Spoke” to be very close to 1, meaning nearly a 100% likelihood of speaking.  Actual Result:  0.73.
  2. Baltimore in August, and I know the committee.  I expect “Spoke” to be somewhere around 0.5 because I do some Mid-Atlantic SQL Saturdays but not too many of them.  Actual Result:  0.52.
  3. Dallas, Texas in May.  I don’t know the committee.  I expect “Spoke” to be relatively close to 0 because I didn’t do any SQL Saturdays in the South.  Actual Result:  0.009.
  4. Kansas City in October.  I know the committee.  I expect the model to put “Spoke” relatively close to 0, because it is missing some information (like how delicious the barbecue crawl is).  In reality, I want to go and would put the probability closer to 1.  Actual Result:  0.67.
  5. Berlin in June.  I don’t know the committee.  I expect “Spoke” to be very close to 0.  Actual Result:  0.01.
  6. Raleigh in October.  I am the law.  I expect “Spoke” to be nearly 1.  Actual Result:  0.91.
  7. Raleigh in December.  I am the law.  I expect “Spoke” to be high, but less than case 6.  Actual Result:  0.898.

[Image: TestResults]

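In these cases, the model was accurate 7 out of 7 times, and it tripped me up once when I expected it to fail:  because I had actually missed Kansas City in October, I expected a score near 0 there, and the model returned 0.67 instead.  If you’re a harsh grader, you could count Cleveland against it, since I expected that score to be higher than 0.73.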
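
One simple way to turn these scores into yes-or-no predictions is to apply a cutoff.  The Python sketch below uses the seven scores from the list above; the 0.5 cutoff and the function name are my assumptions for illustration, not something Azure ML applies for a regression model.

```python
# Scores copied from the seven test cases above.
scores = {
    "Cleveland, February": 0.73,
    "Baltimore, August": 0.52,
    "Dallas, May": 0.009,
    "Kansas City, October": 0.67,
    "Berlin, June": 0.01,
    "Raleigh, October": 0.91,
    "Raleigh, December": 0.898,
}


def predicted_to_speak(score, cutoff=0.5):
    """Treat a score at or above the cutoff as a 'will speak' prediction."""
    return score >= cutoff


likely = [event for event, score in scores.items() if predicted_to_speak(score)]
for event in likely:
    print(event)
```

With that cutoff, five of the seven events come out as likely, and Dallas and Berlin do not.  Baltimore sits barely above the line at 0.52, which matches my coin-flip expectation for it.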

Conclusions

This is obviously a toy model with a small data set, but it shows the power of what you can do with Azure ML.  With just a few clicks, you can be up and running with a data set, and after building a web service, you can easily consume it in your own application.  This significantly shortens the development time necessary to go from a basic plan to a production process.
