Last week, I promised some changes to the Big Data and Visualization Microsoft Cloud Workshop. The bulk of these changes are now in development and I wanted to cover what some of the changes look like prior to their release.

Goodbye, Azure ML

In the current architecture, you can see that we use both Databricks and Azure Machine Learning.

All of the machine learning options.

We train a model using Databricks, save it, and use a notebook to move the model to Azure ML where we can serve it using Azure Container Instances for web application consumption. This is great, except that the Azure Machine Learning team has removed AzureML-PySpark-MmlSpark-0.15 from their list of curated environments and building your own environment to host Spark is…not something I wanted to cram into this lab. Instead, we’re going to take advantage of MLflow model serving on Databricks, which is rather convenient, except that there doesn’t appear to be an SDK option to enable it. Still, this does let me simplify the architecture a bit:

Airbrushed like a Moscow pro.

WTF, .NET

The Databricks REST API is similar to what Azure ML provided, but not quite the same. For example, the prediction model we built doesn’t return a confidence score, so the segment in Exercise 8’s web app which displays confidence is gone. The big change to this app, however, is an oddity in the way you call the Databricks REST API.

First, you need to use a Personal Access Token (PAT) to make calls. That’s not a big deal; you simply add a bearer token:

client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", pat);

I could put this as a header on each HttpContent call, but 100% of calls need the token, so I figured I might as well just put it on the HttpClient. The real sticky wicket, though, came when I called the Databricks API and got 500 errors because it wants the JSON in a specific format.
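To make the setup concrete, here is a minimal sketch of the client configuration. The workspace URL, model name, and token value are all hypothetical placeholders; the endpoint shape shown is the Databricks model serving invocations URL, but substitute your own workspace details.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

var client = new HttpClient();
string pat = "dapi-example-token"; // placeholder; generate a real PAT in Databricks

// Every call needs the PAT, so set it once on the HttpClient rather than per request.
client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", pat);

// Hypothetical serving endpoint; substitute your workspace URL, model name, and version.
var uri = new Uri("https://adb-1234567890.1.azuredatabricks.net/model/FlightDelays/1/invocations");
var content = new StringContent("{\"columns\":[],\"data\":[]}", Encoding.UTF8, "application/json");

// In the real app you would then send it:
//   var response = await client.PostAsync(uri, content);
//   var body = await response.Content.ReadAsStringAsync();
Console.WriteLine(client.DefaultRequestHeaders.Authorization);
```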

When building a Postman call, I successfully called the API with normal JSON syntax: { "Attribute":"Value", "Attribute2":"Value2" ... }.

Instead of that nice, easy-to-build JSON structure, which works fine in Postman, something in the .NET call required switching to the Pandas split DataFrame format: { "columns":["Attribute", "Attribute2", ...], "data":[["Value", "Value2", ...]] }.
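Hand-building that split format in C# looks something like the sketch below. The field names are hypothetical stand-ins for the model's actual features, and I'm using System.Text.Json here purely for illustration.

```csharp
using System;
using System.Text.Json;

// "columns" holds the field names; "data" holds one array of values per row,
// in the same order as the columns.
var payload = new
{
    columns = new[] { "OriginAirportCode", "Month", "DayofMonth" },
    data = new[] { new object[] { "SEA", 4, 19 } }
};

string json = JsonSerializer.Serialize(payload);
Console.WriteLine(json);
// {"columns":["OriginAirportCode","Month","DayofMonth"],"data":[["SEA",4,19]]}
```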

This is not trivial to build in .NET. If I were going to do this across several calls and a lot of code, I'd probably end up using reflection to loop through the list of field names and build a mapper to reshape the data as arrays. Because this is a demo, I hard-coded the attribute names (knowing that they haven't changed in several years) and used a method to perform all of that mapping work. It's not beautiful by any means, but it serves the purpose.
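For the curious, a hedged sketch of that reflection approach: loop over an object's public properties, turning names into "columns" and values into "data". The input row and its field names are hypothetical, and note that property order from reflection matches declaration order in practice but isn't formally guaranteed.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Hypothetical input row; in the real app this would be the model's feature set.
var row = new { OriginAirportCode = "SEA", Month = 4, DayofMonth = 19 };

// Reflect over the properties to reshape the object into the split format.
var props = row.GetType().GetProperties();
var payload = new
{
    columns = props.Select(p => p.Name).ToArray(),
    data = new[] { props.Select(p => p.GetValue(row)).ToArray() }
};

string splitJson = JsonSerializer.Serialize(payload);
Console.WriteLine(splitJson);
```

The upside is that adding or renaming a feature on the input type automatically flows through to the payload, which is the main reason to bother with reflection at all.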

One-Click Deployment, and Then More Clicks Later

The last big change to the lab is to add an ARM template to deploy the Databricks workspace, Azure Blob Storage account (and a container named sparkcontainer), and Data Factory. There’s still an optional step to create a VM which I left as step-by-step work rather than a deployment script because it’s optional—you can choose to deploy a VM or simply install the Integration Runtime on a Windows machine of your choice.
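The actual template ships with the refresh, but a minimal sketch of an ARM template covering those three resources might look like this. All names, apiVersions, and SKUs below are placeholder assumptions, not the lab's final values:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "storageAccountName": { "type": "string" }
  },
  "resources": [
    {
      "type": "Microsoft.Databricks/workspaces",
      "apiVersion": "2018-04-01",
      "name": "bigdatamcw-databricks",
      "location": "[resourceGroup().location]",
      "sku": { "name": "premium" },
      "properties": {
        "managedResourceGroupId": "[concat(subscription().id, '/resourceGroups/bigdatamcw-managed')]"
      }
    },
    {
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2021-04-01",
      "name": "[parameters('storageAccountName')]",
      "location": "[resourceGroup().location]",
      "sku": { "name": "Standard_LRS" },
      "kind": "StorageV2"
    },
    {
      "type": "Microsoft.Storage/storageAccounts/blobServices/containers",
      "apiVersion": "2021-04-01",
      "name": "[concat(parameters('storageAccountName'), '/default/sparkcontainer')]",
      "dependsOn": [
        "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]"
      ]
    },
    {
      "type": "Microsoft.DataFactory/factories",
      "apiVersion": "2018-06-01",
      "name": "bigdatamcw-adf",
      "location": "[resourceGroup().location]"
    }
  ]
}
```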

When Will This Be Available?

I am jumping the gun a little bit on announcing these changes, as they aren’t quite available yet. They, along with any other changes I make before the deadline, should be available sometime in November with this refresh. I’ll have an announcement when it is officially out, but now you have a sneak peek and can wait with bated breath. Or something.
