Upcoming Events: SQL Saturday Albany Pre-Con

Key Details

What: SQL Saturday Albany Pre-Con
Where: Hilton Garden Inn Albany / SUNY Area, 1389 Washington Avenue, Albany, NY 12206
When: Friday, July 19th.
Admission is $150. Register on Eventbrite.

What I’m Presenting

08:30 AM — 05:00 PM — R for the Data Professional

This is my full-day R training under a slightly different name. Over the course of the day, we will cover a broad swath of the R ecosystem, giving you a general familiarity and setting you up to learn more. We will also spend a considerable amount of time covering scenarios specific to database administrators and other data platform specialists who might think that R is just for data scientists.

Machine Learning with .NET: Model Serialization

This is part five in a series on Machine Learning with .NET.

So far in this series, we’ve looked at training a model at runtime. Even in the prior post, where we looked at predictions, we first had to train a model. With a Naive Bayes model against a small data set, that’s not an onerous task—it’s maybe a second or two the first time we instantiate our objects, and we can leave the prediction engine around for the lifetime of our application, so we are able to amortize the cost fairly effectively.
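To make the amortization point concrete, here is a minimal sketch of the pattern I have in mind, reusing the BillsModelTrainer, RawInput, and Prediction types from earlier posts in the series (and shown again further down this page); the class name and lazy-singleton shape are illustrative rather than what the repo actually does:

// A hedged sketch: train once, lazily, and reuse the same prediction engine for
// the lifetime of the application. The class name and data path are illustrative.
// Note that PredictionEngine is not thread-safe, so a multi-threaded app would
// want something like PredictionEnginePool rather than a raw singleton.
public static class CachedPredictor
{
	private static readonly Lazy<PredictionEngineBase<RawInput, Prediction>> _engine =
		new Lazy<PredictionEngineBase<RawInput, Prediction>>(() =>
		{
			var mlContext = new MLContext(seed: 9997);
			var bmt = new BillsModelTrainer();
			var data = bmt.GetRawData(mlContext, "Resources\\2018Bills.csv");
			var trainer = mlContext.MulticlassClassification.Trainers.NaiveBayes(
				labelColumnName: "Label", featureColumnName: "Features");
			var model = bmt.TrainModel(mlContext, data, trainer);
			return mlContext.Model.CreatePredictionEngine<RawInput, Prediction>(model);
		});

	// The training cost is paid on first access; every later call reuses the engine.
	public static PredictionEngineBase<RawInput, Prediction> Engine => _engine.Value;
}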

Suppose, however, that we have a mighty oak of a neural network which we have trained over the past two months. It finally completed training and we now have a set of weights. Obviously, we don’t want to retrain this if we can avoid it, so we need to find a way to serialize our model and save it to disk somewhere so that we can re-use it later.

Saving a Model to Disk

Saving a model to disk is quite easy. Here’s the method I created to do just that:

public void SaveModel(MLContext mlContext, ITransformer model, string modelPath)
{
	using (var stream = File.Create(modelPath))
	{
		mlContext.Model.Save(model, null, stream);
	}
}

We create a file stream and use the built-in Save() method to write our data to that stream. Here is an example of me calling that method:

string modelPath = "C:\\Temp\\BillsModel.mdl";
bmt.SaveModel(mlContext, model, modelPath);

This generates a binary .mdl file which contains enough information to reconstitute our model and generate predictions off of it.

Loading a Model from Disk

Here’s where things get annoying again. If you have any kind of custom mapping, you need to do a bit of extra work, as I mentioned in my rant.
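For context, the extra work stems from how ML.NET serializes custom mappings: each mapping needs to live in a class deriving from CustomMappingFactory, tagged with a contract name, so the loader can find it again in a registered assembly. Here is a rough sketch of the general shape, with hypothetical ExampleCustomMappings, ExampleInput, and ExampleOutput names standing in for the real classes in my repo:

// A hedged sketch of the pattern ML.NET expects for serializable custom mappings.
// ExampleCustomMappings, ExampleInput, and ExampleOutput are hypothetical stand-ins
// for the real classes in the repo.
[CustomMappingFactoryAttribute("ExampleMapping")]
public class ExampleCustomMappings : CustomMappingFactory<ExampleInput, ExampleOutput>
{
	// The translation logic itself: here, a trivial string mapping.
	public static void Mapping(ExampleInput input, ExampleOutput output) =>
		output.Category = input.RawValue == "Josh Allen" ? "Starter" : "Other";

	// Called at load time so the mapping can be rehydrated from whichever
	// assembly we registered with the ComponentCatalog.
	public override Action<ExampleInput, ExampleOutput> GetMapping() => Mapping;
}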

Next, when it comes time to deserialize the model, we need to register assemblies. I have two custom mappings but they’re both in the same assembly. Therefore, I only need to register the assembly using one of them. Should I desire to migrate these mappings into separate assemblies later, I would need to update the code to include each assembly at least once. I’m not sure yet which is a better practice: include all custom assemblies regardless of whether you need them, or go back and modify calling code later. What I do know is that it’s a bit annoying when all I really wanted was a simple string or integer translation. Anyhow, here’s the code:

mlContext.ComponentCatalog.RegisterAssembly(typeof(QBCustomMappings).Assembly);
mlContext.ComponentCatalog.RegisterAssembly(typeof(PointsCustomMappings).Assembly);

Once I’ve registered those assemblies, I can reconstitute the model using a method call:

var newModel = bmt.LoadModel(mlContext, modelPath);

That method call wraps the following code:

public ITransformer LoadModel(MLContext mlContext, string modelPath)
{
	ITransformer loadedModel;
	using (var stream = File.OpenRead(modelPath))
	{
		DataViewSchema dvs;
		loadedModel = mlContext.Model.Load(stream, out dvs);
	}

	return loadedModel;
}

What we’re doing is loading a model (of interface type ITransformer) along with a DataViewSchema, which I don’t need here. I get back my completed model and can use it as though I had just finished training. In fact, I have a test case in my GitHub repo which compares a freshly-trained model against a saved and reloaded model to ensure function calls work and the outputs are the same:

private string GenerateOutcome(PredictionEngineBase<RawInput, Prediction> pe)
{
	return pe.Predict(new RawInput
	{
		Game = 0,
		Quarterback = "Josh Allen",
		Location = "Home",
		NumberOfPointsScored = 17,
		TopReceiver = "Robert Foster",
		TopRunner = "Josh Allen",
		NumberOfSacks = 0,
		NumberOfDefensiveTurnovers = 0,
		MinutesPossession = 0,
		Outcome = "WHO KNOWS?"
	}).Outcome;
}

[Test()]
public void SaveAndLoadModel()
{
	string modelPath = "C:\\Temp\\BillsModel.mdl";
	bmt.SaveModel(mlContext, model, modelPath);

	// Register the assembly that contains 'QBCustomMappings' with the ComponentCatalog
	// so it can be found when loading the model.
	mlContext.ComponentCatalog.RegisterAssembly(typeof(QBCustomMappings).Assembly);
	mlContext.ComponentCatalog.RegisterAssembly(typeof(PointsCustomMappings).Assembly);

	var newModel = bmt.LoadModel(mlContext, modelPath);

	var newPredictor = mlContext.Model.CreatePredictionEngine<RawInput, Prediction>(newModel);
	var po = GenerateOutcome(predictor);
	var npo = GenerateOutcome(newPredictor);

	Assert.IsNotNull(newModel);
	Assert.AreEqual(po, npo);
}

Results, naturally, align.

Conclusion

In today’s post, we looked at the ability to serialize and deserialize models. We looked at saving the model to disk, although we could save to some other location, as mlContext.Model.Save() requires merely a Stream rather than specifically a FileStream. Saving and loading models is straightforward as long as you’ve done all of the prerequisite work around custom mappings (or avoided them altogether).
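As one illustration of that Stream point, here is a minimal sketch, assuming we wanted the serialized model back as a byte array (say, to stash in a database or blob storage) rather than as a file on disk:

// A hedged sketch: Save() accepts any Stream, so a MemoryStream works just as
// well as a FileStream when we'd rather keep the serialized model in memory.
public byte[] SaveModelToBytes(MLContext mlContext, ITransformer model)
{
	using (var stream = new MemoryStream())
	{
		mlContext.Model.Save(model, null, stream);
		return stream.ToArray();
	}
}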

Upcoming Events: SQL Saturday Cork

Key Details

What: SQL Saturday Cork
Where: University College Cork, Western Gateway, Western Road, Cork, Cork, County Cork, T12 XF62, Ireland
When: Saturday, June 29th.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

09:30 AM — 10:30 AM — Launching a Data Science Project

This will be my first visit to Ireland, so it’s something to which I’m quite looking forward. If I’m lucky, Andrew Pruski will help me find a good burrito place.

Machine Learning with .NET: Predictions

This is part four in a series on Machine Learning with .NET.

Thanks to some travels, part 4 in the series is a bit later than I originally anticipated. In the prior two posts, we looked at different ways to create, train, and test models. In this post, we’ll use a trained model and generate predictions from it. We’ll go back to the model from part two.

Predictions in a Jar

Our first step will be to create a test project. This way, we can try out our code without the commitment of a real project. Inside that test project, I’m going to include NUnit bits and prep a setup method. I think I also decided to figure out how many functional programming patterns I could violate in a single file, but that was an ancillary goal. Here is the code file; we’ll work through it block by block.

Setting Things Up

First, we have our setup block.

public class Tests
{
	MLContext mlContext;
	BillsModelTrainer bmt;
	IEstimator<ITransformer> trainer;
	ITransformer model;
	PredictionEngineBase<RawInput, Prediction> predictor;
		

	[SetUp]
	public void Setup()
	{
		mlContext = new MLContext(seed: 9997);
		bmt = new BillsModelTrainer();

		var data = bmt.GetRawData(mlContext, "Resources\\2018Bills.csv");
		var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.25);

		trainer = mlContext.MulticlassClassification.Trainers.NaiveBayes(labelColumnName: "Label", featureColumnName: "Features");
		model = bmt.TrainModel(mlContext, split.TrainSet, trainer);

		predictor = mlContext.Model.CreatePredictionEngine<RawInput, Prediction>(model);
	}

In here, I have mutable objects for context, the Bills model trainer class, the output trainer, the output model, and a prediction engine which takes data of type RawInput and generates predictions of type Prediction. This means that any data we feed into the model must have the same shape as our raw data.

The Setup() method will run before each batch of tests. In it, we build a new ML context with a pre-generated seed of 9997. That way, every time I run this I’ll get the same outcomes. We also new up our Bills model trainer and grab raw data from the Resources folder. I’m going to use the same 2018 game data that we looked at in part 2 of the series, as that’s all of the data I have. We’ll pull out 25% of the games for testing and retain the other 75% for training. For the purposes of this demo, I’m using the Naive Bayes classifier for my trainer, and so TrainModel() takes but a short time, returning me a model.

To use this model for prediction generation, I call mlContext.Model.CreatePredictionEngine with the model as my parameter and explain that I’ll give the model an object of type RawInput and expect back an object of type Prediction. The Prediction class is trivial:

public class Prediction
{
	[ColumnName("PredictedOutcome")]
	public string Outcome { get; set; }
}

I would have preferred the option to get back a simple string but it has to be a custom class.

Knocking Things Down

Generating predictions is easy once you have the data you need. Here’s the test function, along with five separate test cases:

[TestCase(new object[] { "Josh Allen", "Home", 17, "Robert Foster", "LeSean McCoy", "Win" })]
[TestCase(new object[] { "Josh Allen", "Away", 17, "Robert Foster", "LeSean McCoy", "Win" })]
[TestCase(new object[] { "Josh Allen", "Home", 17, "Kelvin Benjamin", "LeSean McCoy", "Win" })]
[TestCase(new object[] { "Nathan Peterman", "Home", 17, "Kelvin Benjamin", "LeSean McCoy", "Loss" })]
[TestCase(new object[] { "Josh Allen", "Away", 7, "Charles Clay", "LeSean McCoy", "Loss" })]
public void TestModel(string quarterback, string location, float numberOfPointsScored,
	string topReceiver, string topRunner, string expectedOutcome)
{
	var outcome = predictor.Predict(new RawInput
	{
		Game = 0,
		Quarterback = quarterback,
		Location = location,
		NumberOfPointsScored = numberOfPointsScored,
		TopReceiver = topReceiver,
		TopRunner = topRunner,
		NumberOfSacks = 0,
		NumberOfDefensiveTurnovers = 0,
		MinutesPossession = 0,
		Outcome = "WHO KNOWS?"
	});

	Assert.AreEqual(expectedOutcome, outcome.Outcome);
}

What I’m doing here is taking in a set of parameters that I’d expect my users could input: quarterback name, home/away game, number of points scored, leading receiver, and leading scorer. I also need to pass in my expected test result—that’s not something I’d expect a user to give me, but is important for test cases to make sure that my model is still scoring things appropriately.

The Predict() method on my predictor takes in a RawInput object, so I can generate that from my input data. Recall that we ignore some of those raw inputs, including game, number of sacks, number of defensive turnovers, and minutes of possession. Instead of bothering my users with that data, I just pass in defaults because I know we don’t use them. In a production-like scenario, I’m more and more convinced that I’d probably do all of this cleanup before model training, especially when dealing with wide data sets with dozens or maybe hundreds of inputs.

Note as well that I need to pass in the outcome. That outcome doesn’t have to be our best guess or anything reasonable—it just needs to have the same data type as our original input data set.

What we get back is a series of hopefully-positive test results:

Even in this model, Kelvin Benjamin drops the ball.

Predictions are that easy: a simple method call and you get back an outcome of type Prediction which has a single string property called Outcome.

Trying Out Different Algorithms

We can also use this test project to try out different algorithms and see what performs better given our data. To do that, I’ll build an evaluation tester. The entire method looks like this:

[TestCase(new object[] { "Naive Bayes" })]
[TestCase(new object[] { "L-BFGS" })]
[TestCase(new object[] { "SDCA Non-Calibrated" })]
public void BasicEvaluationTest(string trainerToUse)
{
	mlContext = new MLContext(seed: 9997);
	bmt = new BillsModelTrainer();

	var data = bmt.GetRawData(mlContext, "Resources\\2018Bills.csv");
	var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.4);

	// If we wish to review the split data, we can run these.
	var trainSet = mlContext.Data.CreateEnumerable<RawInput>(split.TrainSet, reuseRowObject: false);
	var testSet = mlContext.Data.CreateEnumerable<RawInput>(split.TestSet, reuseRowObject: false);

	IEstimator<ITransformer> newTrainer;
	switch (trainerToUse)
	{
		case "Naive Bayes":
			newTrainer = mlContext.MulticlassClassification.Trainers.NaiveBayes(labelColumnName: "Label", featureColumnName: "Features");
			break;
		case "L-BFGS":
			newTrainer = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(labelColumnName: "Label", featureColumnName: "Features");
			break;
		case "SDCA Non-Calibrated":
			newTrainer = mlContext.MulticlassClassification.Trainers.SdcaNonCalibrated(labelColumnName: "Label", featureColumnName: "Features");
			break;
		default:
			newTrainer = mlContext.MulticlassClassification.Trainers.NaiveBayes(labelColumnName: "Label", featureColumnName: "Features");
			break;
	}

	var newModel = bmt.TrainModel(mlContext, split.TrainSet, newTrainer);
	var metrics = mlContext.MulticlassClassification.Evaluate(newModel.Transform(split.TestSet));

	Console.WriteLine($"Macro Accuracy = {metrics.MacroAccuracy}; Micro Accuracy = {metrics.MicroAccuracy}");
	Console.WriteLine($"Confusion Matrix with {metrics.ConfusionMatrix.NumberOfClasses} classes.");
	Console.WriteLine($"{metrics.ConfusionMatrix.GetFormattedConfusionTable()}");

	Assert.AreNotEqual(0, metrics.MacroAccuracy);
}

Basically, we’re building up the same set of operations that we did in the Setup() method, but because that setup method has mutable objects, I don’t want to run the risk of overwriting the “real” model with one of these tests, so we do everything locally.

The easiest way I could think of to handle multiple trainers in my test case was to put in a switch statement over my set of valid algorithms. That makes the method brittle with respect to future changes (where every new algorithm I’d like to test involves modifying the test method itself) but it works for our purposes here. After choosing the trainer, we walk through the same modeling process and then spit out a few measures: macro and micro accuracy, as well as a confusion matrix.
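If that brittleness ever became a real problem, one alternative (a sketch of an idea, not what the repo actually does) would be to replace the switch statement with a dictionary of trainer factories, so adding an algorithm means adding a single entry:

// A hedged alternative sketch: map trainer names to factory functions so a new
// algorithm only requires a new dictionary entry rather than a new case label.
var trainerFactories = new Dictionary<string, Func<IEstimator<ITransformer>>>
{
	["Naive Bayes"] = () => mlContext.MulticlassClassification.Trainers.NaiveBayes(
		labelColumnName: "Label", featureColumnName: "Features"),
	["L-BFGS"] = () => mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
		labelColumnName: "Label", featureColumnName: "Features"),
	["SDCA Non-Calibrated"] = () => mlContext.MulticlassClassification.Trainers.SdcaNonCalibrated(
		labelColumnName: "Label", featureColumnName: "Features")
};

// Fall back to Naive Bayes for unknown names, mirroring the switch's default case.
IEstimator<ITransformer> newTrainer = trainerFactories.TryGetValue(trainerToUse, out var makeTrainer)
	? makeTrainer()
	: trainerFactories["Naive Bayes"]();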

To understand macro versus micro accuracy (with help from ML.NET and Sebastian Raschka):

  • Micro accuracy sums up all of your successes and failures, whether the outcome is win, lose, draw, suspension due to blizzard, or cancellation on account of alien abduction. We add up all of those cases and if we’re right 95 out of 100 times, our micro accuracy is 95%.
  • Macro accuracy calculates the arithmetic mean of the individual class accuracies. If our accuracies for different outcomes are: [ 85% (win), 80% (loss), 50% (draw), 99% (suspension due to blizzard), 33% (cancellation on account of alien abduction) ], we sum them up and divide by the number of outcomes: (0.85 + 0.8 + 0.5 + 0.99 + 0.33) / 5 = 69.4%.

Both of these measures are useful but macro accuracy has a tendency to get pulled down when a small category is wrong. Let’s say that there are 3 alien abduction scenarios out of 100,000 games and we got 1 of those 3 correct. Unless we were trying to predict for alien abductions, the outcome is uncommon enough that it’s relatively unimportant that we get it right. Our macro accuracy is 69.4% but if we drop that class, we’re up to 78.5%. For this reason, micro accuracy is typically better for heavily imbalanced classes.
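To make the difference concrete, here is a small sketch using invented per-class counts chosen to roughly match the percentages above and the 100,000-game scenario; only the relative class sizes matter:

// A hedged sketch with made-up per-class counts. Micro accuracy weights every
// game equally; macro accuracy weights every class equally, so the tiny
// alien-abduction class drags the macro number down.
var classes = new (string Outcome, int Correct, int Total)[]
{
	("Win", 51000, 60000),                 // 85%
	("Loss", 28000, 35000),                // 80%
	("Draw", 2000, 4000),                  // 50%
	("Blizzard suspension", 990, 1000),    // 99%
	("Alien abduction", 1, 3)              // 33%
};

double microAccuracy = (double)classes.Sum(c => c.Correct) / classes.Sum(c => c.Total);  // roughly 82.0%
double macroAccuracy = classes.Average(c => (double)c.Correct / c.Total);                // roughly 69.5%

Console.WriteLine($"Micro = {microAccuracy:P1}; Macro = {macroAccuracy:P1}");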

Finally, we print out a confusion matrix, which shows us what each model did in terms of predicted category versus actual outcome.

Conclusion

Generating predictions with ML.NET is pretty easy, definitely one of the smoothest experiences when working with this product. Although we looked at a test in this blog post, applying this to real code is simple as well, as you can see in my MVC project demo code (particularly the controller). In this case, I made the conscious decision to train anew at application start because training is fast. In a real project, we’d want to save a model to disk and load it, which is what we’ll cover next time.

DataCamp Course: Time Series Analysis in SQL Server

I have a DataCamp course which just launched entitled Time Series Analysis in SQL Server.

Course Description

SQL Server has a robust set of tools to prepare, aggregate, and query time series data. This course will show you how to build and work with dates, parse dates from strings (and deal with invalid strings), and format dates for reporting. From there, you will see how SQL Server’s built-in aggregation operators and window functions can solve important business problems like calculating running totals, finding moving averages, and displaying month-over-month differences using realistic sample data sets. You will also see how taking a different perspective on your data can solve difficult problems.

Chapter 1 is available for free, so you can check it out at no risk of accidentally giving me money. This chapter also includes one of the biggest tricks behind working with time series data in SQL Server: calendar tables, plus a bonus bit on the APPLY operator. The other three chapters are great too, but those are for closers (read: subscribers).

Upcoming Events: Datagrillen

Key Details

What: Datagrillen
Where: Kaiserstraße 10b, 49809 Lingen (Ems), Germany
When: Thursday, June 20th and Friday, June 21st.
Tickets appear not to be available any longer.

What I’m Presenting

01:15 PM — 02:30 PM, Thursday — Approaching Zero

I have a full series put together on the topic, all for this presentation. This will be my first time at Datagrillen (nee SQLgrillen) but I’ve heard great things about it. Sadly, it looks like my SC Freiburg jersey won’t arrive in time for me to bring it, so I’ll need to show my regional biases some other way.

Machine Learning with .NET: Model Builder

This is part three in a series on Machine Learning with .NET.

Last time on 36 Chambers, we looked at building our own model to explain why Kelvin Benjamin is awful. Today, we’re going to use the ML.NET Model Builder to understand how awful airlines are. The unofficial motto of this blog, after all, is “Parades need rain, too.”

The ML.NET Model Builder is still in preview, so expect things to change; some of those changes might be breaking changes, too. If you want to play along, download the Model Builder from the Visual Studio Marketplace. This works for Visual Studio 2017 and Visual Studio 2019, so you’ll need one of those. Community Edition is supported as well, so you don’t need a fancy version of Visual Studio.

After you install the extension, you’ll need to be in a solution with at least one project in it. The Model Builder creates new projects but is a context menu option off of a project itself. If you want to play along at home, clone my GitHub repo. I also have some instructions on the repo, which is rare for me.

First up, right-click on any of the projects and select Add --> Machine Learning.

We’ll add machine learning, all right, just not here. Not Europa.

This brings up a wizard tab to build a machine learning model.

Choose the form of the destructor.

You have four options from which to choose: two-class classification, multi-class classification, regression, or Choose Your Own Adventure. Today, we’re going to create a two-class classification model. Incidentally, they’re not kidding about things changing in preview—last time I looked at this, they didn’t have multi-class classifiers available.

Once you select Sentiment Analysis (that is, two-class classification of text), you can figure out how to feed data to this trainer.

Twitter is a garbage hole and airlines are great garbage magnets, so this should be grand.

As of right now, we can use delimited files or read from a table in SQL Server. Note the 1 GB limit for maximum file size; this is not a solution for very large data sets.

Once we select a file, we get the option to predict a label. As of the time of writing, this label must have a value of either 1 or 0. Our tweets are in the text column, so we’ll use that to predict sentiment.

From there, we move on to training.

This model is definitely a 10.

We don’t specify the algorithm or algorithms to use, but we do specify the number of seconds allowed for training. This is an interesting parameter: basically, the more seconds you allow for training, the more likely you are to find a good model. I don’t have any insight into the specifics of what happens during this training period; my expectation is that they iterate through a set of algorithms and try progressively more complicated sets of inputs and transformations. The ML.NET team does have a set of guidelines around how long you might want to train given data sizes. I’m going to train for 4 minutes because I didn’t plan far enough ahead to give myself ten minutes to let it run. Here it is in progress:

Hard at work on training, boss. Training all kinds of stuff here.

Eventually, we run out of time and we have a winner:

Like a proper horse race, this one changed just as we ran out of time. The most exciting 4 minutes in sports.

Now it is time to evaluate. We get to see a table with the five best models on a few criteria: accuracy, Area Under the Curve (AUC), Area Under the Precision-Recall Curve (AUPRC), F1 score, and the amount of time it took to train.

One of these things is not like the others.

One thing I wish I could see here was a breakdown of our key confusion matrix measures—AUC, AUPRC, and F1 score are fine combination metrics and are very useful, but when you have major discrepancies in class makeup, recall by itself can mean a lot. For example, suppose we have a test for a medical condition which affects 1 in 100,000 people. We can get a 99.999% accuracy if we simply return “Negative result” for everybody, but that’s an awful test because the recall is zero: we never catch any of the positive results. It could be that I am not clever enough to interpret these results and infer the base values, though.

Regardless, our next step is to generate some code. Click on the “Code” link and you get the opportunity to add new projects. Note that you do not get to choose which of the models you want to use; the Model Builder picks the one with the highest accuracy.

Time to make the donuts.

We now have two new projects.

I made the donuts.

We have a model project and a console application. The model project contains our trained model (as a zip file) and input and output files. Here is the input file:

using Microsoft.ML.Data;

namespace DotNetMachineLearningML.Model.DataModels
{
    public class ModelInput
    {
        [ColumnName("airline_sentiment"), LoadColumn(0)]
        public bool Airline_sentiment { get; set; }

        [ColumnName("text"), LoadColumn(1)]
        public string Text { get; set; }
    }
}

And here is the output file:

using System;
using Microsoft.ML.Data;

namespace DotNetMachineLearningML.Model.DataModels
{
    public class ModelOutput
    {
        // ColumnName attribute is used to change the column name from
        // its default value, which is the name of the field.
        [ColumnName("PredictedLabel")]
        public bool Prediction { get; set; }

        public float Score { get; set; }
    }
}

The console application contains a model builder and a Program file. The ModelBuilder C# file allows us to modify or re-train our model should we desire, but is not needed for regular operation. The console application grabs a row at random from our tweet set and shows the prediction versus actual values. Instead of doing that, I’m going to take user input to classify a tweet. Replace the main method with this code:

static void Main(string[] args)
{
	MLContext mlContext = new MLContext();

	ITransformer mlModel = mlContext.Model.Load(GetAbsolutePath(MODEL_FILEPATH), out DataViewSchema inputSchema);
	var predEngine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);

	Console.WriteLine("Enter a sample tweet:");
	var text = Console.ReadLine();

	// Create sample data to do a single prediction with it 
	ModelInput sampleData = new ModelInput { Text = text };

	// Try a single prediction
	ModelOutput predictionResult = predEngine.Predict(sampleData);
	string outcome = predictionResult.Prediction ? "Positive tweet" : "Negative tweet";
	Console.WriteLine($"Single Prediction --> Predicted value: {outcome}");

	Console.WriteLine("=============== End of process, hit any key to finish ===============");
	Console.ReadKey();
}

Change the console application to be the startup project and away we go:

Too soon for TWA jokes?

Conclusion

The Model Builder won’t replace an actual machine learning pipeline anytime soon, but it’s a good way to get started with processing clean data when you have no clue which algorithms you should try.