Machine Learning with .NET: Modeling

This is part two in a series on Machine Learning with .NET.

In the first post in this series, I took a look at the “why” behind ML.NET as well as some of its shortcomings in data processing. In this post, I want to look at an area where it does much better: training models.

A Simple Model: Predicting Victory

In this first example, I’m going to put together a small but complete demonstration of a business problem.

In 2018, the Buffalo Bills went 6-10. Previously on 36 Chambers, we learned how much Kelvin Benjamin dragged the team down. Now we’re going to re-learn it but this time in .NET. We will solve a classification problem with two cases: win or loss.

Data Preparation

Our input features include the starting quarterback, location (home or away), number of points scored, top receiver (by yards), top rusher (by yards), team number of sacks, team number of defensive turnovers, team minutes of possession, and the outcome (our label).

We can represent all of this in a class, RawInput:

public class RawInput
{
	//Game,QB,HomeAway,NumPointsScored,TopReceiver,TopRunner,NumSacks,NumDefTurnovers,MinutesPossession,Outcome
	//1,Peterman,Away,3,Zay Jones,Marcus Murphy,1,1,25,Loss

	[LoadColumn(0)]
	public float Game { get; set; }
	[LoadColumn(1)]
	public string Quarterback { get; set; }
	[LoadColumn(2)]
	public string Location { get; set; }
	[LoadColumn(3)]
	public float NumberOfPointsScored { get; set; }
	[LoadColumn(4)]
	public string TopReceiver { get; set; }
	[LoadColumn(5)]
	public string TopRunner { get; set; }
	[LoadColumn(6)]
	public float NumberOfSacks { get; set; }
	[LoadColumn(7)]
	public float NumberOfDefensiveTurnovers { get; set; }
	[LoadColumn(8)]
	public float MinutesPossession { get; set; }
	[LoadColumn(9)]
	[ColumnName("Label")]
	public string Outcome { get; set; }
}

There are a couple of points I want to make here:

Each attribute receives a tag which represents the order in which we load columns.
Every numeric feature must be a float. Even integers.
We need to specify the label with its own column name.

With a class, I can create a quick function to load my raw data into a list of RawData branded data:

public IDataView GetRawData(MLContext mlContext, string inputPath)
{
	return mlContext.Data.LoadFromTextFile<RawInput>(path: inputPath, hasHeader: true, separatorChar: ',');
}

The IDataView interface is our .NET version of the DataFrame in R or Pandas. The good news here is that just by creating a POCO with a few attributes, I can interact with ML.NET. Right now, loading from text files is the primary data load scenario, but I could see hitting SQL Server or other ODBC sources as well as Excel files, etc. in the future.

Build a Trainer

My next function trains a model. We’re going to use Naive Bayes here as well, just to keep consistent with the prior blog post.

Here are the transformations I’d like to do before feeding in my data:

Translate quarterback name based on a simple rule: Josh Allen maps to Josh Allen and every other QB maps to Nate Barkerson, the man of a million ~~faces~~ interceptions.
Translate number of points scored based on a simple rule: if they scored double digits, return true; otherwise, return false.
Drop the columns for the number of sacks, number of defensive turnovers, and number of minutes of possession. These columns are probably useful but we aren’t going to use them in this Naive Bayes model.
Drop the Game feature, which represents the game number. We don’t need it.

Now if you’ll allow me a rant.

Code Plus a Rant

In order to perform operation #1, I need to perform a custom mapping using mlContext.Transforms.CustomMapping. My rule is exceedingly simple; here it is in C# lambda expression form: name => name == "Josh Allen" ? "Josh Allen" : "Nate Barkerson". Real easy…except it’s not.

See, first I need to build input and output classes for my custom mapping, so it’s really mlContext.Transforms.CustomMapping<QBInputRow, QBOutputRow>. I can’t use a simple type here, either: it has to be a class.

So let’s create some classes:

public class QBInputRow
{
	public string Quarterback { get; set; }
}

public class QBOutputRow
{
	public string QuarterbackName { get; set; }
}

Okay, now that I have classes, I need to put in that lambda. I guess the lambda could change to qb => qb.Quarterback == "Josh Allen" ? "Josh Allen" : "Nate Barkerson" and that’d work except for one itsy-bitsy thing: if I do it the easy way, I can’t actually save and reload my model. Which makes it worthless for pretty much any real-world scenario.

So no easy lambda-based solution for us. Instead, we need a delegate. That’s going to be another class with a static method and a GetMapping() action:

[CustomMappingFactoryAttribute(nameof(QBCustomMappings.QBMapping))]
public class QBCustomMappings : CustomMappingFactory<QBInputRow, QBOutputRow>
{
	// This is the custom mapping. We now separate it into a method, so that we can use it both in training and in loading.
	public static void QBMapping(QBInputRow input, QBOutputRow output) => output.QuarterbackName =
		(input.Quarterback == "Josh Allen") ? "Josh Allen" : "Nate Barkerson";

	// This factory method will be called when loading the model to get the mapping operation.
	public override Action<QBInputRow, QBOutputRow> GetMapping()
	{
		return QBMapping;
	}
}

After creating the QBMapping() function, I can finally reference it: mlContext.Transforms.CustomMapping(QBCustomMappings.QBMapping, nameof(QBCustomMappings.QBMapping)). I need to create three separate classes to do a simple mapping. Oh, and three more classes to map my points scored. That’s six classes I would never have had to create in R or Python.

That’s a lot of boilerplate code considering in my mind, it’s a simple transformation. This leads me to advise against using custom transformations if you can. Instead, do all of your transformations before getting the data, but I think that means you can’t use the easy data load method I showed above (though I could be wrong on that score).

Rant over. Now that I have my mapping classes all built out, my training method looks like this:

public TransformerChain<Microsoft.ML.Transforms.KeyToValueMappingTransformer> TrainModel(
	MLContext mlContext, IDataView data)
{
	var pipeline =
		mlContext.Transforms.CustomMapping<QBInputRow, QBOutputRow>(
			QBCustomMappings.QBMapping, nameof(QBCustomMappings.QBMapping))
		.Append(mlContext.Transforms.CustomMapping<PointsInputRow, PointsOutputRow>(
			PointsCustomMappings.PointsMapping, nameof(PointsCustomMappings.PointsMapping)))
		// We could potentially use these features for a different model like a fast forest.
		.Append(mlContext.Transforms.DropColumns(new[] { "NumberOfSacks", "NumberOfDefensiveTurnovers",
			"MinutesPossession" }))
		.Append(mlContext.Transforms.DropColumns(new[] { "Game", "Quarterback" }))
		.Append(mlContext.Transforms.Concatenate("FeaturesText", new[]
		{
			"QuarterbackName",
			"Location",
			"TopReceiver",
			"TopRunner"
		}))
		.Append(mlContext.Transforms.Text.FeaturizeText("Features", "FeaturesText"))
		// Label is text so it needs to be mapped to a key
		.Append(mlContext.Transforms.Conversion.MapValueToKey("Label"), TransformerScope.TrainTest)
		.Append(mlContext.MulticlassClassification.Trainers.NaiveBayes(labelColumnName: "Label", featureColumnName: "Features"))
		.Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedOutcome", "PredictedLabel"));

	var model = pipeline.Fit(data);

	return model;
}

I’m building out a data pipeline here, which performs transformations in a series, using the Append() method to link parts together similar to |> in F# or %>% in R. It’s not nearly as pretty as either of those solutions, but it’s the best we’re getting with C#.

Our first two operations are the data transformations to get our QB name and “did they score double-digit points?” features. After that, we drop unused features using the DropColumns() method.

The next part deserves a bit of discussion. With ML.NET, we’re only allowed to send in one text column, so we need to combine together all of our string features and “featurize” them. The combination of Concatenate() and FeaturizeText() does this for us.

After we finish that part of the job, we need to turn our “Win” and “Loss” values into key-value mappings. ML.NET needs keys for binary and multi-class classification, as it will not train on labels. We want to keep the labels so we understand which class we’re in, so we compromise by using the MapValueToKey() method.

Then, we want to train using the Naive Bayes algorithm. ML.NET classifies Naive Bayes as a multi-class classifier and not a binary classifier, so we need to use the multi-class set even though our data set has only wins and losses. Finally, after we get back a class key, we need to map that key back to a value and return it. This way, we know our class name.

Finally, we fit the model to our data and return the fitted model.

Training and Evaluating the Model

The actual process of training the model has us retrieve data, split it into training and test data sets, and perform model training. Here is an example:

MLContext mlContext = new MLContext(seed: 9997);
BillsModelTrainer bmt = new BillsModelTrainer();

var data = bmt.GetRawData(mlContext, "Resources\\2018Bills.csv");
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.4);

// If we wish to review the split data, we can run these.
var trainSet = mlContext.Data.CreateEnumerable<RawInput>(split.TrainSet, reuseRowObject: false);
var testSet = mlContext.Data.CreateEnumerable<RawInput>(split.TestSet, reuseRowObject: false);

ITransformer model = bmt.TrainModel(mlContext, split.TrainSet);
var metrics = mlContext.MulticlassClassification.Evaluate(model.Transform(split.TestSet));

Console.WriteLine($"Macro Accuracy = {metrics.MacroAccuracy}; Micro Accuracy = {metrics.MicroAccuracy}");
Console.WriteLine($"Confusion Matrix with {metrics.ConfusionMatrix.NumberOfClasses} classes.");
Console.WriteLine($"{metrics.ConfusionMatrix.GetFormattedConfusionTable()}");

I also threw in model evaluation here because it’s pretty easy to do. We generate an ML context, load our data, and then split it into test and training data. Interestingly, I set the test fraction to 0.4 (or 40%) but it only pulled 25% of my data. I imagine that with a larger data set, I’d see closer to 40% reserved for testing but it’s luck of the draw with just 16 rows. By the way, never trust a model with 12 data points.

Speaking of models, we run the TrainModel() method and get back a model. From there, I can evaluate the model using the Evaluate() method and get back some metrics. For multi-class classification problems, I get back micro accuracy, macro accuracy, and a confusion matrix.

Macro Accuracy = 1; Micro Accuracy = 1
 Confusion Matrix with 2 classes.
 Confusion table
           ||======================
 PREDICTED ||     Loss |      Win | Recall
 TRUTH     ||======================
      Loss ||        3 |        0 | 1.0000
       Win ||        0 |        1 | 1.0000
           ||======================
 Precision ||   1.0000 |   1.0000 |

Oh, I had a 100% correct rate for my test data. Like I said, don’t trust models based off of 12 data points and don’t trust evaluations with 4 data points.

Model Changes

If I want to change the model I use for training, I can change my TrainModel() method. For multi-class classification, we have about a half-dozen models from which to choose:

These models have a few trade-offs, including computational complexity, accuracy, and assumptions regarding the shape of data. Investigate and choose based on your problem and data, but don’t assume every one solves everything equally well.

Cross-Validation

One last thing I want to point out is cross-validation. Doing this with ML.NET is really easy, but I need to get the pipeline out of TrainModel(). If I want to use cross-validation in production, I’d probably have one method which returns the pipeline and a second method which takes a pipeline and training data and generates a model for me. For now, here’s the cross-validation part specifically:

var cvResults = mlContext.MulticlassClassification.CrossValidate(data, pipeline, numberOfFolds: 4);

var microAccuracies = cvResults.Select(r => r.Metrics.MicroAccuracy);
Console.WriteLine(microAccuracies.Average());

I built out four folds, so we train on 12 games and test on 4 games. The average micro-accuracy is .64 or 64%, which is about what I expected. It’s not a great accuracy, but then again, it’s 12 data points.

Conclusion

In today’s post, we looked at training, testing, and evaluating models. I think that overall, this is a reasonably good experience if you have clean data. As soon as you want to perform non-standard transformations in the data pipeline, though, things get busy fast, in a way that we don’t typically see in R or Python.

In the next post in the series, I’ll show a completely different method for building models: the Model Builder.

S	M	T	W	T	F	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30

36 Chambers – The Legendary Journeys: Execution to the max!