DataCamp Course: Time Series Analysis in SQL Server

I have a DataCamp course which just launched entitled Time Series Analysis in SQL Server.

Course Description

SQL Server has a robust set of tools to prepare, aggregate, and query time series data. This course will show you how to build and work with dates, parse dates from strings (and deal with invalid strings), and format dates for reporting. From there, you will see how SQL Server’s built-in aggregation operators and window functions can solve important business problems like calculating running totals, finding moving averages, and displaying month-over-month differences using realistic sample data sets. You will also see how taking a different perspective on your data can solve difficult problems.

Chapter 1 is available for free, so you can check it out at no risk of accidentally giving me money. This chapter includes one of the biggest tricks behind working with time series data in SQL Server (calendar tables) and adds in a bonus bit on the APPLY operator. The other three chapters are great too, but those are for closers subscribers.

Machine Learning with .NET: Model Builder

This is part three in a series on Machine Learning with .NET.

Last time on 36 Chambers, we looked at building our own model to explain why Kelvin Benjamin is awful. Today, we’re going to use the ML.NET Model Builder to understand how awful airlines are. The unofficial motto of this blog, after all, is “Parades need rain, too.”

The ML.NET Model Builder is still in preview, so expect things to change; some of those changes might be breaking changes, too. If you want to play along, download the Model Builder from the Visual Studio Marketplace. This works for Visual Studio 2017 and Visual Studio 2019, so you’ll need one of those. Community Edition is supported as well, so you don’t need a fancy version of Visual Studio.

After you install the extension, you’ll need to be in a solution with at least one project in it. The Model Builder creates new projects but is a context menu option off of a project itself. If you want to play along at home, clone my GitHub repo. I also have some instructions on the repo, which is rare for me.

First up, right-click on any of the projects and select Add --> Machine Learning.

We’ll add machine learning, all right, just not here. Not Europa.

This brings up a wizard tab to build a machine learning model.

Choose the form of the destructor.

You have four options from which to choose: two-class classification, multi-class classification, regression, or Choose Your Own Adventure. Today, we’re going to create a two-class classification model. Incidentally, they’re not kidding about things changing in preview—last time I looked at this, they didn’t have multi-class classifiers available.

Once you select Sentiment Analysis (that is, two-class classification of text), you can figure out how to feed data to this trainer.

Twitter is a garbage hole and airlines are great garbage magnets, so this should be grand.

As of right now, we can use delimited files or read from a table in SQL Server. Note the 1 GB limit for maximum file size; this is not a solution for very large data sets.

Once we select a file, we get the option to predict a label. As of the time of writing, this label must have a value of either 1 or 0. Our tweets are in the text column, so we’ll use that to predict sentiment.

From there, we move on to training.

This model is definitely a 10.

We don’t specify the algorithm or algorithms to use, but we do specify the number of seconds allowed for training. This is an interesting parameter: basically, the more seconds you allow for training, the more likely you are to find a good model. I don’t have any insight into the specifics of what happens during this training period; my expectation is that it iterates through a set of algorithms and tries progressively more complicated sets of inputs and transformations. The ML.NET team does have a set of guidelines around how long you might want to train given data sizes. I’m going to train for 4 minutes because I didn’t plan far enough ahead to give myself ten minutes to let it run. Here it is in progress:

Hard at work on training, boss. Training all kinds of stuff here.

Eventually, we run out of time and we have a winner:

Like a proper horse race, this one changed just as we ran out of time. The most exciting 4 minutes in sports.

Now it is time to evaluate. We get to see a table with the five best models on a few criteria: accuracy, Area Under the Curve (AUC), Area Under the Precision-Recall Curve (AUPRC), F1 score, and the amount of time it took to train.

One of these things is not like the others.

One thing I wish I could see here is a breakdown of the underlying confusion matrix measures—AUC, AUPRC, and F1 score are fine combination metrics and are very useful, but when you have major discrepancies in class makeup, recall by itself can mean a lot. For example, suppose we have a test for a medical condition which affects 1 in 100,000 people. We can get 99.999% accuracy if we simply return “Negative result” for everybody, but that’s an awful test because the recall is zero: we never catch any of the positive results. It could be that I am not clever enough to interpret these results and infer the base values, though.
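
To make that accuracy-versus-recall point concrete, here is a quick back-of-the-envelope sketch in plain C# (no ML.NET involved) of that hypothetical always-negative test:

using System;

public static class RecallDemo
{
    public static void Main()
    {
        // Hypothetical screening test for a condition affecting 1 in 100,000 people,
        // where the "model" predicts negative for absolutely everyone.
        const long population = 100_000;
        const long truePositives = 0;                  // we never flag anyone
        const long falseNegatives = 1;                 // the one real case we miss
        const long trueNegatives = population - falseNegatives;

        double accuracy = (double)(truePositives + trueNegatives) / population;
        double recall = (double)truePositives / (truePositives + falseNegatives);

        Console.WriteLine($"Accuracy: {accuracy:P3}"); // 99.999%
        Console.WriteLine($"Recall:   {recall:P3}");   // 0.000%
    }
}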

Regardless, our next step is to generate some code. Click on the “Code” link and you get the opportunity to add new projects. Note that you do not get to choose which of the models you want to use; the Model Builder picks the one with the highest accuracy.

Time to make the donuts.

We now have two new projects.

I made the donuts.

We have a model project and a console application. The model project contains our trained model (as a zip file) and input and output files. Here is the input file:

using Microsoft.ML.Data;

namespace DotNetMachineLearningML.Model.DataModels
{
    public class ModelInput
    {
        [ColumnName("airline_sentiment"), LoadColumn(0)]
        public bool Airline_sentiment { get; set; }

        [ColumnName("text"), LoadColumn(1)]
        public string Text { get; set; }
    }
}

And here is the output file:

using System;
using Microsoft.ML.Data;

namespace DotNetMachineLearningML.Model.DataModels
{
    public class ModelOutput
    {
        // ColumnName attribute is used to change the column name from
        // its default value, which is the name of the field.
        [ColumnName("PredictedLabel")]
        public bool Prediction { get; set; }

        public float Score { get; set; }
    }
}

The console application contains a model builder and a Program file. The ModelBuilder C# file allows us to modify or re-train our model should we desire, but is not needed for regular operation. The console application grabs a row at random from our tweet set and shows the prediction versus actual values. Instead of doing that, I’m going to take user input to classify a tweet. Replace the main method with this code:

static void Main(string[] args)
{
	MLContext mlContext = new MLContext();

	ITransformer mlModel = mlContext.Model.Load(GetAbsolutePath(MODEL_FILEPATH), out DataViewSchema inputSchema);
	var predEngine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);

	Console.WriteLine("Enter a sample tweet:");
	var text = Console.ReadLine();

	// Create sample data to do a single prediction with it 
	ModelInput sampleData = new ModelInput { Text = text };

	// Try a single prediction
	ModelOutput predictionResult = predEngine.Predict(sampleData);
	string outcome = predictionResult.Prediction ? "Positive tweet" : "Negative tweet";
	Console.WriteLine($"Single Prediction --> Predicted value: {outcome}");

	Console.WriteLine("=============== End of process, hit any key to finish ===============");
	Console.ReadKey();
}

Change the console application to be the startup project and away we go:

Too soon for TWA jokes?

Conclusion

The Model Builder won’t replace an actual machine learning pipeline anytime soon, but it’s a good way to get started with processing clean data when you have no clue which algorithms you should try.

Machine Learning with .NET: Modeling

This is part two in a series on Machine Learning with .NET.

In the first post in this series, I took a look at the “why” behind ML.NET as well as some of its shortcomings in data processing. In this post, I want to look at an area where it does much better: training models.

A Simple Model: Predicting Victory

In this first example, I’m going to put together a small but complete demonstration of a business problem.

In 2018, the Buffalo Bills went 6-10. Previously on 36 Chambers, we learned how much Kelvin Benjamin dragged the team down. Now we’re going to re-learn it but this time in .NET. We will solve a classification problem with two cases: win or loss.

Data Preparation

Our input features include the starting quarterback, location (home or away), number of points scored, top receiver (by yards), top rusher (by yards), team number of sacks, team number of defensive turnovers, team minutes of possession, and the outcome (our label).

We can represent all of this in a class, RawInput:

public class RawInput
{
	//Game,QB,HomeAway,NumPointsScored,TopReceiver,TopRunner,NumSacks,NumDefTurnovers,MinutesPossession,Outcome
	//1,Peterman,Away,3,Zay Jones,Marcus Murphy,1,1,25,Loss

	[LoadColumn(0)]
	public float Game { get; set; }
	[LoadColumn(1)]
	public string Quarterback { get; set; }
	[LoadColumn(2)]
	public string Location { get; set; }
	[LoadColumn(3)]
	public float NumberOfPointsScored { get; set; }
	[LoadColumn(4)]
	public string TopReceiver { get; set; }
	[LoadColumn(5)]
	public string TopRunner { get; set; }
	[LoadColumn(6)]
	public float NumberOfSacks { get; set; }
	[LoadColumn(7)]
	public float NumberOfDefensiveTurnovers { get; set; }
	[LoadColumn(8)]
	public float MinutesPossession { get; set; }
	[LoadColumn(9)]
	[ColumnName("Label")]
	public string Outcome { get; set; }
}

There are a couple of points I want to make here:

  • Each property receives a LoadColumn attribute which represents the order in which we load columns.
  • Every numeric feature must be a float. Even integers.
  • We need to specify the label with its own column name.

With a class in place, I can create a quick function to load my raw data into an IDataView of RawInput rows:

public IDataView GetRawData(MLContext mlContext, string inputPath)
{
	return mlContext.Data.LoadFromTextFile<RawInput>(path: inputPath, hasHeader: true, separatorChar: ',');
}

The IDataView interface is our .NET version of the DataFrame in R or Pandas. The good news here is that just by creating a POCO with a few attributes, I can interact with ML.NET. Right now, loading from text files is the primary data load scenario, but I could see hitting SQL Server or other ODBC sources as well as Excel files, etc. in the future.
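
As a quick sanity check on what loaded, we can materialize a few rows back into RawInput objects. This is just a sketch (assuming the usual Microsoft.ML and System.Linq usings, and borrowing the CSV path we’ll see again later); CreateEnumerable() also comes back when we inspect the train/test split:

MLContext mlContext = new MLContext();
IDataView data = mlContext.Data.LoadFromTextFile<RawInput>(path: "Resources\\2018Bills.csv", hasHeader: true, separatorChar: ',');

// Pull the first few rows back out as strongly typed objects and eyeball them.
var firstRows = mlContext.Data.CreateEnumerable<RawInput>(data, reuseRowObject: false).Take(3);
foreach (var row in firstRows)
{
	Console.WriteLine($"Game {row.Game}: {row.Quarterback} ({row.Location}) -> {row.Outcome}");
}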

Build a Trainer

My next function trains a model. We’re going to use Naive Bayes here as well, just to keep consistent with the prior blog post.

Here are the transformations I’d like to do before feeding in my data:

  1. Translate quarterback name based on a simple rule: Josh Allen maps to Josh Allen and every other QB maps to Nate Barkerson, the man of a million faces interceptions.
  2. Translate number of points scored based on a simple rule: if they scored double digits, return true; otherwise, return false.
  3. Drop the columns for the number of sacks, number of defensive turnovers, and number of minutes of possession. These columns are probably useful but we aren’t going to use them in this Naive Bayes model.
  4. Drop the Game feature, which represents the game number. We don’t need it.

Now if you’ll allow me a rant.

Code Plus a Rant

In order to perform operation #1, I need to perform a custom mapping using mlContext.Transforms.CustomMapping. My rule is exceedingly simple; here it is in C# lambda expression form: name => name == "Josh Allen" ? "Josh Allen" : "Nate Barkerson". Real easy…except it’s not.

See, first I need to build input and output classes for my custom mapping, so it’s really mlContext.Transforms.CustomMapping<QBInputRow, QBOutputRow>. I can’t use a simple type here, either: it has to be a class.

So let’s create some classes:

public class QBInputRow
{
	public string Quarterback { get; set; }
}

public class QBOutputRow
{
	public string QuarterbackName { get; set; }
}

Okay, now that I have classes, I need to put in that lambda. I guess the lambda could change to qb => qb.Quarterback == "Josh Allen" ? "Josh Allen" : "Nate Barkerson" and that’d work except for one itsy-bitsy thing: if I do it the easy way, I can’t actually save and reload my model. Which makes it worthless for pretty much any real-world scenario.

So no easy lambda-based solution for us. Instead, we need a delegate. That’s going to be another class with a static method and a GetMapping() action:

[CustomMappingFactoryAttribute(nameof(QBCustomMappings.QBMapping))]
public class QBCustomMappings : CustomMappingFactory<QBInputRow, QBOutputRow>
{
	// This is the custom mapping. We now separate it into a method, so that we can use it both in training and in loading.
	public static void QBMapping(QBInputRow input, QBOutputRow output) => output.QuarterbackName =
		(input.Quarterback == "Josh Allen") ? "Josh Allen" : "Nate Barkerson";

	// This factory method will be called when loading the model to get the mapping operation.
	public override Action<QBInputRow, QBOutputRow> GetMapping()
	{
		return QBMapping;
	}
}

After creating the QBMapping() function, I can finally reference it: mlContext.Transforms.CustomMapping(QBCustomMappings.QBMapping, nameof(QBCustomMappings.QBMapping)). I need to create three separate classes to do a simple mapping. Oh, and three more classes to map my points scored. That’s six classes I would never have had to create in R or Python.

That’s a lot of boilerplate code for what is, in my mind, a simple transformation. This leads me to advise against using custom transformations if you can avoid them. Instead, do all of your transformations before loading the data, but I think that means you can’t use the easy data load method I showed above (though I could be wrong on that score).
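
For the record, the points-scored mapping follows exactly the same pattern; here is a sketch of those three classes (the output property name is arbitrary, so I’m calling it DoubleDigitPoints here):

public class PointsInputRow
{
	public float NumberOfPointsScored { get; set; }
}

public class PointsOutputRow
{
	public bool DoubleDigitPoints { get; set; }
}

[CustomMappingFactoryAttribute(nameof(PointsCustomMappings.PointsMapping))]
public class PointsCustomMappings : CustomMappingFactory<PointsInputRow, PointsOutputRow>
{
	// Rule #2 from above: double-digit scoring maps to true, anything lower maps to false.
	public static void PointsMapping(PointsInputRow input, PointsOutputRow output) =>
		output.DoubleDigitPoints = input.NumberOfPointsScored >= 10;

	// This factory method is called when loading the model to get the mapping operation.
	public override Action<PointsInputRow, PointsOutputRow> GetMapping()
	{
		return PointsMapping;
	}
}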

Rant over. Now that I have my mapping classes all built out, my training method looks like this:

public TransformerChain<Microsoft.ML.Transforms.KeyToValueMappingTransformer> TrainModel(
	MLContext mlContext, IDataView data)
{
	var pipeline =
		mlContext.Transforms.CustomMapping<QBInputRow, QBOutputRow>(
			QBCustomMappings.QBMapping, nameof(QBCustomMappings.QBMapping))
		.Append(mlContext.Transforms.CustomMapping<PointsInputRow, PointsOutputRow>(
			PointsCustomMappings.PointsMapping, nameof(PointsCustomMappings.PointsMapping)))
		// We could potentially use these features for a different model like a fast forest.
		.Append(mlContext.Transforms.DropColumns(new[] { "NumberOfSacks", "NumberOfDefensiveTurnovers",
			"MinutesPossession" }))
		.Append(mlContext.Transforms.DropColumns(new[] { "Game", "Quarterback" }))
		.Append(mlContext.Transforms.Concatenate("FeaturesText", new[]
		{
			"QuarterbackName",
			"Location",
			"TopReceiver",
			"TopRunner"
		}))
		.Append(mlContext.Transforms.Text.FeaturizeText("Features", "FeaturesText"))
		// Label is text so it needs to be mapped to a key
		.Append(mlContext.Transforms.Conversion.MapValueToKey("Label"), TransformerScope.TrainTest)
		.Append(mlContext.MulticlassClassification.Trainers.NaiveBayes(labelColumnName: "Label", featureColumnName: "Features"))
		.Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedOutcome", "PredictedLabel"));

	var model = pipeline.Fit(data);

	return model;
}

I’m building out a data pipeline here, which performs transformations in a series, using the Append() method to link parts together similar to |> in F# or %>% in R. It’s not nearly as pretty as either of those solutions, but it’s the best we’re getting with C#.

Our first two operations are the data transformations to get our QB name and “did they score double-digit points?” features. After that, we drop unused features using the DropColumns() method.

The next part deserves a bit of discussion. With ML.NET, we’re only allowed to send in one text column, so we need to combine together all of our string features and “featurize” them. The combination of Concatenate() and FeaturizeText() does this for us.

After we finish that part of the job, we need to turn our “Win” and “Loss” values into key-value mappings. ML.NET needs keys for binary and multi-class classification, as it will not train directly on string labels. We want to keep the labels so we understand which class we’re in, so we compromise by using the MapValueToKey() method.

Then, we want to train using the Naive Bayes algorithm. ML.NET classifies Naive Bayes as a multi-class classifier and not a binary classifier, so we need to use the multi-class set even though our data set has only wins and losses. Finally, after we get back a class key, we need to map that key back to a value and return it. This way, we know our class name.

Finally, we fit the model to our data and return the fitted model.
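
As an aside, here is a rough sketch of the save-and-reload step that makes all of the delegate ceremony above worthwhile (the file name is arbitrary; the key bit is registering the assembly containing the custom mapping factories before loading):

// Save the fitted model along with the input schema.
mlContext.Model.Save(model, data.Schema, "BillsModel.zip");

// In the consuming process, register the assembly holding our CustomMappingFactory
// classes so ML.NET can resolve QBMapping and PointsMapping, then load the model.
MLContext loadContext = new MLContext();
loadContext.ComponentCatalog.RegisterAssembly(typeof(QBCustomMappings).Assembly);
ITransformer reloadedModel = loadContext.Model.Load("BillsModel.zip", out DataViewSchema inputSchema);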

Training and Evaluating the Model

The actual process of training the model has us retrieve data, split it into training and test data sets, and perform model training. Here is an example:

MLContext mlContext = new MLContext(seed: 9997);
BillsModelTrainer bmt = new BillsModelTrainer();

var data = bmt.GetRawData(mlContext, "Resources\\2018Bills.csv");
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.4);

// If we wish to review the split data, we can run these.
var trainSet = mlContext.Data.CreateEnumerable<RawInput>(split.TrainSet, reuseRowObject: false);
var testSet = mlContext.Data.CreateEnumerable<RawInput>(split.TestSet, reuseRowObject: false);

ITransformer model = bmt.TrainModel(mlContext, split.TrainSet);
var metrics = mlContext.MulticlassClassification.Evaluate(model.Transform(split.TestSet));

Console.WriteLine($"Macro Accuracy = {metrics.MacroAccuracy}; Micro Accuracy = {metrics.MicroAccuracy}");
Console.WriteLine($"Confusion Matrix with {metrics.ConfusionMatrix.NumberOfClasses} classes.");
Console.WriteLine($"{metrics.ConfusionMatrix.GetFormattedConfusionTable()}");

I also threw in model evaluation here because it’s pretty easy to do. We generate an ML context, load our data, and then split it into test and training data. Interestingly, I set the test fraction to 0.4 (or 40%) but it only pulled 25% of my data. I imagine that with a larger data set, I’d see closer to 40% reserved for testing but it’s luck of the draw with just 16 rows. By the way, never trust a model with 12 data points.

Speaking of models, we run the TrainModel() method and get back a model. From there, I can evaluate the model using the Evaluate() method and get back some metrics. For multi-class classification problems, I get back micro accuracy, macro accuracy, and a confusion matrix.

Macro Accuracy = 1; Micro Accuracy = 1
 Confusion Matrix with 2 classes.
 Confusion table
           ||======================
 PREDICTED ||     Loss |      Win | Recall
 TRUTH     ||======================
      Loss ||        3 |        0 | 1.0000
       Win ||        0 |        1 | 1.0000
           ||======================
 Precision ||   1.0000 |   1.0000 |

Oh, I had a 100% correct rate for my test data. Like I said, don’t trust models based off of 12 data points and don’t trust evaluations with 4 data points.

Model Changes

If I want to change the model I use for training, I can change my TrainModel() method. For multi-class classification, we have about a half-dozen models from which to choose:

Six models, ah ah ah

These models have a few trade-offs, including computational complexity, accuracy, and assumptions regarding the shape of data. Investigate and choose based on your problem and data, but don’t assume every one solves everything equally well.
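
Swapping algorithms is a one-line change in the TrainModel() pipeline. As a sketch (using SDCA maximum entropy purely as an illustration), replace the Naive Bayes trainer with a different multi-class trainer:

// Instead of:
//   .Append(mlContext.MulticlassClassification.Trainers.NaiveBayes(labelColumnName: "Label", featureColumnName: "Features"))
// use, for example:
.Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(labelColumnName: "Label", featureColumnName: "Features"))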

Cross-Validation

One last thing I want to point out is cross-validation. Doing this with ML.NET is really easy, but I need to get the pipeline out of TrainModel(). If I want to use cross-validation in production, I’d probably have one method which returns the pipeline and a second method which takes a pipeline and training data and generates a model for me. For now, here’s the cross-validation part specifically:

var cvResults = mlContext.MulticlassClassification.CrossValidate(data, pipeline, numberOfFolds: 4);

var microAccuracies = cvResults.Select(r => r.Metrics.MicroAccuracy);
Console.WriteLine(microAccuracies.Average());

I built out four folds, so we train on 12 games and test on 4 games. The average micro-accuracy is .64 or 64%, which is about what I expected. It’s not a great accuracy, but then again, it’s 12 data points.

Conclusion

In today’s post, we looked at training, testing, and evaluating models. I think that overall, this is a reasonably good experience if you have clean data. As soon as you want to perform non-standard transformations in the data pipeline, though, things get busy fast, in a way that we don’t typically see in R or Python.

In the next post in the series, I’ll show a completely different method for building models: the Model Builder.

Machine Learning with .NET: Premise

This is part one in a series on Machine Learning with .NET.

Microsoft has officially released ML.NET 1.0. The idea behind ML.NET is to bring some of the data science techniques and algorithms we use in R and Python over to C# and F#. Over the course of this series, we will look at a few examples, but in this first post, I’d like to cover some of the reasoning for why you might want to use it.

An Early Digression: Documentation

For a new library, documentation is critical. Ideally, you want to have a series of examples covering the basics, as well as at least a couple of examples covering more advanced topics. That way, your first generation of users will discover and use your product more readily. As they gain skills, they go out and train others, leading to waves of acceptance if all goes well. But if you have poor documentation, your first wave of users may not fully grasp the power of what you’re giving them.

To its credit, ML.NET has good documentation. Their tutorial gets you started with the Model Builder (which we’ll look at in the third post of this series) and they have other step-by-step tutorials which show how to hand-design models. They also have quite a few samples on their ML.NET Samples repo.

The First Question: Why?

I think this is a fair question, so here’s my attempt at answering it from the standpoint of a practitioner with a team familiar with R and Python.

A Long Time Ago in a Server Far Away

If we go back ten or so years, statisticians and data analysts used tools like R, SAS, and Matlab to perform analyses and create models. If you wanted to turn this into production-worthy code, however, you didn’t simply spin up an R or SAS server and let the world bang away at it—those tools were relatively slow, inefficient, and (especially with R at that time) bug-prone.

Instead, data analysts tended more often to generate a model using one of these tools, describe the algorithm and weights to programmers, and the programmers would rewrite the model in a language like C. The upside to doing this was that you could count on the code working and being reasonably fast; the downside is that model changes necessitated code changes (or at least configuration file changes).

Enter the Present Era

Over the past 5-7 years, Python and R have come a long way, or two separate long ways. For Python, I see the turning point as the evolution of scikit-learn in about 2013-2014 leading this language into a space where earlier attempts (such as SciPy) were much less successful. Meanwhile, R always had the algorithms but it wasn’t until around 2013-2014 that we started to see the major improvements in stability and performance needed to take it seriously as a production-worthy language.

Today, both languages are fine for production purposes—for example, my team has production code running in R and in Python, and yet I sleep well at night.

The Burdens of Polyglot Infrastructure

I have to admit, though, that when your company has a bunch of .NET developers and your operations people are all used to working with .NET, the struggle is real when I tell them that we’re going to use R and Python. For more risk-averse companies, it might be okay to use R or Python for personal development work, but when it comes time to push to production, they want nothing but C# code.

If that’s the scenario you’re in, ML.NET can be useful. This way, you can build the entire pipeline in C#, integrate it easily with your existing C# code, and maintain it in C#.

Installation

Installation of ML.NET is pretty simple: there is a NuGet package. I have Microsoft.ML and Microsoft.ML.DataView installed in my solution.

Two important NuGet packages.

You will also want to download and install the Microsoft ML.NET Model Builder. As of the time of this post (May 28th, 2019), the Model Builder is in a public preview state.

Data Analysis

Let’s get a quick reminder of the Microsoft Team Data Science Process. If you’d like a fuller picture, try out this post, which is part 1 of my series on launching a data science project.

The Team Data Science Process Lifecycle (Source)

The first two phases of the lifecycle are business understanding and data acquisition & understanding. Frankly, .NET (ML.NET included) is pretty awful at both of those phases of the lifecycle.

I don’t dock it too many points for not being good at business understanding—R and Python aren’t any good at that step of the process, either. Instead, Excel, OneNote, and pen & paper are going to be some of your most valuable tools here.

Where I think it really falls short, however, is in the data analysis phase. R and Python are excellent for data analysis for a few reasons:

  1. They both make it easy to load data of various shapes and origins (Excel file, flat file, SQL database, API, etc.).
  2. They both make it easy to perform statistical analysis on data sets and get results back in a reasonable time frame. Statistical analysis can be something as trivial as a five-number summary and can scale up to more complicated statistical analyses.
  3. They both make it easy to transform and reshape data. If I have to define classes, I’m working too hard.
  4. They both make it easy to visualize data, whether that’s ggplot2, plotly, bokeh, matplotlib, or whatever.
  5. They both make it easy to explore data. In an R console, I can use functions like head() to grab the first couple of rows, see what they look like, and make sure that I get what I’m expecting. If I’m using R Studio, I get a built-in data frame viewer.
  6. They both have enormous libraries of statistical and analytical functionality developed over the course of decades.

On the C# side, here’s my argument:

  1. .NET has plenty of functionality around loading data from numerous sources. I don’t think it’s necessarily easy, particularly when you’re dealing with huge files with hundreds or thousands of columns. Do you seriously want me to create a class with all of those members just to perform exploratory analysis?
  2. .NET has Math.NET. I don’t think it’s as easy as what’s available in R and Python, but it’s solid.
  3. I have to define classes, so I’m working too hard.
  4. I guess I can use the Chart class with C#, but I don’t think it’s easy, particularly for throwaway stuff.
  5. Data exploration is a weak point, even with ML.NET. If I just want to see a few rows, I suppose I could build a unit test or console app, but that’s a lot of overkill. There is a C# Interactive which tries to mitigate some of the pain.
  6. Without castigating the work the Math.NET, Accord.NET, and ML.NET teams have done, C# is going to take the L here.

When it comes to F#, the data analysis story is a little better:

  1. Type providers make it a good bit easier to work with data without the expectation that I’m creating classes on my own. Record types are good here. I’d rate this as pretty solid once you get used to type providers.
  2. Same as the C# answer, so pretty solid.
  3. F# has its advantages, particularly around a very strict type system, though I think that strict type system slows down exploratory work.
  4. FSharp.Charting is not bad, but it’s several rungs below the libraries I listed for R and Python. I haven’t tried XPlot yet, so maybe that will end up contradicting my gripe-fest here.
  5. F# does have a good REPL and you can create fsx scripts easily, so I give it credit there. I still think it feels slower exploring F# data sets than R or Python data sets. For example, I don’t know of an easy way to display a quick view of a data set like what we have in R Studio or even base R when running head().
  6. F# won’t add much to the table on this point.

In short, you can struggle through but there are much better experiences. I’m open to correction on the above points from people who spend considerably more time working with data science in the .NET space than I do.

Conclusion

In today’s post, I walked through some of the reasoning for ML.NET and looked at the area where it is weakest: data analysis. In the next post, we will look at an area where ML.NET is considerably stronger: data modeling.

Upcoming Events: TriPASS Data Science Group

Key Details

What: Triangle SQL Server User Group: Data Science SIG
Where: 3100 Tower Boulevard, Durham, NC
When: Tuesday, May 28th at 6 PM
Admission is free. Register on Meetup.

What I’m Presenting

06:00 PM — 08:00 PM — A Brief Look at ML.NET

This will include my example from my Machine Learning with .NET talk as well as a second quick example. This is a hands-on lab, so I have a bit more work than usual to do, as I’m going to have to build the lab guide.

PolyBase on Linux with CTP 2.5

The SQL Server team continues to make my day. The latest reason for joy (but also consternation because who has time for all of this?) is that PolyBase is now available on Linux as of SQL Server 2019 CTP 2.5.

I decided to give this a shot by building a container. Walk with me here.

Step One: SQL on Linux

I have Docker for Windows installed in this case but it makes absolutely no difference whether you’re using Windows, Mac, or Linux. Let’s grab our SQL on Linux container.

Before we get started, you want to have at least 4 GB of RAM available to your container. Here’s me giving my container 8 GB of RAM because I am a generous soul:

I kept telling my wife that putting 64 GB of RAM on that desktop would eventually pay off.

If you already have Docker installed, here is a sample:

docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=PolyBaseAllDay!!!11" -p 51433:1433 -d mcr.microsoft.com/mssql/server:2019-CTP2.5-ubuntu

You probably want a better sa password than this, but whatever. I’m also using port 51433 on my host rather than 1433 because I already have SQL Server 2017 installed on this machine and it is listening on port 1433.

One other thing: I’m using double-quotes rather than single-quotes. You’ll want to use double-quotes for cmd.exe and PowerShell. You can get away with single quotes on Linux.

By the way, if you are on Windows and get the following error, it means you’re on Windows containers and need to switch to Linux containers:

A Windows container? What manner of ill is this man performing on his machine?

It’ll take a little while for everything to download, so you might need to occupy yourself in some manner in the meantime. Maybe take up a hobby like creating a Markov chain generator for generating postmodern blog posts. Warning: if you get it wrong, one of the articles might actually be readable and that’s how they’ll know you’re a fake.

Let’s pop open Azure Data Studio just to make sure everything loaded properly:

Can’t afford three commas; only in the one-comma club.

Now that we’re set up, I will create a Scratch database as is my wont.

CREATE DATABASE [Scratch]
GO

Step Two: Software on SQL on Linux

Now that we have SQL Server on Linux installed, we can begin to install PolyBase. There are some instructions here but because we started with the Docker image, we’ll need to do a little bit of prep work. Let’s get our shell on.

First, run docker ps to figure out your container ID. Mine is 818623137e9f. From there, run the following command, replacing the container ID with a reasonable facsimile of yours.

docker exec -it 818 /bin/bash

Do it correctly and you’ll land at a bash root prompt. Understand that with root power comes disastrous responsibility.

Me at a root prompt. Bonus point: Cities:Skylines has totally ruined my appreciation of SimCity 2000.

We are going to need to set up a few things. First, grab the Microsoft keys for apt:

wget -qO- https://packages.microsoft.com/keys/microsoft.asc | apt-key add -

If you’re copying from the Microsoft install instructions, remember that you don’t need sudo where you’re going.

Next, we need to add the SQL Server 2019 preview repository. Before we can run add-apt-repository, we need to install some tooling. Run the following commands in order:

apt-get update && apt-get upgrade
apt-get install -y software-properties-common
add-apt-repository "$(wget -qO- https://packages.microsoft.com/config/ubuntu/16.04/mssql-server-preview.list)"
apt-get update && apt-get upgrade

Yes, there is a second run of apt-get update && apt-get upgrade there. We added a new repository and want to make sure Ubuntu knows to look there.

Step Three: PolyBase on SQL on Linux

We’re close now. Just a few dozen more steps. At this point, you should still be in your bash shell, Master and Commander of the container 818623137e9f (or whatever your container is named). Run the next command with great purpose, for it will smell cowardice and indecision and punish you accordingly:

apt-get install mssql-server-polybase -y

I put the -y at the end to ensure that the machine knows I mean business.

Once I am done here, I relinquish command of the vessel by entering exit and closing my shell, for I have hit the most Microsofty of output messages:

At least I don’t need to reboot, amirite?

Step Four: the Great Sleep

Now that I am back at my DOS prompt, my container awaits its next command. It shall make the ultimate sacrifice for me: I will stop its existence. [I hereby claim my bonus points for making the appropriate distinction between “shall” versus “will” in this post.]

docker stop 818

But wait! Before it has a chance to dream of boxy sheep, I enlist its assistance once more and cause it to spring from its own ashes. Also, I guess I burned it in the in-between but no matter, for we have work to do!

docker restart 818

Now that it is awakened and has returned with a passion (and a new paint job to cover the char), we connect directly to the Scratch database and enable PolyBase:

EXEC sp_configure 'polybase enabled', 1
GO
RECONFIGURE
GO

And the people rejoice.

Step Five: the Wrapup

I won’t have any demonstrations of PolyBase because I’ve done this a few times already, but it does work for the PolyBase V2 sources: SQL Server, Oracle, MongoDB, Cosmos DB, Teradata, and ODBC connections.

Coda: the Limitations

There are some limitations around PolyBase for SQL Server on Linux. Here are a couple that I’ve seen in CTP 2.5:

  • PolyBase on Linux does not support scale-out clusters at this time.
  • PolyBase on Linux does not support connections to Hadoop or Azure Blob Storage at this time. If you do try to set up an external table, you will get the following error:

Operation PolyBase Garden: a Remote Java Bridge too far.

I don’t know which (if any) are just because this is the first iteration and which are permanent limitations, but keep in mind that there are differences in functionality here and that some of these differences might disappear in future versions of PolyBase on SQL on Linux.

Course Review: Docker and Kubernetes

This is a review of Nigel Poulton’s Pluralsight course entitled Docker and Kubernetes: the Big Picture.

This is a great introductory course for someone with zero container experience, as Nigel takes us through a quick primer on containers, Docker, and Kubernetes. Nigel brings an energy to this that I appreciate—he has a lot of passion for the topic and wants to share that passion with others. He also keeps things moving: there are 10 chapters but the whole course is less than 2 hours long, so he’s peppering you with topics.

I think one of the best use cases for this course is someone who is in management and needs to gain a high-level understanding of containers. Nigel doesn’t show you how to create containers or even how to install Docker, but he shows you where they fit in the modern infrastructure. The course is also new, so its material is up to date.

Even if you have Docker experience, it’s still a worthwhile investment of time because it helps explain containers in a way that non-technical people can understand.