This is part one in a series on Machine Learning with .NET.

Microsoft has officially released ML.NET 1.0. The idea behind ML.NET is to bring some of the data science techniques and algorithms we use in R and Python over to C# and F#. Over the course of this series, we will look at a few examples, but in this first post, I’d like to cover some of the reasoning for why you might want to use it.

An Early Digression: Documentation

For a new library, documentation is critical. Ideally, you want to have a series of examples covering the basics, as well as at least a couple of examples covering more advanced topics. That way, your first generation of users will discover and use your product more readily. As they gain skills, they go out and train others, leading to waves of acceptance if all goes well. But if you have poor documentation, your first wave of users may not fully grasp the power of what you’re giving them.

To its credit, ML.NET has good documentation. The tutorial gets you started with the Model Builder (which we’ll look at in the third post of this series), and other step-by-step tutorials show how to hand-design models. There are also quite a few samples in the ML.NET Samples repo.

The First Question: Why?

I think this is a fair question, so here’s my attempt at answering it from the standpoint of a practitioner with a team familiar with R and Python.

A Long Time Ago in a Server Far Away

If we go back ten or so years, statisticians and data analysts used tools like R, SAS, and Matlab to perform analyses and create models. If you wanted to turn this into production-worthy code, however, you didn’t simply spin up an R or SAS server and let the world bang away at it—those tools were relatively slow, inefficient, and (especially with R at that time) bug-prone.

Instead, data analysts more often generated a model using one of these tools and described the algorithm and weights to programmers, who would then rewrite the model in a language like C. The upside to doing this was that you could count on the code working and being reasonably fast; the downside was that model changes necessitated code changes (or at least configuration file changes).

Enter the Present Era

Over the past 5-7 years, Python and R have come a long way, or rather two separate long ways. For Python, I see the turning point as the evolution of scikit-learn around 2013-2014, which led the language into a space where earlier attempts (such as SciPy) were much less successful. Meanwhile, R always had the algorithms, but it wasn’t until around 2013-2014 that we started to see the major improvements in stability and performance needed to take it seriously as a production-worthy language.

Today, both languages are fine for production purposes—for example, my team has production code running in R and in Python, and yet I sleep well at night.

The Burdens of Polyglot Infrastructure

I have to admit, though, that when your company has a bunch of .NET developers and your operations people are all used to working with .NET, the struggle is real when I tell them that we’re going to use R and Python. For more risk-averse companies, it might be okay to use R or Python for personal development, but when it comes time to move to production, they want nothing but C# code.

If that’s the scenario you’re in, ML.NET can be useful. This way, you can build the entire pipeline in C#, integrate it easily with your existing C# code, and maintain it in C#.

Installation

Installation of ML.NET is pretty simple: there is a NuGet package. I have Microsoft.ML and Microsoft.ML.DataView installed in my solution.

Two important NuGet packages.

You will also want to download and install the Microsoft ML.NET Model Builder. As of the time of this post (May 28th, 2019), the Model Builder is in a public preview state.

Data Analysis

Let’s get a quick reminder of the Microsoft Team Data Science Process. If you’d like a fuller picture, try out this post, which is part 1 of my series on launching a data science project.

The Team Data Science Project Lifecycle (Source)

The first two phases of the lifecycle are business understanding and data acquisition & understanding. Frankly, .NET (ML.NET included) is pretty awful at both of those phases of the lifecycle.

I don’t dock it too many points for not being good at business understanding—R and Python aren’t any good at that step of the process, either. Instead, Excel, OneNote, and pen & paper are going to be some of your most valuable tools here.

Where I think it really falls short, however, is in the data analysis phase. R and Python are excellent for data analysis for a few reasons:

  1. They both make it easy to load data of various shapes and origins (Excel file, flat file, SQL database, API, etc.).
  2. They both make it easy to perform statistical analysis on data sets and get results back in a reasonable time frame. Statistical analysis can be something as trivial as a five-number summary and can scale up to more complicated statistical analyses.
  3. They both make it easy to transform and reshape data. If I have to define classes, I’m working too hard.
  4. They both make it easy to visualize data, whether that’s ggplot2, plotly, bokeh, matplotlib, or whatever.
  5. They both make it easy to explore data. In an R console, I can use functions like head() to grab the first few rows, see what they look like, and make sure that I get what I’m expecting. If I’m using RStudio, I get a built-in data frame viewer.
  6. They both have enormous libraries of statistical and analytical functionality developed over the course of decades.
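
As a rough illustration of points 1, 2, 3, and 5, here is a minimal Python sketch using only the standard library. The inline data, the column names, and the helper names load_rows, head, and five_number_summary are all made up for this example; in practice you would more likely reach for a data frame library. The thing to notice is that nothing here requires defining a class up front.

```python
import csv
import io
import statistics

# Hypothetical inline data standing in for a flat file; the columns are made up.
RAW = """city,temp_c,humidity
Raleigh,21.5,60
Durham,19.0,72
Cary,23.5,55
Apex,18.0,80
Garner,25.0,50
"""


def load_rows(text):
    """Read CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def head(rows, n=2):
    """Return the first n rows, in the spirit of R's head()."""
    return rows[:n]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    # method="inclusive" treats the data as the full population, so the
    # quartiles are interpolated between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


rows = load_rows(RAW)
temps = [float(row["temp_c"]) for row in rows]

print(head(rows))                  # quick look at the first couple of rows
print(five_number_summary(temps))  # (18.0, 19.0, 21.5, 23.5, 25.0)
```

Every value comes back from the CSV reader as a string, so the one explicit step is converting temp_c to float before summarizing it.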

On the C# side, here’s my argument:

  1. .NET has plenty of functionality around loading data from numerous sources. I don’t think it’s necessarily easy, particularly when you’re dealing with huge files with hundreds or thousands of columns. Do you seriously want me to create a class with all of those members just to perform exploratory analysis?
  2. .NET has Math.NET. I don’t think it’s as easy as what’s available in R and Python, but it’s solid.
  3. I have to define classes, so I’m working too hard.
  4. I guess I can use the Chart class with C#, but I don’t think it’s easy, particularly for throwaway stuff.
  5. Data exploration is a weak point, even with ML.NET. If I just want to see a few rows, I suppose I could build a unit test or console app, but that’s overkill. There is C# Interactive, which tries to mitigate some of the pain.
  6. Without castigating the work the Math.NET, Accord.NET, and ML.NET teams have done, C# is going to take the L here.

When it comes to F#, the data analysis story is a little better:

  1. Type providers make it a good bit easier to work with data without the expectation that I’m creating classes on my own. Record types are good here. I’d rate this as pretty solid once you get used to type providers.
  2. Same as the C# answer, so pretty solid.
  3. F# has its advantages, particularly around a very strict type system. I think that strict type system slows down exploratory work and
  4. FSharp.Charting is not bad, but it’s several rungs below the libraries I listed for R and Python. I haven’t tried XPlot yet, so maybe that will end up contradicting my gripe-fest here.
  5. F# does have a good REPL and you can create fsx scripts easily, so I give it credit there. I still think exploring data sets feels slower in F# than in R or Python. For example, I don’t know of an easy way to display a quick view of a data set like what we have in RStudio or even base R when running head().
  6. F# won’t add much to the table on this point.

In short, you can struggle through but there are much better experiences. I’m open to correction on the above points from people who spend considerably more time working with data science in the .NET space than I do.

Conclusion

In today’s post, I walked through some of the reasoning for ML.NET and looked at the area where it is weakest: data analysis. In the next post, we will look at an area where ML.NET is considerably stronger: data modeling.
