What: SQL Saturday Dallas Where: UT Dallas, Naveen Jindal School of Management, 800 West Campbell Road, Richardson, Texas When: Saturday, June 1st, all day. Admission is free. Register on the SQL Saturday website.
This is a brand new talk. I’m a little surprised (pleasantly so) that they selected this talk, as I think database developers can gain a lot from some experience working with functional programming languages.
Microsoft has officially released ML.NET 1.0. The idea behind ML.NET is to bring some of the data science techniques and algorithms we use in R and Python over to C# and F#. Over the course of this series, we will look at a few examples, but in this first post, I’d like to cover some of the reasoning for why you might want to use it.
An Early Digression: Documentation
For a new library, documentation is critical. Ideally, you want to have a series of examples covering the basics, as well as at least a couple of examples covering more advanced topics. That way, your first generation of users will discover and use your product more readily. As they gain skills, they go out and train others, leading to waves of acceptance if all goes well. But if you have poor documentation, your first wave of users may not fully grasp the power of what you’re giving them.
I think this is a fair question, so here’s my attempt at answering it from the standpoint of a practitioner with a team familiar with R and Python.
A Long Time Go in a Server Far Away
If we go back ten or so years, statisticians and data analysts used tools like R, SAS, and Matlab to perform analyses and create models. If you wanted to turn this into production-worthy code, however, you didn’t simply spin up an R or SAS server and let the world bang away at it—those tools were relatively slow, inefficient, and (especially with R at that time) bug-prone.
Instead, data analysts tended more often to generate a model using one of these tools, describe the algorithm and weights to programmers, and the programmers would rewrite the model in a language like C. The upside to doing this was that you could count on the code working and being reasonably fast; the downside is that model changes necessitated code changes (or at least configuration file changes).
Enter the Present Era
Over the past 5-7 years, Python and R have come a long way, or two separate long ways. For Python, I see the turning point as the evolution of scikit-learn in about 2013-2014 leading this language into a space where earlier attempts (such as SciPy) were much less successful. Meanwhile, R always had the algorithms but it wasn’t until around 2013-2014 that we started to see the major improvements in stability and performance needed to take it seriously as a production-worthy language.
Today, both languages are fine for production purposes—for example, my team has production code running in R and in Python, and yet I sleep well at night.
The Burdens of Polyglot Infrastructure
I have to admit, though, that when your company has a bunch of .NET developers and your operations people are all used to working with .NET, the struggle is real when I tell them that we’re going to use R and Python. For more risk-averse companies, it might be okay to use R or Python for personal development, but when it comes time to move out, they want nothing but C# code.
If that’s the scenario you’re in, ML.NET can be useful. This way, you can build the entire pipeline in C#, integrate it easily with your existing C# code, and maintain it in C#.
Installation of ML.NET is pretty simple: there is a NuGet package. I have Microsoft.ML and Microsoft.ML.DataView installed in my solution.
You will also want to download and install the Microsoft ML.NET Model Builder. As of the time of this post (May 28th, 2019), the Model Builder is in a public preview state.
The first two phases of the lifecycle are business understanding and data acquisition & understanding. Frankly, .NET (ML.NET included) is pretty awful at both of those phases of the lifecycle.
I don’t dock it too many points for not being good at business understanding—R and Python aren’t any good at that step of the process, either. Instead, Excel, OneNote, and pen & paper are going to be some of your most valuable tools here.
Where I think it really falls short, however, is in the data analysis phase. R and Python are excellent for data analysis for a few reasons:
They both make it easy to load data of various shapes and origins (Excel file, flat file, SQL database, API, etc.).
They both make it easy to perform statistical analysis on data sets and get results back in a reasonable time frame. Statistical analysis can be something as trivial as a five-number summary and can scale up to more complicated statistical analyses.
They both make it easy to transform and reshape data. If I have to define classes, I’m working too hard.
They both make it easy to explore data. In an R console, I can use functions like head() to grab the first couple of rows, see what they look like, and make sure that I get what I’m expecting. If I’m using R Studio, I get a built-in data frame viewer.
They both have enormous libraries of statistical and analytical functionality developed over the course of decades.
On the C# side, here’s my argument:
.NET has plenty of functionality around loading data from numerous sources. I don’t think it’s necessarily easy, particularly when you’re dealing with huge files with hundreds or thousands of columns. Do you seriously want me to create a class with all of those members just to perform exploratory analysis?
.NET has Math.NET. I don’t think it’s as easy as what’s available in R and Python, but it’s solid.
I have to define classes, so I’m working too hard.
I guess I can use the Chart class with C#, but I don’t think it’s easy, particularly for throwaway stuff.
Data exploration is a weak point, even with ML.NET. If I just want to see a few rows, I suppose I could build a unit test or console app, but that’s a lot of overkill. There is a C# Interactive which tries to mitigate some of the pain.
Without castigating the work the Math.NET, Accord.NET, and ML.NET teams have done, C# is going to take the L here.
When it comes to F#, the data analysis story is a little better:
Type providers make it a good bit easier to work with data without the expectation that I’m creating classes on my own. Record types are good here. I’d rate this as pretty solid once you get used to type providers.
Same as the C# answer, so pretty solid.
F# has its advantages, particularly around a very strict type system. I think that strict type system slows down exploratory work and
FSharp.Charting is not bad, but it’s several rungs below the libraries I listed for R and Python. I haven’t tried XPlot yet, so maybe that will end up contradicting my gripe-fest here.
F# does have a good REPL and you can create fsx scripts easily, so I give it credit there. I still think it feels slower exploring F# data sets than R or Python data sets. For example, I don’t know of an easy way to display a quick view of a data set like what we have in R Studio or even base R when running head().
F# won’t add much to the table on this point.
In short, you can struggle through but there are much better experiences. I’m open to correction on the above points from people who spend considerably more time working with data science in the .NET space than I do.
In today’s post, I walked through some of the reasoning for ML.NET and looked at the area where it is weakest: data analysis. In the next post, we will look at an area where ML.NET is considerably stronger: data modeling.
This will include my example from my Machine Learning with .NET talk as well as a second quick example. This is a hands-on lab, so I have a bit more work than usual to do, as I’m going to have to build the lab guide.
The SQL Server team continues to make my day. The latest reason for joy (but also consternation because who has time for all of this?) is that PolyBase is now available on Linux as of SQL Server 2019 CTP 2.5.
I decided to give this a shot by building a container. Walk with me here.
Before we get started, you want to have at least 4 GB of RAM available to your container. Here’s me giving my container 8 GB of RAM because I am a generous soul:
If you already have Docker installed, here is a sample:
docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=PolyBaseAllDay!!!11" -p 51433:1433 -d mcr.microsoft.com/mssql/server:2019-CTP2.5-ubuntu
You probably want a better sa password than this, but whatever. I’m also using port 51433 on my host rather than 1433 because I have SQL Server 2017 installed already on this machine and is listening on port 1433.
One other thing: I’m using double-quotes rather than single-quotes. You’ll want to use double-quotes for cmd.exe and PowerShell. You can get away with single quotes on Linux.
By the way, if you are on Windows and get the following error, it means you’re on Windows containers and need to switch to Linux containers:
Let’s pop open Azure Data Studio just to make sure everything loaded properly:
Now that we’re set up, I will create a Scratch database as is my wont.
CREATE DATABASE [Scratch]
Step Two: Software on SQL on Linux
Now that we have SQL Server on Linux installed, we can begin to install PolyBase. There are some instructions here but because we started with the Docker image, we’ll need to do a little bit of prep work. Let’s get our shell on.
First, run docker ps to figure out your container ID. Mine is 818623137e9f. From there, run the following command, replacing the container ID with a reasonable facsimile of yours.
docker exec -it 818 /bin/bash
Do it correctly and you’ll land at a bash root prompt. Understand that with root power comes disastrous responsibility.
We are going to need to set up some stuff first. First, grab the Microsoft keys for apt:
Yes, there is a second run of apt-get update && apt-get upgrade there. We added a new repository and wish to make known to Ubuntu that it should look there.
Step Three: PolyBase on SQL on Linux
We’re close now. Just a few dozen more steps. At this point, you should still be in your bash shell, Master and Commander of the container 818623137e9f (or whatever your container is named). Run the next command with great purpose, for it will smell cowardice and indecision and punish you accordingly:
apt-get install mssql-server-polybase -y
I put the -y at the end to ensure that the machine knows I mean business.
Once I am done here, I relinquish command of the vessel by entering exit and closing my shell, for I have hit the most Microsofty of output messages:
Step Four: the Great Sleep
Now that I am back at my DOS prompt, my container awaits its next command. It shall make the ultimate sacrifice for me: I will stop its existence. [I hereby claim my bonus points for making the appropriate distinction between “shall” versus “will” in this post.]
docker stop 818
But wait! Before it has a chance to dream of boxy sheep, I enlist its assistance once more and cause it to spring from its own ashes. Also, I guess I burned it in the in-between but no matter, for we have work to do!
docker restart 818
Now that it is awakened and has returned with a passion (and a new paint job to cover the char), we connect directly to the Scratch database and enable PolyBase:
EXEC sp_configure 'polybase enabled', 1
Step Five: the Wrapup
I won’t have any demonstrations of PolyBase because I’ve done this a few times already, but it does work for the PolyBase V2 sources: SQL Server, Oracle, MongoDB, Cosmos DB, Teradata, and ODBC connections.
Coda: the Limitations
There are some limitations around PolyBase for SQL Server on Linux. Here are a couple that I’ve seen in CTP 2.5:
PolyBase on Linux does not support scale-out clusters at this time.
PolyBase on Linux does not support connections to Hadoop or Azure Blob Storage at this time. If you do try to set up an external table, you will get the following error:
I don’t know which (if any) are just because this is the first iteration and which are permanent limitations, but keep in mind that there are differences in functionality here and that some of these differences might disappear in future versions of PolyBase on SQL on Linux.
This is a great introductory course for someone with zero container experience, as Nigel takes us through a quick primer on containers, Docker, and Kubernetes. Nigel brings an energy to this that I appreciate—he has a lot of passion for the topic and wants to share that passion with others. He also keeps things moving: there are 10 chapters but the whole course is less than 2 hours long, so he’s peppering you with topics.
I think one of the best use cases for this course is someone who is in management and needs to gain a high-level understanding of containers. Nigel doesn’t show you how to create containers or even how to install Docker, but he shows you where they fit in the modern infrastructure. The course is also new, so its material is up to date.
Even if you have Docker experience, it’s still a worthwhile investment of time because it helps explain containers in a way that non-technical people can understand.
One of the things I like to point out in my Launching a Data Science Project talk is that all data is, by its nature, dirty. I figured I could get into that a little bit here and explain why. I’ll pick a few common problems and cover why each is intractable.
Subjective Data Lacks a Common Foundation of Understanding
Any time you ask for or collect subjective data, you assume a common foundation which does not exist. One great example of this is the Likert scale, which is usually a grading on a scale of 1-5 or 1-7 or 1-49 or whatever level of gradation you desire.
We see Likert scales often in surveys: “How enthusiastic are you about having to stand in line two more hours to pick up your tickets?” Then the answers typically range on a scale from Very Unenthusiastic to Unenthusiastic to Meh to Enthusiastic to Very Enthusiastic and people pick one of the five.
But here’s the problem: your “Enthusiastic” and my “Enthusiastic” are not necessarily the same. For a five-point scale it’s not quite as bad, but as the number of points of gradation increases, we’re more likely to find discrepancies. Even on a 10-point scale, my 5 and your 7 could be the same.
Here’s an example where this matters: a buddy of mine purchased a car and the dealer asked him to fill out a manufacturer’s survey. He was happy with the experience and rated the dealer a 9 out of 10 because he doesn’t give out 10s (I believe the phrase was something like “I still had to pay for the car. If they gave it to me for free, I’d give them a 10.”). In a big analysis, one person providing a 9 instead of a 10 doesn’t mean much—it might shift the mean down a thousandth of a point—but the manufacturer’s analysis penalizes dealers who get ratings lower than a 10. The underlying problem here is that the manufacturer is looking for happy customers. They have a happy customer, but due to underlying differences in definitions their system does not recognize him as happy. What’s funny in this is that a simple “Are you happy with your purchase?” would have gotten the answer they wanted without the pseudo-analytical numbering system and avoided the problem altogether.
Suppose you want an analysis of how best to distribute goods among your population. An example of this might be to figure out budgets for different departments. You send out a survey and ask people to provide some sort of numeric score representing their needs. Department A responds on a scale from 1-10. Department B responds on a scale from 1-100. Department C responds on a scale from 999,999 to 1,000,000 because Department C is run by a smart thinker.
Fine. You send out a second survey, one stack ranking each department from A to G, ranking them 1-7 and doling out the budget based on perceived rank.
Well, as the head of Department C, I know that A and B are the big boys and I want their money. Departments F and G are run by paint-huffers and paste-eaters respectively, so nobody’s going to vote for them. Therefore, I will rank in order C, F/G, D/E, B/A. This gets even better if I can convince F and G to go along in my scheme, promising them more dollars for paint and paste if they also vote for me atop their lists and then for each other next. Knowing that D, E, B, and A will rank themselves at the top, our coalition of Trolls and Bozos has just enough push to take a bunch of money.
If your end users potentially receive value (or get penalized) based on the data they send, they will game the system to send the best data possible.
Data Necessarily Abstracts the Particulars of Time and Place
This is probably the most Hayekian slide I’ve ever created in a technical presentation, in no small part because I reference indirectly The Use of Knowledge in Society, an essay from 1945 which critiques many of the pretensions of central planners. A significant part of this essay is the idea that “data” is often subjective and incomplete, even without the two problems I’ve described above. An example Hayek uses is that the price of a particular agricultural commodity has within it implicit information concerning weather conditions, expectations of future yields, and a great deal of information which people might not even be able to articulate, much less explicitly correlate. This includes expectations (which naturally differ from person to person), different weightings of factors, and internalized experiences (which pop up as hunches or feelings).
This essay was key to Hayek eventually winning the Nobel Prize in Economics and holds up quite well today.
But What Does This Mean?
If you are a wanna-be central planner, it means you will fail from the get-go. Most of us aren’t wanna-be central planners, however, so the answer isn’t nearly as bad.
In each of these cases, one of the biggest conclusions we can draw is that we will never explain all of the variance in a system, particularly one which involves humans. People are complex, cranky, contrarian, and possess subtle knowledge you cannot extract as data. The complexities of humans will be a source of natural error which will make your analyses less accurate than if you were dealing with rule-based automatons.
It also means that adopting additional precision for imprecise problems is the wrong way of doing it. If you do use a Likert-type scale, fewer broad options beats many fine options because you’re less likely to run into expectation differences (where my 7 is your 5 and Ted’s 9.14).
Today I’m going to cover something I’ve been waiting to do for just about three years: use PolyBase to connect one SQL Server instance to another SQL Server instance. In this post, we’ll cover the basic setup and a couple points of interest I’ve found in writing this chapter of the book.
Setup: One Less Amigo
Historically, PolyBase has three separate external entities: external data sources, external file formats, and external tables. External data sources tell SQL Server where the remote data is stored. External file formats tell SQL Server what the shape of that data looks like—in other words, CSV, tab-separated, Parquet, ORC, etc. External tables tell SQL Server the structure of some data of a particular external file format at a particular external data source.
With PolyBase V2—connectivity with SQL Server, Cosmos DB, Oracle, Spark, Hive, and a boatload of other external data sources—we no longer need external file formats because we ingest structured data. Therefore, we only need an external data source and an external table. You will need SQL Server 2019 to play along and I’d recommend keeping up on CTPs—PolyBase is under active development so being a CTP behind may mean hitting bugs which have subsequently been fixed.
External Data Source
As I create an external data source, I’m also going to use a database-scoped credential. In this case, I’m going to use SQL authentication though you can also use Windows authentication.
IF NOT EXISTS
FROM sys.database_scoped_credentials dsc
dsc.name = N'SqlWin10Credentials'
CREATE DATABASE SCOPED CREDENTIAL SqlWin10Credentials
WITH IDENTITY = 'PolyBaseUser', Secret = '<<Some Password>>';
IF NOT EXISTS
FROM sys.external_data_sources e
e.name = N'SQLWIN10'
CREATE EXTERNAL DATA SOURCE SQLWIN10 WITH
LOCATION = 'sqlserver://SQLWIN10',
PUSHDOWN = ON,
CREDENTIAL = SqlWin10Credentials
There are a couple of things worth noting here. First, the LOCATION parameter for our external data source is a URI which uses the sqlserver protocol. PolyBase V1 used a configuration setting to determine what the external data source was, but for the new connectors, we define it in the location itself. SQLWIN10 is the name of my SQL Server instance, and I’m going to use SqlWin10Credentials to access that remote box.
Second, we have the option to disable predicate pushdown if we so desire. I’d recommend keeping it on in general and don’t know of a scenario where I’d definitely turn it off.
Here is a simple external table definition:
CREATE EXTERNAL TABLE [dbo].[CityPopulationCenter]
[CityName] [varchar](120) NOT NULL,
[PopulationCenterName] [varchar](30) NOT NULL
LOCATION = 'Scratch.dbo.CityPopulationCenter',
DATA_SOURCE = SQLWIN10
There are a couple points of note here as well. In the LOCATION parameter, we need to use three-part naming: database, schema, table. Our external data source did not specify a database, so we therefore need to specify it here.
Second, the data types need to be compatible with the remote server’s data types. You have a table or view defined on the other side, so that’s pretty simple copy-pasta. Just note that we don’t need (and aren’t allowed to use) any types of constraints, including primary keys, foreign keys, unique keys, default constraints, check constraints, and the like.
The good news is that once you’ve defined this external table, you can directly query it as dbo.CityPopulationCenter and end users won’t know or care that it’s an external table. Note that you do need to open your database master key in sessions which intend to query external tables, as that is how SQL Server can decrypt your password to send over to the remote SQL Server instance.
I’m not going to hit too much more because I’m saving it for the book, but here are a couple interesting things I’ve found while working with PolyBase to SQL Server:
Performance is generally around the same as when working with linked servers. Sometimes it’s better, sometimes it’s worse.
When using scale-out groups, performance tends to be a little better, even though the other side is just a single SQL Server instance.
PolyBase can create an external table to a remote view.
PolyBase can not create an external table to another external table.
Even though it’s still in CTP and I think there are quite a few things Microsoft can do to make it better, I’m happy that direct SQL-to-SQL support is here.