This will include my example from my Machine Learning with .NET talk as well as a second quick example. This is a hands-on lab, so I have a bit more work than usual to do, as I’m going to have to build the lab guide.
The SQL Server team continues to make my day. The latest reason for joy (but also consternation because who has time for all of this?) is that PolyBase is now available on Linux as of SQL Server 2019 CTP 2.5.
I decided to give this a shot by building a container. Walk with me here.
Before we get started, you want to have at least 4 GB of RAM available to your container. Here’s me giving my container 8 GB of RAM because I am a generous soul:
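In Docker Desktop that allocation is a slider under Settings, but if you prefer the command line, a hedged equivalent for an already-running container might look like this (the container ID is the one from later in this post; `docker update` is a sketch, not the only way to do it):

```shell
# Assumption: container ID 818623137e9f, as seen later in the post.
# Raise the container's memory cap (and matching swap cap) to 8 GB.
docker update --memory 8g --memory-swap 8g 818623137e9f
```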
If you already have Docker installed, here is a sample:
docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=PolyBaseAllDay!!!11" -p 51433:1433 -d mcr.microsoft.com/mssql/server:2019-CTP2.5-ubuntu
You probably want a better sa password than this, but whatever. I'm also using port 51433 on my host rather than 1433 because I already have SQL Server 2017 installed on this machine and it is listening on port 1433.
One other thing: I’m using double-quotes rather than single-quotes. You’ll want to use double-quotes for cmd.exe and PowerShell. You can get away with single quotes on Linux.
By the way, if you are on Windows and get the following error, it means you’re on Windows containers and need to switch to Linux containers:
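You can switch from the system tray by right-clicking the Docker whale icon and choosing "Switch to Linux containers…". As a sketch, assuming the default Docker Desktop install path, the same switch from PowerShell looks something like:

```powershell
# Assumption: Docker Desktop installed in the default location on Windows.
& 'C:\Program Files\Docker\Docker\DockerCli.exe' -SwitchLinuxEngine
```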
Let’s pop open Azure Data Studio just to make sure everything loaded properly:
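If you want a quick smoke test from the command line before (or instead of) opening Azure Data Studio, something like this should do, using the port and sa password from the `docker run` command above:

```shell
# Connect to the containerized instance on the host port we published (51433).
sqlcmd -S localhost,51433 -U sa -P "PolyBaseAllDay!!!11" -Q "SELECT @@VERSION;"
```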
Now that we’re set up, I will create a Scratch database as is my wont.
CREATE DATABASE [Scratch]
Step Two: Software on SQL on Linux
Now that we have SQL Server on Linux installed, we can begin to install PolyBase. There are some instructions here but because we started with the Docker image, we’ll need to do a little bit of prep work. Let’s get our shell on.
First, run docker ps to figure out your container ID. Mine is 818623137e9f. From there, run the following command, replacing the container ID with a reasonable facsimile of yours.
docker exec -it 818 /bin/bash
Do it correctly and you’ll land at a bash root prompt. Understand that with root power comes disastrous responsibility.
We are going to need to set up some stuff first. First, grab the Microsoft keys for apt:
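The exact commands here follow Microsoft's installation instructions for SQL Server 2019 previews on Ubuntu; treat this as a sketch, as the repository URL (including the Ubuntu version in it) is an assumption:

```shell
# Trust the Microsoft package signing key.
wget -qO- https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
apt-get update && apt-get upgrade -y

# Register the SQL Server preview repository, then update again so apt
# knows to look there.
add-apt-repository "$(wget -qO- https://packages.microsoft.com/config/ubuntu/16.04/mssql-server-preview.list)"
apt-get update && apt-get upgrade -y
```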
Yes, there is a second run of apt-get update && apt-get upgrade there. We added a new repository and wish to make known to Ubuntu that it should look there.
Step Three: PolyBase on SQL on Linux
We’re close now. Just a few dozen more steps. At this point, you should still be in your bash shell, Master and Commander of the container 818623137e9f (or whatever your container is named). Run the next command with great purpose, for it will smell cowardice and indecision and punish you accordingly:
apt-get install mssql-server-polybase -y
I put the -y at the end to ensure that the machine knows I mean business.
Once I am done here, I relinquish command of the vessel by entering exit and closing my shell, for I have hit the most Microsofty of output messages:
Step Four: the Great Sleep
Now that I am back at my DOS prompt, my container awaits its next command. It shall make the ultimate sacrifice for me: I will stop its existence. [I hereby claim my bonus points for making the appropriate distinction between “shall” versus “will” in this post.]
docker stop 818
But wait! Before it has a chance to dream of boxy sheep, I enlist its assistance once more and cause it to spring from its own ashes. Also, I guess I burned it in the in-between but no matter, for we have work to do!
docker restart 818
Now that it is awakened and has returned with a passion (and a new paint job to cover the char), we connect directly to the Scratch database and enable PolyBase:
EXEC sp_configure 'polybase enabled', 1;
GO
RECONFIGURE;
GO
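As a quick sanity check that the installation itself took, you can ask the instance directly:

```sql
-- Returns 1 if the PolyBase feature is installed on this instance.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;
```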
Step Five: the Wrapup
I won’t have any demonstrations of PolyBase because I’ve done this a few times already, but it does work for the PolyBase V2 sources: SQL Server, Oracle, MongoDB, Cosmos DB, Teradata, and ODBC connections.
Coda: the Limitations
There are some limitations around PolyBase for SQL Server on Linux. Here are a couple that I’ve seen in CTP 2.5:
PolyBase on Linux does not support scale-out clusters at this time.
PolyBase on Linux does not support connections to Hadoop or Azure Blob Storage at this time. If you do try to set up an external table, you will get the following error:
I don’t know which (if any) are just because this is the first iteration and which are permanent limitations, but keep in mind that there are differences in functionality here and that some of these differences might disappear in future versions of PolyBase on SQL on Linux.
This is a great introductory course for someone with zero container experience, as Nigel takes us through a quick primer on containers, Docker, and Kubernetes. Nigel brings an energy to this that I appreciate—he has a lot of passion for the topic and wants to share that passion with others. He also keeps things moving: there are 10 chapters but the whole course is less than 2 hours long, so he’s peppering you with topics.
I think one of the best use cases for this course is someone who is in management and needs to gain a high-level understanding of containers. Nigel doesn’t show you how to create containers or even how to install Docker, but he shows you where they fit in the modern infrastructure. The course is also new, so its material is up to date.
Even if you have Docker experience, it’s still a worthwhile investment of time because it helps explain containers in a way that non-technical people can understand.
One of the things I like to point out in my Launching a Data Science Project talk is that all data is, by its nature, dirty. I figured I could get into that a little bit here and explain why. I’ll pick a few common problems and cover why each is intractable.
Subjective Data Lacks a Common Foundation of Understanding
Any time you ask for or collect subjective data, you assume a common foundation which does not exist. One great example of this is the Likert scale, which is usually a grading on a scale of 1-5 or 1-7 or 1-49 or whatever level of gradation you desire.
We see Likert scales often in surveys: “How enthusiastic are you about having to stand in line two more hours to pick up your tickets?” Then the answers typically range on a scale from Very Unenthusiastic to Unenthusiastic to Meh to Enthusiastic to Very Enthusiastic and people pick one of the five.
But here’s the problem: your “Enthusiastic” and my “Enthusiastic” are not necessarily the same. For a five-point scale it’s not quite as bad, but as the number of points of gradation increases, we’re more likely to find discrepancies. Even on a 10-point scale, my 5 and your 7 could be the same.
Here’s an example where this matters: a buddy of mine purchased a car and the dealer asked him to fill out a manufacturer’s survey. He was happy with the experience and rated the dealer a 9 out of 10 because he doesn’t give out 10s (I believe the phrase was something like “I still had to pay for the car. If they gave it to me for free, I’d give them a 10.”). In a big analysis, one person providing a 9 instead of a 10 doesn’t mean much—it might shift the mean down a thousandth of a point—but the manufacturer’s analysis penalizes dealers who get ratings lower than a 10. The underlying problem here is that the manufacturer is looking for happy customers. They have a happy customer, but due to underlying differences in definitions their system does not recognize him as happy. What’s funny in this is that a simple “Are you happy with your purchase?” would have gotten the answer they wanted without the pseudo-analytical numbering system and avoided the problem altogether.
Suppose you want an analysis of how best to distribute goods among your population. An example of this might be to figure out budgets for different departments. You send out a survey and ask people to provide some sort of numeric score representing their needs. Department A responds on a scale from 1-10. Department B responds on a scale from 1-100. Department C responds on a scale from 999,999 to 1,000,000 because Department C is run by a smart thinker.
Fine. You send out a second survey, one stack ranking each department from A to G, ranking them 1-7 and doling out the budget based on perceived rank.
Well, as the head of Department C, I know that A and B are the big boys and I want their money. Departments F and G are run by paint-huffers and paste-eaters respectively, so nobody’s going to vote for them. Therefore, I will rank in order C, F/G, D/E, B/A. This gets even better if I can convince F and G to go along in my scheme, promising them more dollars for paint and paste if they also vote for me atop their lists and then for each other next. Knowing that D, E, B, and A will rank themselves at the top, our coalition of Trolls and Bozos has just enough push to take a bunch of money.
If your end users potentially receive value (or get penalized) based on the data they send, they will game the system to send the best data possible.
Data Necessarily Abstracts the Particulars of Time and Place
This is probably the most Hayekian slide I’ve ever created in a technical presentation, in no small part because I reference indirectly The Use of Knowledge in Society, an essay from 1945 which critiques many of the pretensions of central planners. A significant part of this essay is the idea that “data” is often subjective and incomplete, even without the two problems I’ve described above. An example Hayek uses is that the price of a particular agricultural commodity has within it implicit information concerning weather conditions, expectations of future yields, and a great deal of information which people might not even be able to articulate, much less explicitly correlate. This includes expectations (which naturally differ from person to person), different weightings of factors, and internalized experiences (which pop up as hunches or feelings).
This essay was key to Hayek eventually winning the Nobel Prize in Economics and holds up quite well today.
But What Does This Mean?
If you are a wanna-be central planner, it means you will fail from the get-go. Most of us aren’t wanna-be central planners, however, so the answer isn’t nearly as bad.
In each of these cases, one of the biggest conclusions we can draw is that we will never explain all of the variance in a system, particularly one which involves humans. People are complex, cranky, contrarian, and possess subtle knowledge you cannot extract as data. The complexities of humans will be a source of natural error which will make your analyses less accurate than if you were dealing with rule-based automatons.
It also means that adopting additional precision for imprecise problems is the wrong approach. If you do use a Likert-type scale, a few broad options beat many fine ones because you're less likely to run into expectation differences (where my 7 is your 5 and Ted's 9.14).
Today I’m going to cover something I’ve been waiting to do for just about three years: use PolyBase to connect one SQL Server instance to another SQL Server instance. In this post, we’ll cover the basic setup and a couple points of interest I’ve found in writing this chapter of the book.
Setup: One Less Amigo
Historically, PolyBase has three separate external entities: external data sources, external file formats, and external tables. External data sources tell SQL Server where the remote data is stored. External file formats tell SQL Server what the shape of that data looks like—in other words, CSV, tab-separated, Parquet, ORC, etc. External tables tell SQL Server the structure of some data of a particular external file format at a particular external data source.
With PolyBase V2—connectivity with SQL Server, Cosmos DB, Oracle, Spark, Hive, and a boatload of other external data sources—we no longer need external file formats because we ingest structured data. Therefore, we only need an external data source and an external table. You will need SQL Server 2019 to play along and I’d recommend keeping up on CTPs—PolyBase is under active development so being a CTP behind may mean hitting bugs which have subsequently been fixed.
External Data Source
As I create an external data source, I’m also going to use a database-scoped credential. In this case, I’m going to use SQL authentication though you can also use Windows authentication.
IF NOT EXISTS
(
    SELECT 1
    FROM sys.database_scoped_credentials dsc
    WHERE
        dsc.name = N'SqlWin10Credentials'
)
BEGIN
    CREATE DATABASE SCOPED CREDENTIAL SqlWin10Credentials
    WITH IDENTITY = 'PolyBaseUser', SECRET = '<<Some Password>>';
END
GO
IF NOT EXISTS
(
    SELECT 1
    FROM sys.external_data_sources e
    WHERE
        e.name = N'SQLWIN10'
)
BEGIN
    CREATE EXTERNAL DATA SOURCE SQLWIN10 WITH
    (
        LOCATION = 'sqlserver://SQLWIN10',
        PUSHDOWN = ON,
        CREDENTIAL = SqlWin10Credentials
    );
END
GO
There are a couple of things worth noting here. First, the LOCATION parameter for our external data source is a URI which uses the sqlserver protocol. PolyBase V1 used a configuration setting to determine what the external data source was, but for the new connectors, we define it in the location itself. SQLWIN10 is the name of my SQL Server instance, and I’m going to use SqlWin10Credentials to access that remote box.
Second, we have the option to disable predicate pushdown if we so desire. I’d recommend keeping it on in general and don’t know of a scenario where I’d definitely turn it off.
Here is a simple external table definition:
CREATE EXTERNAL TABLE [dbo].[CityPopulationCenter]
(
    [CityName] [varchar](120) NOT NULL,
    [PopulationCenterName] [varchar](30) NOT NULL
)
WITH
(
    LOCATION = 'Scratch.dbo.CityPopulationCenter',
    DATA_SOURCE = SQLWIN10
);
There are a couple points of note here as well. In the LOCATION parameter, we need to use three-part naming: database, schema, table. Our external data source did not specify a database, so we therefore need to specify it here.
Second, the data types need to be compatible with the remote server’s data types. You have a table or view defined on the other side, so that’s pretty simple copy-pasta. Just note that we don’t need (and aren’t allowed to use) any types of constraints, including primary keys, foreign keys, unique keys, default constraints, check constraints, and the like.
The good news is that once you’ve defined this external table, you can directly query it as dbo.CityPopulationCenter and end users won’t know or care that it’s an external table. Note that you do need to open your database master key in sessions which intend to query external tables, as that is how SQL Server can decrypt your password to send over to the remote SQL Server instance.
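As a sketch of what that looks like in a query session (the master key password is whatever you chose when creating the key):

```sql
-- Needed in the session so SQL Server can decrypt the credential's secret
-- before reaching out to the remote instance.
OPEN MASTER KEY DECRYPTION BY PASSWORD = '<<Master Key Password>>';

-- End users query the external table like any local table; PolyBase
-- handles the remote hop.
SELECT TOP(10) cpc.CityName, cpc.PopulationCenterName
FROM dbo.CityPopulationCenter cpc;
```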
I’m not going to hit too much more because I’m saving it for the book, but here are a couple interesting things I’ve found while working with PolyBase to SQL Server:
Performance is generally around the same as when working with linked servers. Sometimes it’s better, sometimes it’s worse.
When using scale-out groups, performance tends to be a little better, even though the other side is just a single SQL Server instance.
PolyBase can create an external table to a remote view.
PolyBase cannot create an external table pointing to another external table.
Even though it’s still in CTP and I think there are quite a few things Microsoft can do to make it better, I’m happy that direct SQL-to-SQL support is here.
I gave this once as a “beta” talk and have tweaked it a little bit since then. I figure I’ll need to tweak it a couple more times before I’m really comfortable with it, mostly figuring out what material to cut.
Are you going to PASS Summit this year, from November 4th through the 8th? If so, I have a two-fer for you.
First, I’m going to present a full-day training called Architecting a Big Data Analytics Solution in SQL Server 2019. I’m going to draw a few boxes, link together a bunch of interesting technologies (hints: SQL Server Big Data Clusters, Docker, Kubernetes, Spark, Hadoop, SQL Server Machine Learning Services, and a couple more are already on my slate), and rant about why I hate the term “big data.” You have to give me money if you want to see this training, but it will be worth it to see the cutting edge in the data platform space.
Second, I’m going to present a regular session entitled SQL Server Machine Learning Services in Production. Learn from someone who has been working with SQL Server Machine Learning Services since it was SQL Server R Services. You’ll discover what kinds of applications work well in ML Services as well as what kinds of workloads don’t do well at all. You’ll learn about the security and management models driving Machine Learning Services. You’ll also get the benefit of learning about what kinds of coding practices my team has developed to improve our development experience.
Come to PASS Summit and check out these talks as well as over a hundred other excellent talks.