Upcoming Events: SQL Saturday Orlando

Key Details

What: SQL Saturday Orlando
Where: Seminole State College of Florida, 100 Weldon Blvd Building UP, Sanford, FL 32773
When: Saturday, September 28th.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

11:00 AM — 11:45 AM — Data Virtualization with PolyBase

There are three important notes around SQL Saturday Orlando:

  1. The session is only 45 minutes. Getting enough across in that time frame will be a bit of a challenge, but it’s one I’ll meet.
  2. This is a rare SQL Saturday where I won’t be in town on the Friday before: due to Open F#, I’ll arrive Saturday morning after overnight flights.
  3. This is SQL Saturday number 100 for me as a presenter. That’s a pretty hefty number and is good incentive for me not to miss my flights…

New Video: Eyes on the Prize

I’ve had a few people request that I make a video of my talk, Eyes on the Prize: Simple and Effective Dashboard Visualization Techniques. Here it is in its full glory, with no time limits.


When done right, data visualization informs users quickly and efficiently with a minimum of distraction. When done wrong, we end up with spinning 3D pie charts. In this talk, we will look at techniques for effective dashboard design, minimizing noise and clutter (like spinning 3D pie charts) while providing relevant information to the intended audience. We will see how different types of charts work, build out rules of thumb around which charts to use, and integrate visuals to create a coherent dashboard image. When it comes to dashboards, we will compare and contrast dashboards based on their intended uses, such as interactive dashboards versus TV dashboards. Finally, we will take an awful dashboard and turn it into something which end users will appreciate.

You can also get to the slides and additional resources.

Quick Thoughts on 100 SQL Saturdays

Last time I did one of these, I was way over 50 before I started writing. This time around, I’m at 99 and poised for 100.

For those of you keeping score, here’s the year-by-year breakdown:

  • 2013: 1
  • 2014: 3
  • 2015: 9
  • 2016: 22
  • 2017: 22
  • 2018: 26
  • 2019: 17 (through 2019-09-28)

I won’t have quite as many SQL Saturdays this year as last, but I should still be somewhere around 20 by the time we’re done.

My 100th SQL Saturday will be SQL Saturday Orlando 2019. I love that the math worked out this way because SQL Saturday #1 was in Orlando, way back in 2007. That kind of happenstance pleases me.

I may not be the first person to get to 100 (heck, I’m not even the first Kevin to get to 100) but when I presented at SQL Saturday Columbus in 2013, the idea of doing it 99 more times would have been ridiculous…and yet somehow it happened. Probably from me bribing a bunch of people.

The Institutional Value of SQL Saturday

I greatly appreciate the work Andy Warren, Steve Jones, and a cast of thousands have put into making this a successful global institution. SQL Saturdays have given me the opportunity to travel and speak all around North America and Europe.

On the opposite side, SQL Saturdays give people in our community a chance to see and speak with some of the best regional and national experts on topics across the data platform for the cost of a lunch and a Saturday indoors. It’s hard to think of a better deal than this.

You don’t see this kind of institutional support elsewhere. Code camps are great (by the way, have you signed up for Raleigh Code Camp yet? It’s this Saturday, October 5th), but they’re entirely independent. Learning that your city has a code camp going on is tricky and if you’re a non-local speaker looking to present at different code camps, there’s no really easy way to know where and when they are taking place.

This is the key benefit behind SQL Saturdays: they’re in one place, so we can easily see who’s putting them together and when.

The Places You’ll Go

SQL Saturdays have given me a great excuse to travel around significant parts of the world, including:

  • The United States. I’ve presented at SQL Saturdays in 27 states and Washington DC. I’m still holding out for SQL Saturday Maui…
  • Canada. My first foreign SQL Saturday was Nova Scotia, the File Not Found edition. To this day, it’s still one of my favorite events despite having 5 speakers and just over 5 non-speaking attendees. Aside from that, I’ve been to SQL Saturdays in Edmonton and Victoria.
  • Latin America: Costa Rica and Mexico.
  • Europe: Austria, England, and Ireland.

What’s Next?

Now that I’ve hit 100, I don’t have any more milestones, but I’d like to keep presenting at SQL Saturdays. Keeping up that 2016-2019 pace will be a challenge so I might not be able to do quite that many, but this isn’t the end for me, not by any stretch. In 2020, I’d really like to visit a few that I’ve never seen before, hit a few new states, and maybe venture into South America or Asia—especially if I can work out the travel details on the cheap.

Upcoming Events: Open FSharp

Key Details

What: Open FSharp 2019
Where: Broadway Studios, 435 Broadway, San Francisco, CA 94133
When: Thursday, September 26th and Friday, September 27th
Admission is $375 if you’re paying out of pocket or $475 for companies paying for tickets. Register on the Open F# website.

What I’m Presenting

02:00 PM — 02:50 PM — A Map to Success

I present on Thursday the 26th, and will aim my talk at database developers interested in learning some of the cool things you can do with F#, showing that it’s a great language for data platform professionals.

PolyBase: When You Have No Local Tables

Today’s PolyBase post is all about what happens when you want to join data from multiple data sources together, but none of your tables are local SQL Server tables.

No Local Tables Were Harmed in the Making of This Post

Let’s suppose we have two sets of data in two different sources. The first is North Carolina population data living in Azure Blob Storage. Here is the script to create the credential, external data source, file format, and external table.

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybaseblob',
SECRET = '<It''s a secret>';
GO
CREATE EXTERNAL DATA SOURCE AzureNCPopBlob WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://ncpop@cspolybaseblob.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
GO
CREATE EXTERNAL FILE FORMAT CsvFileFormat WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = True,
        STRING_DELIMITER = '"',
        ENCODING = 'UTF8'
    )
);
GO
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulation
(
    SumLev INT NOT NULL,
    County INT NOT NULL,
    Place INT NOT NULL,
    IsPrimaryGeography BIT NOT NULL,
    [Name] VARCHAR(120) NOT NULL,
    PopulationType VARCHAR(20) NOT NULL,
    Year INT NOT NULL,
    Population INT NOT NULL
)
WITH
(
    LOCATION = N'Census/NorthCarolinaPopulation.csv',
    DATA_SOURCE = AzureNCPopBlob,
    FILE_FORMAT = CsvFileFormat
);
With the external data source, file format, and external table in place, we can move on to our next table. This is a data set which lives in Spark and contains…well, the same data. What we join together isn’t that interesting, after all.

CREATE DATABASE SCOPED CREDENTIAL SparkCredential
WITH IDENTITY = '<Your Username>', SECRET = '<Your Password>';
GO
CREATE EXTERNAL DATA SOURCE ClusterinoSpark WITH
(
    LOCATION = 'odbc://clusterino:10016',
    CONNECTION_OPTIONS = 'Driver={Hortonworks Spark ODBC Driver}; Host = clusterino; Port = 10016; Database = default; ServerNode = clusterino:10016',
    CREDENTIAL = SparkCredential,
    PUSHDOWN = ON
);
GO
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulationTypedSpark
(
    SummaryLevel INT,
    CountyID INT,
    PlaceID INT,
    IsPrimaryGeography BIT,
    Name NVARCHAR(255),
    PopulationType NVARCHAR(255),
    Year INT,
    Population INT
)
WITH
(
    LOCATION = 'NorthCarolinaPopulationTyped',
    DATA_SOURCE = ClusterinoSpark
);

Once we have these two tables, we can join them.

SELECT
    ncp.[Name],
    ncp.PopulationType,
    ncp.Year,
    ncp.Population
FROM dbo.NorthCarolinaPopulation ncp
    INNER JOIN dbo.NorthCarolinaPopulationTypedSpark ncps
        ON ncp.SumLev = ncps.SummaryLevel
        AND ncp.County = ncps.CountyID
        AND ncp.Place = ncps.PlaceID
        AND ncp.PopulationType = ncps.PopulationType
        AND ncp.Year = ncps.Year;

You can extend this out to as many tables as you’d like. For each table, the PolyBase engine will determine whether to pull the data into a temp table locally or stream the data back, as well as whether to push down the operation to the remote source (if possible) or pull all of the data back. This is true for all PolyBase data sources, not just the two I showed here.
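If you want to influence that pushdown decision yourself, PolyBase supports query hints for it. Here is a minimal sketch (the queries themselves are illustrative, reusing the Spark table from above):

```sql
-- Force computation to happen on the remote source
-- (the query fails if pushdown is not possible).
SELECT COUNT(*)
FROM dbo.NorthCarolinaPopulationTypedSpark ncps
WHERE ncps.Year = 2017
OPTION (FORCE EXTERNALPUSHDOWN);

-- Or pull all of the rows back and do the work in SQL Server instead.
SELECT COUNT(*)
FROM dbo.NorthCarolinaPopulationTypedSpark ncps
WHERE ncps.Year = 2017
OPTION (DISABLE EXTERNALPUSHDOWN);
```

Without a hint, the engine uses its own cost estimates to decide, so these hints are mostly useful for testing and for cases where you know better than the optimizer.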

Category Theory: the Video Lectures

I just wrapped up watching Bartosz Milewski’s YouTube series on category theory. If you are at all interested in functional programming, this is an outstanding set of lectures.

The course covers category theory, a rather abstract branch of mathematics. Throughout the twenty videos, Milewski takes us through the landscape of category theory, grounding it as much as possible in the language and concepts of programming. Some of Milewski’s examples are in Haskell but you don’t need to know that language to understand what’s going on. Similarly, other examples are in C++, but it’s clear from the context what he means even if you’ve never seen a line of the language.

I came into this course with a fair knowledge of functional programming and a less-than-fair knowledge of category theory. Watching these videos has given me a much better understanding of the topic and has really cleared up some of the trickier concepts in functional programming like monads.

For further reading, Milewski has a blog full of posts on the topic and a group has (with permission) turned his posts into a book. There are also two more courses on category theory that Milewski has put together, helping us dig even further into the topic.

If you are interested in the series, don’t get too distracted by the intentionally-opaque examples. I’ve seen this in undergraduate courses I took on logic, where the professor wants to ensure that you don’t get stuck thinking about a specific example when explaining a general concept, as though the specific example were the only case. Milewski does a good job of combining the highly-general drawings of categories and how elements in categories can map to other categories via functors but then brings it down to examples we’re more familiar using, particularly with sets. There’s a balancing act involved in these examples and I think Milewski has that act pretty well covered.

It may take you a month or three to get through all of these videos, but I definitely recommend them. From here, I’m going to work through the book and fill in more gaps.

Upcoming Events: 24 Hours of PASS Summit Preview

Key Details

What: 24 Hours of PASS: Summit Preview 2019
Where: Online-Only
When: Tuesday, September 10th and Wednesday, September 11th.
Registration is free.

What I’m Presenting

08:00 PM — 09:00 PM (Eastern Daylight Time) — Architecting a Big Data Analytics Solution with SQL Server 2019

In this session, I am going to give a pre-con preview. If seeing this makes you want to get in on my full-day training, you can sign up at the PASS Summit website.

PolyBase + Dockerized Hadoop

Quite some time ago, I posted about PolyBase and the Hortonworks Data Platform 2.5 (and later) sandbox.

The summary of the problem is that data nodes in HDP 2.5 and later are on a Docker private network. For most cases, this works fine, but PolyBase expects publicly accessible data nodes by default—one of its performance enhancements with Hadoop was to have PolyBase scale-out group members interact directly with the Hadoop data nodes rather than having everything go through the NameNode and PolyBase control node.

Thanks to a comment by Christopher Conrad in that post, I learned how to solve this problem. I’ll focus on versions of HDP after 2.6.5. Once the new Cloudera gets its sandbox out, I’ll eventually get to checking that out. In the meantime, you can still grab the sandbox edition of the Hortonworks Data Platform distribution of Hadoop.

Update SQL Server Configuration

The first thing we need to do is change SQL Server’s hdfs-site.xml file. You can find it in %PROGRAMFILES%\MSSQL[##].MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf on a default installation of SQL Server, where [##] is the version number and MSSQLSERVER is your instance name.

Inside hdfs-site.xml, add the following property:

<property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
</property>
Now PolyBase will use hostnames rather than IP addresses. This will avoid the problem where it tries to connect to a 172.17.0.* IP address and fails because that subnet is not routable.

Update Hadoop Configuration

On the HDP sandbox, we need to open up some ports. To do so, ssh into your root node (by default, the username is root and the password hadoop) and run the following commands to modify the proxy deployment script:

cd /sandbox/deploy-scripts/
cp /sandbox-flavor .
vi assets/generate-proxy-deploy-script.sh

You don’t actually have to use vi here, though that’s my editor of choice. Scroll down to the bottom of the tcpPortsHDP list and you’ll want to add entries for three ports: 8050, 50010, and 10020. Save this file and then run the following shell commands to generate a script and replace your proxy deployment file with the newly-generated version:

cd /sandbox
./deploy-scripts/assets/generate-proxy-deploy-script.sh
mv proxy/proxy-deploy.sh proxy/proxy-deploy.sh.old
cp deploy-scripts/sandbox/proxy/proxy-deploy.sh proxy/

Restarting Hadoop

Now that we have a script in place, we need to stop all of the data nodes and restart the cluster:

  1. Run ./sandbox-stop.sh to stop the sandbox.
  2. Run docker ps to see if any data nodes are still running. If so, go ahead and kill them with docker kill (node ID).
  3. Once everything is dead as a doornail, run ./proxy/proxy-deploy.sh to build a new image with all of the ports we need open.
  4. After it’s done, run docker ps and look for an entry with a port mapping which looks something like 0.0.0.0:50010->50010/tcp. If you see that, you’ve completed the mission successfully.

Restart Linux on the sandbox and, after everything boots up, you should be able to use your Hortonworks sandbox with PolyBase just like any other HDP cluster.
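As a quick sanity check, you can point an external data source at the sandbox. This is a hypothetical sketch: the host name and ports are HDP sandbox defaults (8020 for the NameNode, plus the 8050 resource manager port we just opened), so substitute your own values if your setup differs.

```sql
-- Hypothetical example: host and ports are HDP sandbox defaults; adjust as needed.
CREATE EXTERNAL DATA SOURCE HDPSandbox WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox-hdp.hortonworks.com:8020',
    RESOURCE_MANAGER_LOCATION = 'sandbox-hdp.hortonworks.com:8050'
);
```

If external tables built against this data source return rows instead of timing out against a 172.17.0.* address, the hostname fix is working.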

Updates to Time Series Analysis in SQL Server

A couple months ago, I blogged about the soft launch of my DataCamp course, Time Series Analysis in SQL Server. Since then, I’ve been hard at work squashing bugs, clarifying instructions and hints, and getting the course from soft launch to hard launch.

I have a new batch of updates coming out soon, so I wanted to walk you through some of the process. First, let’s look at the average rating over the past eight weeks. The course launched about 10 weeks ago, so I’m stripping out that early period of first movers who love everything and give 5-star ratings.

It’s like watching your investment portfolio change from day to day.

The first batch of updates hit on July 16th, when I had enough data to learn where some of the problems were and fix them. That bumped the score up a bit, and it has since mostly leveled off in the 4.4-4.45 range. I’ve made a few minor changes since then, but nothing as big as the July 16th release or the upcoming release. My hope is that this next set of changes bumps me into the 4.5-4.6 range.

To figure out what to change, I get a helpful, sortable table with information on how people are doing. Here it is sorted by percentage of people who gave up and asked for the solution:

Psst, what’s the answer to question 8?

When 80% of the people are unable to answer your question, there’s a problem… Drilling in, we can see the error messages learners received when attempting this exercise:

I don’t think I correctly specified the 10 parameters, no.

From there, I can view the actual submitted code and compare my solution against what learners submitted. Here’s the top-ranked example (with a bonus: you get the solution. If you found this blog post and remember it when going through my course, I think that’s worth one free solution):

Where it all went wrong.

With this information, I can modify the course. In this particular exercise, I decided to do two things. First, I now provide the number of seconds and milliseconds for each DATETIMEOFFSETFROMPARTS() call; a common mistake was filling in 8.0 for seconds.milliseconds rather than 7.999. Second, in the hint, I provide the exact date in a format which happens to match the order of inputs. In the example above, people put in 2039 as the year instead of 2038; reading the improved hint, they’d see that the year is definitely 2038. My hope is that these two changes are enough to raise the success rate.
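To make the parameter discussion concrete, here is the shape of that call. The 7.999 seconds and 2038 come from the discussion above; treating the fractions as milliseconds follows from using a precision of 3.

```sql
-- DATETIMEOFFSETFROMPARTS() takes 10 parameters:
-- year, month, day, hour, minute, seconds, fractions,
-- hour offset, minute offset, precision.
-- With precision 3, fractions are milliseconds, so 7.999 seconds
-- means seconds = 7 and fractions = 999 (not seconds = 8, fractions = 0).
SELECT DATETIMEOFFSETFROMPARTS(2038, 1, 19, 3, 14, 7, 999, 0, 0, 3) AS TheDate;
-- Returns the datetimeoffset(3) value 2038-01-19 03:14:07.999 +00:00
```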

I ended up making these changes on about 15-20 exercises and will keep making changes to make the course successful. If you are going through the course and have direct feedback, hit the (!) button at the top-right of the screen on any exercise and leave your thoughts.

If you want to feel old: there are people who are legally old enough to drink in the United States who had not yet been born when this episode of the Simpsons first aired.

If you want to try this course out and learn all about working with times, dates, and time-related aggregations in SQL Server, get to the course on DataCamp. The first chapter is free, though subsequent chapters come at a cost. Just money, though, not souls or firstborn children.