My last post covered reasons why I think Polybase is big. Today’s post will cover reasons why I’m potentially wrong and what I plan to do about it.
Motivation
Here’s my motivation in two tweets:
@feaselkl Because it’s crazy rare in the wild. #sqlsummit
— Brent Ozar (@BrentO) October 27, 2016
The Problems
So let’s talk about why it’s so rare in practice.
Versions And Editions
Polybase is currently available to us in two forms: SQL Server PDW or APS (i.e., devices which are too expensive for my employer to purchase) or SQL Server 2016 Enterprise Edition. The last survey of SQL Server versions in the wild that I can find is this one from September of 2015, that is, more than a year after SQL Server 2014 was released. The most popular version by far was 2008 R2, with the latest version in 5th place.
If we assume something like this holds true for 2016, then we’re looking at some fraction of about 4% of the population which can even use Polybase, and that’s assuming other things hold true. There is one big reason why I’m not so sure this assumption holds, though: it seems to me that this year, a larger number of companies are biting the bullet and moving off of 2008 R2. The change to core-based licenses in 2012 held back a large percentage of companies for nearly half a decade, but I think even a lot of the slower movers are starting to see major advantages to upgrading. Some companies will stay on 2008 R2 until Microsoft drops support altogether (and maybe even afterwards), others will begrudgingly move to 2012, but I can see a lot of them skipping 2014 and going straight to 2016. At PASS Summit and in the Triangle area, I’ve heard many more stories about people upgrading than I did last year, so I think there will be faster adoption than we saw for 2014.
Resources
As I mentioned in my tweet, there are relatively few people who regularly blog on Polybase, so few that I can probably count them on two hands (but if I’m not counting you, let me know!). Because so few people go beyond the basics of installation and really simple queries, it’s easy to hit a wall and give up when you run into problems. It’s wonderful to have a Niko Neugebauer hitting columnstore indexes from every angle and being a one-man encyclopedia on the topic; there’s nobody (outside of Microsoft) who really fills that role for Polybase, at least that I’ve found so far.
Configuration
Speaking of hitting walls, Polybase is not easy to configure. There are several text files you may have to edit, and figuring out what to change and how to do it is…not trivial. Even when things look like they’re working, you may not have things in true working order, as I found out when I tried to force MapReduce jobs (incidentally, I now have a solution, so I will get back to that!).
Cross-Platform Expertise
Another big problem, even for companies running SQL Server 2016 Enterprise Edition, is that you have to have a Hadoop cluster up and running. I don’t know the percentage of companies with working Hadoop clusters given an installation of SQL Server 2016 Enterprise Edition, but let’s take a quick guess. Suppose the adoption rate is faster than 2014 and we have about 4% server share for 2016, about 6 months in (noting that for 2014, it was 4% share at about 15 months in). It’s hard to make a simplifying assumption about percentage of companies with 2016, but let’s say 3% of firms have at least one 2016 instance in production. Of those, maybe half have Enterprise Edition (again, another guess), meaning that we’re looking at 1.5% of the population. Assuming 1/3 of those companies have a Hortonworks or Cloudera Hadoop cluster in place means that we’re looking at 0.5% of companies based on this series of wild guesses. And they may be wild guesses, but that number feels about right to me.
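To make that chain of guesses easier to poke at, here is a minimal Python sketch of the arithmetic. Every input is one of the guesses from the paragraph above, not measured data, and the variable and function names are made up for illustration. Swap in your own numbers to see how quickly the final share moves.

```python
from fractions import Fraction

# All three inputs are the guesses from the paragraph above, not survey data.
share_with_2016 = Fraction(3, 100)   # firms with at least one 2016 instance in production
share_enterprise = Fraction(1, 2)    # of those, firms running Enterprise Edition
share_with_hadoop = Fraction(1, 3)   # of those, firms with a Hortonworks or Cloudera cluster


def polybase_ready_share(with_2016, enterprise, hadoop):
    """Multiply the chained guesses into one overall share of companies."""
    return with_2016 * enterprise * hadoop


# Intermediate step: firms with 2016 Enterprise Edition at all.
enterprise_2016 = share_with_2016 * share_enterprise
# Final step: firms that also have a Hadoop cluster in place.
overall = polybase_ready_share(share_with_2016, share_enterprise, share_with_hadoop)

print(f"2016 Enterprise Edition: {float(enterprise_2016):.1%}")
print(f"Plus a Hadoop cluster:   {float(overall):.1%}")
```

Because the three guesses are multiplied together, an error in any one of them carries straight through to the result: doubling the Hadoop guess alone would double the final share.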
To get Polybase right, you need at the very least someone who knows SQL Server well and someone who knows Hadoop well. Those don’t need to be the same person, and often won’t be, as they’re different tech stacks.
Roughness Around The Edges
Polybase is technically not a V1 product—it’s been around for a few years now. It is, however, rough. There are some things which made sense back in 2010 and 2011 (around the time it was originally developed) which are showing their age, and when there isn’t much documentation available to help when you get weird errors, it can make life painful enough to want to quit. But let’s not do that.
How Can I Help?
Instead of quitting, I plan to double down on Polybase. The goal of this series is to highlight Polybase from start to finish across multiple platforms. I don’t plan to make my Polybase documentation as comprehensive as Niko’s is for columnstore (particularly given that he’s been working at it for 3 years), but my plan is to walk through a number of topics, including installation, configuration, usage (including some advanced and odd scenarios), error handling, performance tuning, and maybe a few wacky theories.
Looking at the pain points above, I can’t get you 2016 Enterprise Edition (although you can get 2016 Developer Edition for free and try it at home). I can, however, help alleviate some of the other pain points:
- By covering a broad array of topics, I want to reduce the barriers to entry for Polybase. That will allow other people to share their experiences and how they solved particular problems, thereby making the community as a whole a little better off.
- I intend to cover configuration for several data sources, including Hortonworks Hadoop, HDInsight, Azure Blob Storage, and Azure SQL Data Warehouse. I might also give Cloudera’s Hadoop installation a try.
- Throughout this series and my concurrent “managing Hadoop” series, we’ll pick up a few things about that platform. I’m not an expert and I don’t pretend that I’ll make you an expert, but we’ll get a little better.
- As I experience pain points, I’m going to file Connect items and poke anybody I can to improve the product.
This is a long road, but I think it’s a worthwhile one. Ask me again in a few months if I feel the same way…