As I mentioned earlier this week, I’m restarting my series on Polybase, digging deeper than before. Today’s post is non-technical; it explains my motivation for the series.
Polybase was featured in both PASS Summit 2016 keynotes. You can watch them on PASStv (keynote day 1, keynote day 2).
If you don’t want to watch the keynotes, the big news comes in the first keynote, where they demonstrate one Polybase query against a number of data sources, including Elasticsearch, Teradata, MongoDB, and Oracle. Add to that support for Hadoop, Azure Blob Storage, and Azure SQL Data Warehouse and you have a powerful multi-platform querying tool.
But hey, that’s for the future. Let’s talk about why I love it today.
What’s Good About Polybase?
Polybase gives you a way of connecting one or more SQL Server instances to a Hadoop cluster. Let’s look at some implications of this:
- We can move data into a Hadoop cluster, relieving some of the load on a SQL Server instance. For example, suppose you have large amounts of data streaming into a SQL Server instance, where it gets aggregated as part of an hourly or daily rollup process. We could do the heavy lifting in Hadoop (which scales out to solve these sorts of problems quite nicely) and pull the results into SQL Server. This means less of a burden on SQL Server.
- With a smaller burden on SQL Server, you might be able to reduce licensing costs. SQL Server Enterprise Edition is an expensive piece of software. It’s a great piece of software and totally worth the money (specifically, my employer’s money, not mine…), but if offloading work to Hadoop lets you license fewer cores, the savings are worth considering.
- Polybase provides an easier interface for application and database developers who are not particularly familiar with Hadoop-friendly technologies and languages (such as Scala and Python). Instead, you can write native T-SQL and have Polybase translate it into MapReduce jobs, which ends up being much faster than querying Hive through a linked server.
- Direct integration with Hadoop means that you can join external tables to SQL Server tables. One pattern for this that I have seen is taking aggregated measures from Hadoop and joining them to reference data in SQL Server.
Aside from Hadoop, Polybase currently supports Azure Blob Storage and Azure SQL Data Warehouse. I haven’t had much of a need for either of these, but there are good use cases for both, which I’ll explore throughout the series.
After more than 10 months with SQL Server 2016, I still rank Polybase as my favorite feature. It’s rough around the edges (as we’ll note in sometimes-excruciating detail), but I think it marks the future of SQL Server as an MPP database.