PASS Summit 2015, Day 4

Day 4 of PASS Summit 2015 was a doozy.  I only attended two sessions, but they were big ones.

The first session I attended was by a couple program managers about R integration in SQL Server 2016.  They let out that there will be R tools for Visual Studio coming out.  RStudio is a great product, but having some competition in the market will help out both products.  They also noted that the official product name, at least as of right now, is SQL Server R Services.  This is rather unfortunate because it shares the same acronym as SQL Server Reporting Services.  I’d have to believe that this name will change between now and release day.

Regardless of the name, the product sounds very intriguing.  It’s designed today to support R queries, but the model was written in such a way that they will be able to incorporate other languages in the future, meaning that we could conceivably run, for example, Python code directly against a SQL server instance, bringing code to data.

The last big thing in that session is that R support will be available in all editions of SQL Server 2016.  They’ll differentiate it somehow, but at least basic support should be there.

Right after that session, I attended a session on PolyBase and that floored me.  If you follow my Twitter stream, you got my reactions.  But here’s the idea summed up.  PolyBase has lots of good things but tempered with some bad things.  First the good:

  • PolyBase supports Hortonworks and Cloudera, as well as HDInsight in Azure.  It also supports Azure blob storage.  It does not support MapR, though, nor does it support Elastic MapReduce.
  • External tables do not require Hive; they build directly on top of MapReduce.
  • If you want to select data from a file or folder in HDFS, write a basic select statement…or write one as complicated as you’d like.  PolyBase supports all of T-SQL, unlike certain big-name components.
  • If you want to insert data into Hadoop—such as maintaining archival data—it’s as easy as writing an insert statement.
  • For external tables, you will now get the option to build a SQL Server cluster, with one head node and at least one compute node.  That way, if you’re processing data from a large Hadoop query, you can spread the load over a number of SQL Server instances.  I see this a step toward giving SQL Server a true scale-out option.
  • PolyBase tables support statistics.  The statistics on a table will tell the optimizer whether to push predicates down to Hadoop or not.  Basically, there will be two main models:  you can either pull the data from HDFS into a temp table and filter the temp table, or you can run a Map-Reduce job and bring that data into a temp table.  It sounds like they may support hybrid options as well, where part of the predicate gets pushed but the rest gets handled in SQL.
  • There will be new DMVs to show what’s going on with queries, including execution times and rudimentary explain plans.

Not all of the news is good, however:

  • Each external table ties to a single credential.  If you want application- or user-level control over external tables, you can’t do that directly.  What you can do, however, is define multiple external tables pointing to the same file/folder, giving the external tables different names and tying them to different credentials.  That’s not optimal, but it’s a decent workaround.
  • Hive is not supported.  Neither is Tez.  Neither is Spark or HBase.  This version is pretty limited in what it does support, and I can see some people running a hybrid scenario with PolyBase handling some of the data requests and Hive or Spark-SQL over ODBC handling others.  The fact that neither Tez nor Spark is supported means that performance will suck compared to more optimized solutions, so we may see PolyBase adoption in scenarios in which timeliness is less important (like monthly reporting or long-running batch processes).
  • There are maximum row and column width limitations.  I didn’t catch the exact numbers, but you definitely cannot create varchar(max) columns with PolyBase.

Both of these are V1 products and I should know better than to get too giddy about V1 products, but I think support for R and Hadoop make SQL Server 2016 an even better option as a data platform.  Both of these products have limitations, some of which will be relaxed in future versions of SQL Server, but I can see both of these features really taking off next year.  Throw in a completely revamped SQL Server Reporting Services and some love for SSIS and SQL Server 2016 sounds like it’s going to be the biggest BI edition since 2005 and maybe ever.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s