T-SQL Tuesday 95: Polybase And HDInsight

tt_300_11FF5A1D Today is T-SQL Tuesday #95, hosted by Derik Hammer, and the topic is Big Data.

This is a continuation of my Polybase series. If you haven’t seen this already, I have a twenty-some part series on Polybase in SQL Server 2016 and now SQL Server 2017. Polybase hasn’t changed with the RTM release of SQL Server 2017, so I haven’t needed to write much new about the topic lately. That’s my excuse for being lazy, at least.

Additional Resources

If you’re new here, you might be interested in the rest of my Polybase series, which I have in rough reading order. I also turned this series into a presentation.

You may also like my Kafka series, which led to a presentation on the topic. Kafka is a fantastic distributed message broker which lives in the Hadoop ecosystem, and I can’t say enough good things about it.

Maybe you’re curious about this Hadoop thing but don’t know where to start. Go check out my presentation introducing Hadoop, which I have recorded as well.

After that, if you want to see several methods you could use to connect Hadoop to SQL Server (including Polybase), check out my presentation on the topic, which the folks at NDC Sydney helpfully recorded.

Finally, if you just want to get an idea of what the modern data platform space looks like, I have a presentation on that as well. Carlos L Chacon interviewed me on the SQL Data Partners podcast in Episode 47 and we talked about this in depth, including where Hadoop fits into this ever-growing combination of technologies.

Now, On With The Show

In today’s post, I want to look at trying to connect a SQL Server instance to Azure HDInsight and see if I can take advantage of predicate pushdown. The short answer is, not really. The long answer is, read on.

Let’s Spin Up HDInsight

The first step here is to start up an HDInsight cluster. We won’t need a powerful cluster for this demo.

First, in the Azure Portal, search for HDInsight and you’ll get the option to create a new cluster. Here are my cluster configuration options:

1_ClusterConfiguration

I just wanted a classic Hadoop cluster and didn’t need to install HBase, Kafka, or anything else. Note that Windows is an option for operating system, but nobody in their right mind runs Hadoop on Windows. It’s dead; leave it be.

Next, fill out the basics. Here’s my setup:

2_Basics

You can configure your storage options after that. I’m choosing the default of Azure Storage, though you can also read from Azure Data Lake Storage. If you want Hive or Oozie metadata to persist after you kill the cluster—something that you might want to do in production as you shut down the HDInsight cluster when not in use—you can choose an Azure SQL Database in the dropdowns at the bottom.

3_Storage

Once you’ve got storage set up, you want to edit the cluster size. By default, you get two D12 head node and 4 D4 worker nodes. This will run you $3.63 an hour in East US. That’s about $87 a day, or roughly $2600 per month.

4_EditClusterSize

You probably want to trim that down a bit if you’re just playing around. Here’s what I ended up with:

5_SmallerNodes

Just using 2 D3 nodes cuts the price in half. It’d be a terrible demo for Hadoop and completely unfair in a performance test, but our purpose here is just to get a simple Polybase demo working, not provide a comprehensive study on Hadoop performance.

Once you’ve got everything set up, expect to stare at this for a while:

Configuring For Predicate Pushdown

Next up, we want to configure Polybase to use our Hadoop installation. To get the right version of Hadoop, I need to SSH to the NameNode:

7_SSH

My HDInsight cluster is using Hortonworks Data Platform version 2.6.2.2-5:

8_HDPVersion

That means that in my yarn-site.xml file, all references will be to /usr/hdp/2.6.2.2-5.

The Problem

But now we run into a problem: there are certain ports which need to be open for Polybase to work. This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs. This goes back to the issue we see with spinning up data nodes in Docker: ports are not available. If you’ve put your HDInsight cluster into an Azure VNet and monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.

One Workaround

There’s one consistent workaround that I see people use: have Azure Blob Storage work as an intermediary. In this scenario, you’d use Polybase to write to Azure Blob Storage and then pull that data into your Azure Storage local to the HDInsight cluster. This works for data transfer, but it doesn’t allow for predicate pushdown.

Using A Deployment Template

Robert Alexander has created a deployment template to spin up a SQL Server instance and an HDInsight cluster with a sample data set, so if you have an Azure subscription and want to play along, it’s pretty easy. Just hit the “Deploy” button, fill in a few details, and be prepared to wait half an hour. Note that the average daily cost is just over $150 for this demo, so you probably want to delete the objects as soon as you can to save money. You will be charged for an HDInsight cluster even if you’ve shut off all of the machines. There is no free pause option; the only way not to pay for the service is to delete the cluster.

Unfortunately, I was not able to get this template working. I got through step one of the deployment but got stuck with a status of Action Required:

10_ActionRequired

When clicking on the deployment link, it’d sit there spinning, waiting for something to load. In my browser console, I saw the problem:

9_APIError

Trying to go to that link gives an error message:

{"message":"You don't have access to this deployment ({deploymentID})."}

In this case, {deploymentID} is the exposed GUID after /deployments/ in the link above. I tried this with a couple of accounts, but was unable to spin up this demo with either of them.

Conclusion

I’m not particularly fond of HDInsight at this point. There’s a lot of effort to get started, and frankly, if I’m going to go through all of this effort, I’ll just build my own Cloudera or Hortonworks cluster and get full control. I’ve also been unable to use Polybase on my on-prem SQL Server instance against an HDInsight node. There are cases where I can see HDInsight being useful, but I don’t think there’s enough time and administrative savings over building your own Hadoop cluster to make it a worthwhile choice. Maybe in a few years they’ll have it nailed down, but for now, I’d stick to custom-built Hadoop either on-prem or as VMs in a cloud provider like Azure or AWS.

S	M	T	W	T	F	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

36 Chambers – The Legendary Journeys: Execution to the max!