Today’s post will look in some detail at the first of several external data sources. First, let’s check out MSDN. There are a half-dozen options, but for today, we’re focusing on a Hadoop cluster:
-- PolyBase only: Hadoop cluster as data source -- (on SQL Server 2016) CREATE EXTERNAL DATA SOURCE data_source_name WITH ( TYPE = HADOOP, LOCATION = 'hdfs://NameNode_URI[:port]' [, RESOURCE_MANAGER_LOCATION = 'ResourceManager_URI[:port]' ] [, CREDENTIAL = credential_name ] ) [;]
Using This Knowledge
We’ve got Polybase configured, and we know which Hadoop source we’re using (Hortonworks HDP 2.4 in my case), so let’s define an external data source. Here’s one which I’ll call HDP2:
CREATE EXTERNAL DATA SOURCE [HDP2] WITH ( TYPE = HADOOP, LOCATION = N'hdfs://sandbox.hortonworks.com:8020', RESOURCE_MANAGER_LOCATION = N'sandbox.hortonworks.com:8050' )
There are a couple of things I want to point out here. First, the Type is HADOOP, one of the three types currently available: HADOOP (for Hadoop, Azure SQL Data Warehouse, and Azure Blob Storage), SHARD_MAP_MANAGER (for sharded Azure SQL Database Elastic Database queries), and RDBMS (for cross-database Elastic Database queries on Azure SQL Database).
Second, the Location is my name node on port 8020. If you’re curious about how we figure that one out, go to Ambari (which, for me, is http://sandbox.hortonworks.com:8080) and go to HDFS and then Configs. In the Advanced tab, you can see the name node:
Scroll down a bit and you’ll see the “Advanced core-site” section. Click the drop-down for that and we’ll see fs.defaultFS:
That’s how we can tell what we should put in for Location.
As for RESOURCE_MANAGER_LOCATION, this is technically optional. If you never want to perform predicate pushdown, then you don’t need to put it in. With a Hadoop cluster, we almost definitely want to set this.
YARN (Yet Another Resource Negotiator) is our resource manager of choice, and finding its port is a similar process. First, go to YARN, and then Config and Advanced:
Scroll down to Advanced yarn-site and somewhere in the middle, you’ll see the resource manager location configuration setting:
Thus, our resource manager location is sandbox.hortonworks.com:8050—note that there’s no protocol (like hdfs:// or http:// or tcp://). Also, in practice, your resource manager might be on a different host from your name node, so don’t just assume that they’re on the same machine.
Conclusion
Early on, I didn’t understand the importance of the resource manager. That works fine for something like Azure Blob Storage, where we don’t have a resource manager. But if we want our big powerful Hadoop cluster to take its share of the load (rather than just sending back the entire file to SQL Server and letting it do all the heavy lifting), we’re going to need to tell the Polybase code where to find YARN.
Nice post, but how do I hook it up to my HDI cluster?
Hi, Jason. Polybase and HDInsight don’t play very well together: https://36chambers.wordpress.com/2017/10/10/t-sql-tuesday-95-polybase-and-hdinsight/
The recommended method has been to process data in HDInsight, write out results to Azure Blob Storage, and then use Polybase to ingest from Blob Storage (links on how to do this at https://36chambers.wordpress.com/polybase/). That means, however, that you can’t use predicate pushdown to build a MapReduce job on HDInsight.
Thanks Kevin, it seems to be very complicated than I originally thought.