Today’s post will look in some detail at the first of several external data sources.  First, let’s check out MSDN. There are a half-dozen options, but for today, we’re focusing on a Hadoop cluster:

-- PolyBase only:  Hadoop cluster as data source   
-- (on SQL Server 2016)  
    WITH (   
        TYPE = HADOOP,
        LOCATION = 'hdfs://NameNode_URI[:port]'  
        [, RESOURCE_MANAGER_LOCATION = 'ResourceManager_URI[:port]' ]  
        [, CREDENTIAL = credential_name ]

Using This Knowledge

We’ve got Polybase configured, and we know which Hadoop source we’re using (Hortonworks HDP 2.4 in my case), so let’s define an external data source.  Here’s one which I’ll call HDP2:

	LOCATION = N'hdfs://',

There are a couple of things I want to point out here.  First, the Type is HADOOP, one of the three types currently available:  HADOOP (for Hadoop, Azure SQL Data Warehouse, and Azure Blob Storage), SHARD_MAP_MANAGER (for sharded Azure SQL Database Elastic Database queries), and RDBMS (for cross-database Elastic Database queries on Azure SQL Database).

Second, the Location is my name node on port 8020.  If you’re curious about how we figure that one out, go to Ambari (which, for me, is and go to HDFS and then Configs.  In the Advanced tab, you can see the name node:


Scroll down a bit and you’ll see the “Advanced core-site” section.  Click the drop-down for that and we’ll see fs.defaultFS:


That’s how we can tell what we should put in for Location.

As for RESOURCE_MANAGER_LOCATION, this is technically optional.  If you never want to perform predicate pushdown, then you don’t need to put it in.  With a Hadoop cluster, we almost definitely want to set this.

YARN (Yet Another Resource Negotiator) is our resource manager of choice, and finding its port is a similar process.  First, go to YARN, and then Config and Advanced:


Scroll down to Advanced yarn-site and somewhere in the middle, you’ll see the resource manager location configuration setting:


Thus, our resource manager location is—note that there’s no protocol (like hdfs:// or http:// or tcp://).  Also, in practice, your resource manager might be on a different host from your name node, so don’t just assume that they’re on the same machine.


Early on, I didn’t understand the importance of the resource manager.  That works fine for something like Azure Blob Storage, where we don’t have a resource manager.  But if we want our big powerful Hadoop cluster to take its share of the load (rather than just sending back the entire file to SQL Server and letting it do all the heavy lifting), we’re going to need to tell the Polybase code where to find YARN.


8 thoughts on “External Data Sources: Hadoop

    1. Hi, Jason. Polybase and HDInsight don’t play very well together:

      The recommended method has been to process data in HDInsight, write out results to Azure Blob Storage, and then use Polybase to ingest from Blob Storage (links on how to do this at That means, however, that you can’t use predicate pushdown to build a MapReduce job on HDInsight.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s