Today’s post will look in some detail at the first of several external data sources.  First, let’s check out MSDN. There are a half-dozen options, but for today, we’re focusing on a Hadoop cluster:

-- PolyBase only:  Hadoop cluster as data source   
-- (on SQL Server 2016)  
CREATE EXTERNAL DATA SOURCE data_source_name  
    WITH (   
        TYPE = HADOOP,
        LOCATION = 'hdfs://NameNode_URI[:port]'  
        [, RESOURCE_MANAGER_LOCATION = 'ResourceManager_URI[:port]' ]  
        [, CREDENTIAL = credential_name ]
    )
[;]  

Using This Knowledge

We’ve got Polybase configured, and we know which Hadoop source we’re using (Hortonworks HDP 2.4 in my case), so let’s define an external data source.  Here’s one which I’ll call HDP2:

CREATE EXTERNAL DATA SOURCE [HDP2] WITH
(
	TYPE = HADOOP,
	LOCATION = N'hdfs://sandbox.hortonworks.com:8020',
	RESOURCE_MANAGER_LOCATION = N'sandbox.hortonworks.com:8050'
)

There are a couple of things I want to point out here.  First, the Type is HADOOP, one of the three types currently available:  HADOOP (for Hadoop, Azure SQL Data Warehouse, and Azure Blob Storage), SHARD_MAP_MANAGER (for sharded Azure SQL Database Elastic Database queries), and RDBMS (for cross-database Elastic Database queries on Azure SQL Database).

Second, the Location is my name node on port 8020.  If you’re curious about how we figure that one out, go to Ambari (which, for me, is http://sandbox.hortonworks.com:8080) and go to HDFS and then Configs.  In the Advanced tab, you can see the name node:

namenode

Scroll down a bit and you’ll see the “Advanced core-site” section.  Click the drop-down for that and we’ll see fs.defaultFS:

hdfslocation

That’s how we can tell what we should put in for Location.

As for RESOURCE_MANAGER_LOCATION, this is technically optional.  If you never want to perform predicate pushdown, then you don’t need to put it in.  With a Hadoop cluster, we almost definitely want to set this.

YARN (Yet Another Resource Negotiator) is our resource manager of choice, and finding its port is a similar process.  First, go to YARN, and then Config and Advanced:

yarnconfigs

Scroll down to Advanced yarn-site and somewhere in the middle, you’ll see the resource manager location configuration setting:

resourcemanager

Thus, our resource manager location is sandbox.hortonworks.com:8050—note that there’s no protocol (like hdfs:// or http:// or tcp://).  Also, in practice, your resource manager might be on a different host from your name node, so don’t just assume that they’re on the same machine.

Conclusion

Early on, I didn’t understand the importance of the resource manager.  That works fine for something like Azure Blob Storage, where we don’t have a resource manager.  But if we want our big powerful Hadoop cluster to take its share of the load (rather than just sending back the entire file to SQL Server and letting it do all the heavy lifting), we’re going to need to tell the Polybase code where to find YARN.

Advertisement

8 thoughts on “External Data Sources: Hadoop

    1. Hi, Jason. Polybase and HDInsight don’t play very well together: https://36chambers.wordpress.com/2017/10/10/t-sql-tuesday-95-polybase-and-hdinsight/

      The recommended method has been to process data in HDInsight, write out results to Azure Blob Storage, and then use Polybase to ingest from Blob Storage (links on how to do this at https://36chambers.wordpress.com/polybase/). That means, however, that you can’t use predicate pushdown to build a MapReduce job on HDInsight.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s