Quite some time ago, I posted about PolyBase and the Hortonworks Data Platform 2.5 (and later) sandbox.
The summary of the problem is that data nodes in HDP 2.5 and later are on a Docker private network. For most cases, this works fine, but PolyBase expects publicly accessible data nodes by default—one of its performance enhancements with Hadoop was to have PolyBase scale-out group members interact directly with the Hadoop data nodes rather than having everything go through the NameNode and PolyBase control node.
Thanks to a comment by Christopher Conrad in that post, I learned how to solve this problem. I’ll focus on versions of HDP after 2.6.5. Once the new Cloudera gets its sandbox out, I’ll eventually get to checking that out. In the meantime, you can still grab the sandbox edition of the Hortonworks Data Platform distribution of Hadoop.
Update SQL Server Configuration
The first thing we need to do is change SQL Server's hdfs-site.xml file. On a default installation of SQL Server, you can find it in %PROGRAMFILES%\MSSQL[##].MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf, where [##] is the version number and MSSQLSERVER is your instance name.
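For instance, on a default SQL Server 2019 instance (where the version number is 15), that pattern would resolve to a path like the following; your drive, version number, and instance name may differ:

```
C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf
```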
Inside hdfs-site.xml, add the following property:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
Now PolyBase will use hostnames rather than IP addresses. This will avoid the problem where it tries to connect to a 172.17.0.* IP address and fails because that subnet is not routable.
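One practical note: once PolyBase resolves data nodes by hostname, the machine running SQL Server must be able to resolve that hostname itself. A minimal sketch, assuming the sandbox's default hostname of sandbox-hdp.hortonworks.com and a placeholder IP address (substitute the address your sandbox actually uses), is a hosts-file entry like this:

```
192.168.100.100    sandbox-hdp.hortonworks.com
```

On Windows, the hosts file lives at %SystemRoot%\System32\drivers\etc\hosts.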
Update Hadoop Configuration
On the HDP sandbox, we need to open up some ports. To do so, ssh into your root node (by default, the username is root and the password is hadoop) and run the following commands to modify the proxy deployment script:

cd /sandbox/deploy-scripts/
cp /sandbox-flavor .
vi assets/generate-proxy-deploy-script.sh
You don't actually have to use vi here, though that's my editor of choice. Scroll down to the bottom of the tcpPortsHDP list and add entries for three ports, among them 50010 (the DataNode data transfer port) and 10020. Save this file and then run the following shell commands to generate a script and replace your proxy deployment file with the newly-generated version:
./generate-proxy-deploy-script.sh
cd /sandbox
mv proxy/proxy-deploy.sh proxy/proxy-deploy.sh.old
cp deploy-scripts/sandbox/proxy/proxy-deploy.sh proxy/
Now that we have a script in place, we need to stop all of the data nodes and restart the cluster. First, run ./sandbox-stop.sh to stop the sandbox. Then, run docker ps to see if there are any data nodes still running. If so, go ahead and kill them with docker kill (node ID). Once everything is dead as a doornail, run ./proxy/proxy-deploy.sh to build a new image with all of the ports we need open. After it's done, run docker ps and look for an entry which looks something like 0.0.0.0:50010->50010/tcp. If you see that, you've completed the mission successfully. Restart Linux on the sandbox and, after everything boots up, you should be able to use your Hortonworks sandbox with PolyBase just like any other HDP cluster.
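That final docker ps inspection can also be scripted. The sketch below is not part of the original walkthrough: the ports_output variable stands in for the Ports column of docker ps output so the logic is self-contained, and on the sandbox you would populate it from the real command instead.

```shell
# Stand-in for the Ports column of `docker ps`; on the sandbox, populate
# this from the real command instead, e.g.:
#   ports_output=$(docker ps --format '{{.Ports}}')
ports_output='0.0.0.0:50010->50010/tcp, 0.0.0.0:8088->8088/tcp'

# The mapping we expect to see after proxy-deploy.sh has rebuilt the image.
if printf '%s\n' "$ports_output" | grep -q '0.0.0.0:50010->50010/tcp'; then
    echo "DataNode port 50010 is exposed"
else
    echo "DataNode port 50010 is NOT exposed"
fi
```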