The summary of the problem is that data nodes in HDP 2.5 and later are on a Docker private network. For most cases, this works fine, but PolyBase expects publicly accessible data nodes by default—one of its performance enhancements with Hadoop was to have PolyBase scale-out group members interact directly with the Hadoop data nodes rather than having everything go through the NameNode and PolyBase control node.
The first thing we need to do is change SQL Server’s hdfs-site.xml file. You can find it in %PROGRAMFILES%\MSSQL[##].MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf on a default installation of SQL Server, where [##] is the version number and MSSQLSERVER is your instance name.
Now PolyBase will use hostnames rather than IP addresses. This will avoid the problem where it tries to connect to a 172.17.0.* IP address and fails because that subnet is not routable.
Update Hadoop Configuration
On the HDP sandbox, we need to open up some ports. To do so, ssh into your root node (by default, the username is root and the password hadoop) and run the following commands to modify the proxy deployment script:
cp /sandbox-flavor .
You don’t actually have to use vi here, though that’s my editor of choice. Scroll down to the bottom of the tcpPortsHDP list and you’ll want to add entries for three ports: 8050, 50010, and 10020. Save this file and then run the following shell commands to generate a script and replace your proxy deployment file with the newly-generated version:
mv proxy/proxy-deploy.sh proxy/proxy-deploy.sh.old
cp deploy-scripts/sandbox/proxy/proxy-deploy.sh proxy/
Now that we have a script in place, we need to stop all of the data nodes and restart the cluster. First, run ./sandbox-stop.sh to stop the sandbox. Then, run docker ps to see if there are any data nodes still running. If so, go ahead and kill them with docker kill (node ID). Once everything is dead as a doornail, run ./proxy/proxy-deploy.sh to build a new image with all of the ports we need open. After it’s done, run docker ps and look for an entry which looks something like 0.0.0.0:50010->50010/tcp. If you see that, you’ve completed the mission successfully. Restart Linux on the sandbox and after everything boots up, you should be able to use your Hortonworks sandbox with PolyBase just like any other HDP cluster.
I have a new batch of updates coming out soon, so I wanted to walk you through some of the process. First, the average rating over the past eight weeks. The course launched about 10 weeks ago, so I’m stripping out that early period of first-movers who love everything and give it a 5-star rating.
The first batch of updates hit on July 16th, when I had enough data to learn where some of the problems were and fix them. That bumped the score up a bit, where it has mostly leveled off in the 4.4-4.45 range. Since then, I’ve had a few minor changes but nothing as big as the June 16th release or the upcoming release. My hope is that this next set of changes bumps me into the 4.5-4.6 range.
To figure out what to change, I get a helpful, sortable table with information on how people are doing. Here it is sorted by percentage of people who gave up and asked for the solution:
When 80% of the people are unable to answer your question, there’s a problem… Drilling in, we get to see the error messages learners get when trying out this problem:
From there, I can view the actual submitted code and see what my solution looks like versus the submitted code. Here’s the top-ranked example (with a bonus that you get the solution. If you found this blog post and remember it when going through my course, I think that’s worth one free solution):
With this information, I can modify the course. In this particular exercise, I decided to do two things: first, I provided the number of seconds and milliseconds to each DATETIMEOFFSETFROMPARTS() call. That was a common mistake where people filled in 8.0 for seconds.milliseconds rather than 7.999. Second, in the hint, I provide the exact date in a format which just happens to match the order of inputs. My hope is that these two things are enough to raise the success rate. For this example above, people put in 2039 as the year instead of 2038. Reading the improved hint, they’d see that the year is definitely 2038.
I ended up making these changes on about 15-20 exercises and will keep making changes to make the course successful. If you are going through the course and have direct feedback, hit the (!) button at the top-right of the screen on any exercise and leave your thoughts.
If you want to try this course out and learn all about working with times, dates, and time-related aggregations in SQL Server, get to the course on DataCamp. The first chapter is free, though subsequent chapters come at a cost. Just money, though, not souls or firstborn children.