UPDATE, 2019-09-04: Thanks to Christopher Conrad, whose comment below helped me figure out how to get this working. I have instructions for the HDP 2.6.5 and later sandbox available on a new blog post.
I didn’t want the next post in my Polybase series to be such a downer, but here goes:
tl;dr version: It does not look like the Hortonworks Data Platform 2.5 sandbox works with Polybase at this time.
Time For An Update
The Hortonworks Data Platform 2.5 is now generally available. It brings along with it niceties such as Spark 2.0, updated versions of Kafka & Storm, and quite a few little tweaks. Because it’s shiny and new, I want to use it.
My goal was to set up everything so that I can give my Integrating Hadoop and SQL Server presentation on HDP 2.5. Things started out positively: the HDFS interface has changed a little bit, but nothing big:
I could also get Sqoop, .NET file access, and Hive to work just fine. But when I tried to run a Polybase query, the query would run for just over 90 seconds and then fail with the following error message:
Msg 8680, Level 17, State 26, Line 20
Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase.
Where To Go When You Know It’s Broken
That error message isn’t incredibly helpful, but fortunately, there’s a bit more information available to us.
There are a few Polybase DMVs, but the two that are relevant here are sys.dm_exec_external_work and sys.dm_exec_compute_node_errors.
Because this is a quiet system, I didn’t really need to find the execution ID and could go straight to the compute node errors DMV, but I included both queries here because troubleshooting in a production environment likely will require both.
SELECT w.execution_id, w.type, w.input_name, w.status FROM sys.dm_exec_external_work w; SELECT e.error_id, e.create_time, e.compute_node_id, e.details FROM sys.dm_exec_compute_node_errors e WHERE e.execution_id = 'QID1136' ORDER BY e.create_time ASC;
Now we’re onto something:
Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException:
Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv
at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone)
at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status)
So this looks pretty bad: we can’t obtain a data block from one of the datanodes. My first step would be to attempt to pull this some other way, and I checked three different ways to make sure that the file is available and that it isn’t corrupt:
- I accessed the file through the HDFS graphical interface in Ambari. I was able to preview the first several dozen rows and download the entire file without a problem.
- I accessed the file through the WebHDFS interface (port 50070) from a .NET application, pulling down the entire 777-row file with no problems.
- I ran a curl command to grab the file. If you want to try this at home, you can build your own curl command. Mine was:# Basic information curl -i http://sandbox.hortonworks.com:50070/webhdfs/v1/tmp/ootp/secondbasemen.csv?op=OPEN # Automatically redirect curl -L http://sandbox.hortonworks.com:50070/webhdfs/v1/tmp/ootp/secondbasemen.csv?op=OPEN
Note that all three of these methods go through the WebHDFS interface, so I didn’t need to do all three—I could have safely pulled the file from curl and gotten the same information. Anyhow, now that we know that the file absolutely is there and is valid, let’s see what kind of story we can corroborate on our Hadoop cluster.
Checking Hadoop Logs
When I checked the Hadoop logs, I tried a few places. If you want to play along at home and you have the Hortonworks sandbox, these logs are in a folder at http://sandbox.hortonworks.com:50070/logs/.
- The datanode logs (hadoop-hdfs-datanode-sandbox.hortonworks.com.log) had zero records for my Polybase query. This told me that the datanode never received a request to do anything worth logging.
- The name node logs (hadoop-hdfs-namenode-sandbox.hortonworks.com.log) also had nothing of value.
- Because I have Ranger turned on, my audit log (hdfs-audit.log) does have something useful.2016-09-28 03:05:15,675 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=192.168.58.1 cmd=getfileinfo src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc 2016-09-28 03:05:16,088 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen csv dst=null perm=null proto=rpc 2016-09-28 03:05:39,004 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen csv dst=null perm=null proto=rpc 2016-09-28 03:06:03,502 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen csv dst=null perm=null proto=rpc 2016-09-28 03:06:38,392 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen csv dst=null perm=null proto=rpc
First, it’s interesting to note that the Polybase engine uses “pdw_user” as its user account. That’s not a blocker here because I have an open door policy on my Hadoop cluster: no security lockdown because it’s a sandbox with no important information. Second, my IP address on the main machine is 192.168.58.1 and the name node for my Hadoop sandbox is at 192.168.58.129. These logs show that my main machine runs a getfileinfo command against /tmp/ootp/secondbasemen.csv. Then, the Polybase engine asks permission to open /tmp/ootp/secondbasemen.csv and is granted permission. Then…nothing. It waits for 20-30 seconds and tries again. After four failures, it gives up. This is why it’s taking about 90 seconds to return an error message: it tries four times.
Aside from this audit log, there was nothing interesting on the Hadoop side. The YARN logs had nothing in them, indicating that whatever request happened never made it that far.
When Logs Just Won’t Do
Logs have failed us, but we still have one tool available to us: network packet captures. Fortunately, I have two machines with different versions of the Hadoop sandbox installed. First, here’s my working machine. In this case, the 137 IP address is my SQL Server VM, and my 149 IP address is my Hadoop sandbox/namenode/datanode combo.
The packet capture shows first that Polybase connects on port 8020. We saw this earlier in the logs above, where the Polybase engine asks for file info (using the getBlockLocations method) and where the datanode lives. Then, we get to the important part: Polybase connects to the datanode on port 50010, and pulls the second basemen file from there.
By contrast, with my HDP 2.5 setup, Polybase tries to connect to the datanode, whose IP is 172.17.0.2 (which is the internal Docker container address). Every two seconds for the duration of my check, I get back the following message: “Standard query response [hex ID] No such name PTR 220.127.116.11.in-addr.arpa.” So it’s trying to connect directly to the datanode but can’t because the node is now within a Docker container. Even though I’ve exposed port 50010 as a passthru port to Docker, the Polybase engine doesn’t care because the response that came back from the TCP session on port 8020 says to connect to 172.17.0.2 as the datanode, not my name node’s external IP. Note that I don’t have a good packet capture image.
Conclusions (For Now)
Like I mentioned at the top, this is a downer of a conclusion. I really want to be able to use Polybase + HDP 2.5 + Docker, but Polybase requires direct access to the datanodes. This doesn’t much matter with a classic Hadoop setup where all of the nodes are physical and it’s just a matter of setting up iptables correctly. But in the New Dockerized World, it makes implementing Polybase a lot trickier. I’m still searching for a solution. I have a question out on the Hortonworks forums and am trying to bang on some Microsoft doors, so hopefully there will be some traction there, but in the meantime, it looks like I’ll have two HDP VMs on my machine.
9 thoughts on “Polybase And HDP 2.5”
Hi Kevin, Hopefully the Docker+HDP 2.5 issue will be resolved. Wasn’t sure if you got any good feedback on this yet. I guess it’s HDInsight for me at the moment with the SQL-Server Hadoop integeration (it’s going to be expensive). Thanks again for all your hard work on this.
The short answer is that Microsoft is aware of the issue. I won’t speak for them, but I expect it’ll not be an easy issue to solve. The problem is that, for performance reasons, Polybase was designed to communicate directly with Hadoop data nodes. When the data nodes are put into a Docker subnet, outside processes won’t be able to access the data nodes, and so the Polybase query times out.
I’ll have to test HDP 2.5 via Ambari installation without using Docker. If that’s possible (in other words, if I can install HDP without the data node details being hidden behind Docker), then 2.5 and Polybase would work. The sandbox still wouldn’t, but at least the production version could.
I’m going to try to write a post sketching out alternatives once I have some solid answers. I’m hopeful that the Polybase team will come up with a solution in a reasonable time frame, but life goes on in the meantime.
I found a solution that works for me. After opening up port 50010, add the following to your hdfs-site.xml (on the SQL server side)