Not too long ago, I decided to try connecting to HDInsight using SSIS. One of the Integration Services updates for SQL Server 2016 is Hadoop support. This Hadoop support comes via the Hadoop Connection Manager, which allows for two connection methods: WebHCat and WebHDFS. My preference is WebHDFS, but it appears that we cannot use WebHDFS at all with HDInsight. So that leaves WebHCat.
What Is WebHCat?
WebHCat is a web-based REST API for HCatalog, a management layer for dealing with files in HDFS. If you’re looking for configuration settings for WebHCat, you’ll want generally to look for “templeton” in config files, as Templeton was the project name before WebHCat. In Ambari, you can go to the Hive configs and look at webhcat-site.xml for configuration settings. For WebHCat, the default port in HDInsight is 30111, which you should find in the templeton.port configuration setting.
Connect To Port 30111
Now that we know this, it’s as easy as putting your cluster name into the WebHCat host, putting in the Hadoop user account, and hitting the Test Connection button:
And, if you’re following along at home, prepare to wait about 2 minutes and have it fail.
It turns out that HDInsight clusters are accessible through port 443 via URL rewrites, which is pretty cool: you always go through SSL connections, and based on the URL you go to, you’ll get forwarded on to the correct service. In practice, this makes it fairly easy to connect to WebHCat via curl.
Given that we need to connect on port 443 via an HTTPS connect (thus SSL), it seems that we should set the WebHCat port to 443 and turn on HTTPS and we’ll be good:
This also failed to connect when I hit Test Connection, giving me a weird System.AggregateException error at that (though I don’t have a screenshot of it). So what do we do now?
Capture Those Packets!
Wireshark is your friend, assuming that you’re allowed to install it in your environment. Make sure of it first; people have been fired for less…
So let’s fire up a packet capture and see what exactly is happening. I’ve removed any IP addresses or other identifiers, as it’s going out to a public IP, but we can easily see what’s going on here.
The red boxes show me trying to connect directly to port 30111 and we can see why we get errors: we’re sending SYN packets (to try to open a connection) but not receiving an ACK (acknowledgement) packet from the server. After a while, we have a connection timeout. Our HDInsight cluster is not listening to public requests on port 30111.
The blue box shows what’s happening when I use port 443 instead of 30111. First, we have SYN and ACK packets, which establish the connection. Next, SSIS tries to connect to the server on port 443 and…wait, it sends an HTTP request instead of HTTPS! Notice that we try to re-transmit a few times, assuming that the server simply lost track of time or got busy or left the phone on vibrate or something, when in reality, port 443 is studiously ignoring our packet because it’s not over SSL. Eventually our client gets the hint and sends a FIN packet to tell our the server that we’re just going to to shaddup about that whole HTTP request thing.
How Should It Look?
By contrast, I’ve taken a screenshot of what a successful SSL connection looks like when I use my browser to connect to the cluster:
We start with SYN and ACK, followed by an SSL Hello, which the server acknowledges. Then we have a key exchange and the server is happy, so it responds to our request and sends the data we want. That’s how it should look.
Fixes And Workarounds
Right now, there is no fix for Test Connection. At PASS Summit, I had a chance to discuss this issue with members of the SSIS team and it sounds like they are aware of the bug and are working on a fix.
In the meantime, it actually can work; if you hit OK and create the connection manager, the SSIS processes will actually connect via SSL.