All of my experience with Hadoop has been with the Hortonworks Data Platform. But to make sure that the advice I give in PolyBase Revealed works for Cloudera too, I grabbed the Cloudera QuickStart VM and took it for a spin. The two products are pretty similar overall, but there are a few things I ended up figuring out along the way.
YARN Moved Out From Here, Buddy
My experience with Hortonworks has me thinking of port 8050 whenever someone asks for the YARN management port. But if you go knocking on port 8050 on the QuickStart VM, you get nothing back. That’s because the YARN resource manager listens on port 8032 by default.
If you want to check your installation to see where your YARN resource manager address is, there’s an easy way to do it. Navigate to /etc/hadoop/conf and run the following command:
grep -n1 yarn.resourcemanager.address yarn-site.xml
That returns three lines: the line before the match, the matching property name, and the line after it. The port is on the third line.
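If you would rather get just the port number back, something like the following might work. This is a minimal sketch, not anything Cloudera ships: yarn_rm_port is a hypothetical helper name, and it assumes the value element sits on the line directly after the matching name element, which is the same layout the grep command above depends on.

```shell
# Hypothetical helper: print the port from yarn.resourcemanager.address.
# Usage: yarn_rm_port <path-to-yarn-site.xml>
# Assumes <value>host:port</value> is on the line right after the <name> line.
yarn_rm_port() {
    # Grab the matching <name> line plus the one line that follows it,
    # then keep only the digits after the last colon inside <value>.
    grep -A1 '<name>yarn.resourcemanager.address</name>' "$1" \
        | sed -n 's#.*<value>.*:\([0-9][0-9]*\)</value>.*#\1#p'
}
```

On the QuickStart VM you would call it as yarn_rm_port /etc/hadoop/conf/yarn-site.xml. If the property is missing, or the value is laid out differently, the function prints nothing.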
Cloudera’s QuickStart VM tries to make things easy for you. One area where it does that is auto-generating /etc/hosts. If you read the hosts file, you can see a comment at the top telling you that the file is regenerated automatically.
This setup is all well and good if you’re just playing with Cloudera locally, but if you want to access CDH remotely, especially when working with PolyBase, you need to make a couple of changes. First, you need to set a valid IP address for quickstart.cloudera and quickstart, not just 127.0.0.1. If you need help figuring out what IP address to use, check the output of ifconfig, which will tell you what IP addresses you have registered.
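As an illustration of what the new entry might look like, here is a small sketch that builds the hosts line for a given address. The function name quickstart_hosts_line and the address 192.168.100.10 are both made up for this example; substitute whatever routable address ifconfig shows on your VM.

```shell
# Hypothetical helper: build an /etc/hosts line mapping the QuickStart
# host names to the IP address passed as the first argument.
quickstart_hosts_line() {
    printf '%s\tquickstart.cloudera\tquickstart\n' "$1"
}

# Example with a made-up address; prints a tab-separated hosts entry.
quickstart_hosts_line 192.168.100.10
```

The output is the address followed by the two host names, which you can paste into /etc/hosts in place of the 127.0.0.1 mapping for those names.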
But we aren’t done yet. Like the comment states, Cloudera regenerates /etc/hosts each time you reboot the machine or restart the QuickStart services. To avoid this, open up /etc/init.d/cloudera-quickstart-init and comment out the line which calls cloudera-quickstart-ip. After that, you manage /etc/hosts yourself. You could also modify the quickstart IP script if you’d still like Cloudera to do the work of updating for you, but because I tend to set static IP addresses for VMs, I’m okay managing it myself.
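If you prefer to script that edit rather than make it by hand, one possible approach is sketched below. The helper name comment_out_ip_call is hypothetical, and the sketch assumes the call appears on a single line containing the text cloudera-quickstart-ip. It writes the modified script to standard output so you can review the change before replacing anything.

```shell
# Hypothetical helper: print a copy of an init script with any uncommented
# line that mentions cloudera-quickstart-ip commented out.
# Usage: comment_out_ip_call <path-to-init-script>
comment_out_ip_call() {
    awk '
        # Prefix matching lines that are not already comments with "# ".
        /cloudera-quickstart-ip/ && $0 !~ /^[ \t]*#/ { print "# " $0; next }
        # Pass every other line through unchanged.
        { print }
    ' "$1"
}

# Example (review the result before copying it over the original):
# comment_out_ip_call /etc/init.d/cloudera-quickstart-init > /tmp/cloudera-quickstart-init.new
```

Because lines that are already commented are left alone, running it twice gives the same result as running it once.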
My Data Node Has a First Name, It’s O-S-C-A-R
Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being quickstart.cloudera like the host name, is actually localhost. You can change this in
Because PolyBase needs to have direct access to the data nodes, having a node called localhost is a bit of a drag.
Today, we looked at a few non-obvious ways in which the Cloudera Distribution of Hadoop QuickStart VM differs from a Hortonworks Data Platform installation. There are other differences as well, but these were a few of the most apparent ones.