PolyBase Revealed: CDH Quickstart And PolyBase

All of my experience with Hadoop has been with the Hortonworks Data Platform. But to make sure that the stuff I’m advising in PolyBase Revealed works for Cloudera too, I grabbed the Cloudera QuickStart VM and took it for a spin. The two products are pretty similar overall, but there are a few things I ended up figuring out along the way.

YARN Moved Out From Here, Buddy

My experience with Hortonworks has me thinking of port 8050 whenever someone asks for the YARN management port. But if you go knocking on port 8050 on the Cloudera QuickStart VM, you get nothing back. That’s because the YARN resource manager is listening on port 8032 by default.

If you want to check your installation to see where your YARN resource manager address is, there’s an easy way to do it. Navigate to /etc/hadoop/conf and run the following command: grep -n1 yarn.resourcemanager.address yarn-site.xml. That will return three lines (the matching line plus one line of context on either side), and the third line has your port.

Getting the YARN port in nineteen easy steps.
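
If you’d rather not count lines, here is a minimal sketch of a shell function that pulls just the port number out of yarn-site.xml. The function name yarn_rm_port is my own invention, and it assumes the <value> element sits on the line right after the matching <name> element, which is the layout the grep output implies.

```shell
# Print the port from yarn.resourcemanager.address in a yarn-site.xml file.
# Assumption: <value>host:port</value> is on the line directly after <name>.
# yarn_rm_port is a hypothetical helper name, not part of Hadoop or CDH.
yarn_rm_port() {
  conf_file="${1:-/etc/hadoop/conf/yarn-site.xml}"
  # -A1 keeps the matching <name> line plus the one line after it.
  grep -A1 '<name>yarn.resourcemanager.address</name>' "$conf_file" \
    | sed -n 's|.*<value>.*:\([0-9][0-9]*\)</value>.*|\1|p'
}
```

Calling yarn_rm_port with no argument reads /etc/hadoop/conf/yarn-site.xml; pass a different path as the first argument if your configuration lives elsewhere.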

Host Bodies

Cloudera’s QuickStart VM tries to make things easy for you. One area where it does that is to auto-generate /etc/hosts. If you read the hosts file, you can see a comment at the top:

And that’s when the police said that the ping was coming from inside the house!

This setup is well and good if you’re just playing with Cloudera locally, but if you want to try accessing CDH remotely, especially when working with PolyBase, you need to make a couple of changes. First, set a valid IP address for quickstart.cloudera and quickstart, not just 127.0.0.1. If you need help figuring out what IP address to use, check the output of ifconfig, which will tell you what IP addresses the machine has.
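
To confirm the change took, a quick sketch like the following can tell you which address a host name maps to in the hosts file and whether that address is still a loopback one. Both host_ip and is_loopback are hypothetical helper names of my own, and the lookup only reads the file; it does not consult DNS.

```shell
# Print the IP address that a host name maps to in a hosts file.
# host_ip and is_loopback are hypothetical helpers for illustration.
host_ip() {
  name="$1"
  hosts_file="${2:-/etc/hosts}"
  awk -v name="$name" '
    $1 ~ /^#/ { next }                 # skip comment lines
    {
      for (i = 2; i <= NF; i++)        # fields after the IP are host names
        if ($i == name) { print $1; exit }
    }
  ' "$hosts_file"
}

# Succeed (exit status 0) when the given IPv4 address is in 127.0.0.0/8.
is_loopback() {
  case "$1" in
    127.*) return 0 ;;
    *)     return 1 ;;
  esac
}
```

For example, is_loopback "$(host_ip quickstart.cloudera)" && echo "still loopback" would flag a hosts file that PolyBase cannot use from another machine.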

But we aren’t done yet. Like the comment states, Cloudera regenerates /etc/hosts each time you reboot the machine or restart the QuickStart services. To avoid this, we will need to open up /etc/init.d/cloudera-quickstart-init and comment out the line which calls cloudera-quickstart-ip. Then, you manage /etc/hosts yourself. You could also modify the quickstart IP script if you’d still like Cloudera to do the work of updating for you, but because I tend to set static IP addresses for VMs, I’m okay managing it myself.
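
If you prefer to script the edit, here is a rough sketch that comments out any line mentioning cloudera-quickstart-ip in the init script. I am assuming the call sits on a line of its own; the function name disable_quickstart_ip is made up for this example, and you would need to run it as root against the real file.

```shell
# Comment out every line that mentions cloudera-quickstart-ip.
# disable_quickstart_ip is a hypothetical helper; the assumption is that the
# call to cloudera-quickstart-ip sits on its own line in the init script.
disable_quickstart_ip() {
  init_script="${1:-/etc/init.d/cloudera-quickstart-init}"
  # Keep a one-time backup of the untouched script.
  [ -e "$init_script.bak" ] || cp -p "$init_script" "$init_script.bak"
  tmp_file=$(mktemp) || return 1
  # Prefix "# " to matching lines that are not already comments.
  sed '/cloudera-quickstart-ip/ s/^\([[:space:]]*\)\([^#[:space:]]\)/\1# \2/' \
    "$init_script" > "$tmp_file" &&
    cat "$tmp_file" > "$init_script"   # overwrite contents, keep file mode
  rm -f "$tmp_file"
}
```

Running it a second time leaves the file alone, because lines that already start with # are skipped.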

My Data Node Has a First Name, It’s O-S-C-A-R

Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being quickstart.cloudera like the host name, is actually localhost. You can change this in /etc/cloudera-scm-agent/config.ini.

Because PolyBase needs to have direct access to the data nodes, having a node called localhost is a bit of a drag.
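
One way to see what name your data node reports is to filter the output of hdfs dfsadmin -report. This is only a sketch: datanode_hostnames is a helper name I made up, and it assumes the report prints a "Hostname:" line for each data node, which may vary by Hadoop version.

```shell
# Read an "hdfs dfsadmin -report" listing on stdin and print data node names.
# datanode_hostnames is a hypothetical helper; it assumes each data node
# section of the report contains a line of the form "Hostname: <name>".
datanode_hostnames() {
  sed -n 's/^Hostname:[[:space:]]*//p'
}

# Usage (may need to run as the hdfs user):
#   hdfs dfsadmin -report | datanode_hostnames
```

If this prints localhost, a remote SQL Server instance will have trouble reaching the node by that name.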

Conclusion

Today, we looked at a few non-obvious ways in which the Cloudera Distribution of Hadoop QuickStart VM differs from a Hortonworks Data Platform installation. There are other differences as well, but these were a few of the most apparent ones.
