Last time around, we installed Ubuntu and Docker on our Hadoop cluster-to-be.  Now we strike and install Hadoop.


My project of choice for installing a Hadoop cluster using Docker is Weiqing Yang’s caochong.  It’s pretty easy to install, so let’s get started.  I’m going to assume that you have a user account and have pulled caochong into a folder called caochong.

Making Changes

After grabbing the code, I’m going to go into caochong/from-ambari and edit the file.  The reason is that by default, the installation does not forward any ports to the outside world.  That’s fine if you’re testing Hadoop on your own machine (which is, admittedly, the point of this Docker project), but in our case, we’re doing some basic remote development, so I want to expose a series of ports and set privileged=true.


In case you want to follow along at home or my default vim coloring scheme is hard to read, here’s the section I changed:

# launch containers
master_id=$(docker run -d --net caochong -p $PORT:8080 -p 6080:6080 -p 9090:9090 -p 9000:9000 -p 2181:2181 -p 8000:8000 -p 8020:8020 -p 42111:42111 -p 10500:10500 -p 16030:16030 -p 8042:8042 -p 8040:8040 -p 2100:2100 -p 4200:4200 -p 4040:4040 -p 8050:8050 -p 9996:9996 -p 9995:9995 -p 8088:8088 -p 8886:8886 -p 8889:8889 -p 8443:8443 -p 8744:8744 -p 8888:8888 -p 8188:8188 -p 8983:8983 -p 1000:1000 -p 1100:1100 -p 11000:11000 -p 10001:10001 -p 15000:15000 -p 10000:10000 -p 8993:8993 -p 1988:1988 -p 5007:5007 -p 50070:50070 -p 19888:19888 -p 16010:16010 -p 50111:50111 -p 50075:50075 -p 18080:18080 -p 60000:60000 -p 8090:8090 -p 8091:8091 -p 8005:8005 -p 8086:8086 -p 8082:8082 -p 60080:60080 -p 8765:8765 -p 5011:5011 -p 6001:6001 -p 6003:6003 -p 6008:6008 -p 1220:1220 -p 21000:21000 -p 6188:6188 -p 2222:22 -p 50010:50010 -p 6667:6667 -p 3000:3000 --privileged=true --name $NODE_NAME_PREFIX-0 caochong-ambari)
echo ${master_id:0:12} > hosts
for i in $(seq $((N-1)));
do
    container_id=$(docker run -d --net caochong --privileged=true --name $NODE_NAME_PREFIX-$i caochong-ambari)
    echo ${container_id:0:12} >> hosts
done

There’s a bit of downside risk to doing this:  I am forwarding all of these ports from my machine onto the name node’s Docker container, and each host port can be forwarded to only one container.  This means I have to install all of the Hadoop services on the name node, rather than splitting them over the various nodes (which is generally a smarter idea with a real cluster).
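If that long run of -p flags is hard to maintain, one option is to generate it from a list.  The following is only a sketch:  build_port_flags and EXTRA_PORTS are names I made up for illustration and are not part of caochong, and the port list is a small sample rather than the full set above.

```shell
#!/bin/sh
# Hypothetical helper (not part of caochong): turn a list of ports into
# repeated "-p host:container" flags, mapping each port to itself.
build_port_flags() {
    flags=""
    for port in "$@"; do
        flags="$flags -p $port:$port"
    done
    # Strip the leading space before printing.
    printf '%s\n' "${flags# }"
}

# Sample subset of the ports listed above; EXTRA_PORTS is an illustrative name.
EXTRA_PORTS="6080 9090 9000 2181 8020 8088 50070 19888"

# Word splitting on $EXTRA_PORTS is intentional: one argument per port.
PORT_FLAGS=$(build_port_flags $EXTRA_PORTS)
printf '%s\n' "$PORT_FLAGS"
```

The resulting $PORT_FLAGS string could then be dropped, unquoted, into the docker run line for the master container in place of the hand-written flags.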

Anyhow, once that’s done, run ./ --nodes=5 --port=8080 and you’ll get a lot of messages, one of which includes something like the following:


The box shows the five nodes that I’ve created.  Specifically, it gives us the Docker container names.  But which of those is the name node?

With caochong, you can tell which is the name node because it will have the name caochong-ambari-0 by default.  You can get that by running docker ps while the containers are running.
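If you’d rather not scan the docker ps output by eye, here’s a small sketch that picks out the container whose name ends in -0.  The pick_master function is a hypothetical helper of mine, and it assumes you feed it "ID NAME" pairs, one per line.

```shell
#!/bin/sh
# Hypothetical helper: read "ID NAME" pairs on stdin and print the ID of the
# container whose name ends in "-0" (the master under caochong's default naming).
pick_master() {
    awk '$2 ~ /-0$/ { print $1 }'
}

# Usage (needs a running Docker daemon, so it is left commented out here):
#   docker ps --format '{{.ID}} {{.Names}}' | pick_master
```

The --format flag tells docker ps to print only the short container ID and the name, which keeps the awk filter simple.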


Once we know the primary node, we’re good to go.  We’ll need to copy all of those hostnames that get created and keep note of which one’s the name node when we install Hadoop via Ambari.  If you’ve forgotten those names and have closed the install window, don’t fret:  you can get those host names in a file named caochong/from-ambari/hosts.
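Going by the script section above, the master’s ID is written to the hosts file first and the other nodes are appended after it, so the first line should be the name node.  Here’s a sketch which labels the entries on that assumption; label_hosts is a name I made up, and it will be wrong if you’ve reordered the file.

```shell
#!/bin/sh
# Hypothetical helper: label the first entry of a hosts file as the master
# and every later entry as a worker. Assumes the file order matches the
# order in which the launch script wrote the container IDs.
label_hosts() {
    awk 'NR == 1 { print "master", $0; next } { print "worker", $0 }' "$1"
}

# Usage (path taken from the post; run it from the directory above caochong):
#   label_hosts caochong/from-ambari/hosts
```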

Installing Hadoop Via Ambari

Once this is set up, we can connect to Ambari to finish installation.  We have port forwarding set up, so from a laptop or other device, you can connect via web browser to port 8080 on your NUC device’s IP address and you’ll get a screen which looks like this:


We don’t have a cluster yet, so we’ll need to click the “Launch Install Wizard” button.  This will prompt us for a cluster name:


The next step is to figure out which version of the Hortonworks Data Platform we want to install:


The underlying Linux installation is Ubuntu 14, so we’ll select that box.  Note that I tried to trick the installer into installing HDP 2.5 by putting in the public repo information for 2.5, but it ended up still installing 2.4.  There might be some trick that I missed that gets it to work, though.

After selecting the repo, you get to list the nodes.  This is the set that you’ll copy and paste from the hosts list:


You can safely ignore any warnings you get about not using fully-qualified domain names; within the Docker virtual network that caochong sets up, these are all accessible names.  After entering the hosts, you’ll want to copy and paste the SSH private key which gets generated.  That’s in caochong/from-ambari/id_rsa.  Copy the contents of that file into the SSH private key box and you can register and confirm the nodes.
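Before pasting, it can be worth a quick check that you grabbed the private key and not the public one.  This is just a sketch:  looks_like_private_key is a hypothetical helper, and it only inspects the first line of the file for a PEM-style private key header.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the first line of the file looks like a
# PEM private key header such as "-----BEGIN RSA PRIVATE KEY-----".
looks_like_private_key() {
    head -n 1 "$1" | grep -Eq '^-----BEGIN ([A-Z]+ )?PRIVATE KEY-----$'
}

# Usage (path taken from the post):
#   looks_like_private_key caochong/from-ambari/id_rsa && cat caochong/from-ambari/id_rsa
```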


This can take a couple minutes but eventually all of the bars should go green and you’ll be able to click the Next button to go to the next page, where you get to select the services you want to install on this cluster.


I selected pretty much all of the services, although if you’re testing production clusters, you might want to match whatever’s in your production cluster.  Going to the next step, you get the chance to set master nodes for the various services.


Notice how the host starting with 3bd (my name node) shows up for pretty much all of these services.  This is not what you’d want to do in a real production environment, but because we want to use Docker and easily pass ports through, it’s the simplest way for me to set this up.  If you knew beforehand which node would host which service, you could modify the shell script that we discussed earlier and open those specific ports on the right containers.

After assigning masters, we next have to define which nodes are slaves and clients for which services.


We want each node to be a data node, and from there, I spread out the load across the five.  That way, if one node goes down (e.g., if I’m testing a secondary node failure), there’s at least one other node which can take up the slack.

Once we’ve assigned nodes, it’s time to modify any configurations.  The installer is pretty helpful about what you need to modify, specifically passwords.


Each one of the numbered notes above is a password or secret key which needs to be set.  Fill those out and you can go to the next step, which is reviewing all of your changes.


Assuming you don’t want to make any changes, hit the Deploy button and you’ll get to watch an installer.


This installer will take quite some time.  I didn’t clock installations, but I wouldn’t be shocked if it took 45 minutes or so for everything to install.  But once you’re finished, you officially have a Hadoop cluster of your own.


In this third part of the Hadoop on the Go miniseries, we created a five-node cluster.  There are a couple more blog posts around administration that I want to get to, particularly around rebooting the cluster and quickly rebuilding the cluster (something I think will come to me as I become more familiar with Docker).

