In part 1 of this series, we bought some hardware. After patiently(?) waiting for it, the hardware has arrived and Ubuntu is installed, so let’s keep going.
Docker? I hardly even know her!
Hadoop on Docker is a relatively new thing. Thanks to Randy Gelhausen’s work, the Hortonworks Data Platform 2.5 sandbox now uses Docker. That has negative consequences for those of us who want to use Polybase with the sandbox, but for this series, I’m going to forgo the benefits of Polybase to get a functional 5-node cluster on the go.
Before I go on, a couple notes about Docker and Hadoop. First, why would I want to do this? Let’s think about the various ways we can install Hadoop:
- As a one-node, standalone server. This is probably the worst scenario because it’s the least realistic. The benefit of Hadoop is its linear scalability, so you’re never going to run a production system with a single, standalone node.
- As a one-node cluster. This is what the various sandboxes do, and they’re useful for getting your feet wet. Because they’re configured the same way a multi-node cluster would be, you get to think about things like service placement and can write code the way you would in production. Sandboxes are fine, but they’re not a great way of showing off Hadoop.
- As multiple VMs using VirtualBox or VMware. This works, but virtual machines burn through resources. Even with 32 GB of RAM, three or four VMs will burn 8-12 GB of RAM just for the guest operating systems, and a five-node cluster needs roughly 5x the disk because every VM carries its own copy of the OS.
- As multiple containerized nodes. In other words, put together several nodes in Docker. Each node has a much lower resource overhead than a virtual machine, so I can easily fit 5 nodes. I also have one copy of the Ubuntu image and small marginal disk additions for the five nodes.
With these in mind, and because I want to show off Hadoop as Hadoop and not a glorified single-node system, I’m going to use Docker.
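To put rough numbers on that comparison, here is a back-of-the-envelope sketch in Python. The 32 GB host comes from this post, and 3 GB of RAM per guest OS is simply what the 8-12 GB range above implies for four VMs. The disk figures (10 GB per OS install, 0.5 GB per container layer) are assumptions picked for illustration, not measurements from my machine.

```python
from typing import Tuple

HOST_RAM_GB = 32  # the host described in this series


def vm_os_overhead(nodes: int, ram_per_os_gb: float,
                   disk_per_os_gb: float) -> Tuple[float, float]:
    """Every VM carries a full guest OS, so RAM and disk both scale with node count."""
    return nodes * ram_per_os_gb, nodes * disk_per_os_gb


def container_disk(nodes: int, base_image_gb: float,
                   per_node_layer_gb: float) -> float:
    """Containers share one base image; each node adds only a thin writable layer."""
    return base_image_gb + nodes * per_node_layer_gb


if __name__ == "__main__":
    # Assumed sizes: 3 GB RAM and 10 GB disk per guest OS, 0.5 GB per container layer.
    ram, disk = vm_os_overhead(nodes=4, ram_per_os_gb=3.0, disk_per_os_gb=10.0)
    print(f"4 VMs: {ram:.0f} GB RAM for guest OSes, "
          f"{HOST_RAM_GB - ram:.0f} GB left for Hadoop, {disk:.0f} GB disk")
    print(f"5 containers: {container_disk(5, 10.0, 0.5):.1f} GB disk")
```

Plug in your own sizes; the point is only that VM overhead grows with every node, while the container base image is paid for once.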
After setting up Ubuntu, the first step is to install Docker. It’s really easy on Ubuntu and probably takes about 10 minutes. If you want to learn more about Docker, Pluralsight has a Container Management path which is extremely helpful.
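Once the install finishes, a quick sanity check is to confirm that the `docker` command is on your PATH. The snippet below is a minimal sketch of one way to do that from Python; the helper name `docker_version` is made up for this example, and it only confirms that the CLI responds, not that the Docker daemon is running.

```python
import shutil
import subprocess
from typing import Optional


def docker_version() -> Optional[str]:
    """Return the output of `docker --version`, or None if the CLI is not on PATH."""
    docker = shutil.which("docker")
    if docker is None:
        return None
    result = subprocess.run(
        [docker, "--version"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit just yields empty output, not an exception
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = docker_version()
    print(version if version else "Docker CLI not found on PATH")
```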
Setting Up A Cluster
There are a few guides that you can follow for setting up a multi-node Hadoop cluster with Docker. Here are a few I didn’t follow for my final setup, but which you might find helpful:
- Renjith Rajan sets up a five-node cluster in AWS.
- Henning Kropp shows how to build a Dockerfile and run it from a CentOS image.
- Kiwen Lau builds a Hadoop cluster (NOT HDP) with Docker.
I used Kiwen Lau’s blog post first to understand how to do it, but I wanted to put together a Hortonworks Data Platform installation instead. In Friday’s post, I’ll show you the project I used to set up an HDP cluster in Docker.