This is a review of Thomas Henson’s Pluralsight course entitled Getting Started with Hortonworks Data Platform.
This course felt a little light. Its primary focus is using Ambari to install elements of the Hortonworks Data Platform from scratch. Having done this a few times, though only for local environments, I found it useful to compare Thomas’s work against what I’ve done.
But the installation portion of the course leaves out a lot of pain that I’ve had to endure, in part due to admittedly odd circumstances. I have had to build everything offline so that I can take my Hadoop cluster on the go for presentations without needing Internet access; otherwise, when services start up, they try to phone home to update. There’s probably something that I’ve missed in my research which fixes this, but I haven’t found it yet.
I don’t fault Thomas for not showing local repository installation, but there are some things I would have liked to see covered in more depth. I’ve run into plenty of weird errors and warnings which lead to installation failure, and we see almost none of that in the course. We only get the best-case scenario, and I’m not sure how common that scenario is. We also don’t get any detail on Kerberos or cluster security other than setting passwords. I know Kerberos is a topic in his follow-up course, so it might have been out of scope. Still, if I imagine myself as a sysadmin new to Hadoop, that’s something I’d really want to know more about.
On the plus side, there were some things I learned during installation, like how much of a resource hog Accumulo can be.
The second half of the course covered administration topics: rack awareness, rebalancing data, configuration, and alerting within Ambari and at the command line. These are good themes and I liked the coverage of rack awareness in particular. Configuration was a bit on the short side, though: we saw how to alter configurations, but almost nothing on what to configure. There’s a huge surface area here, but again, if I’m a junior sysadmin or someone new to Hadoop, one of my first configuration questions is, “What do I need to tune out of the box and how do I understand the consequences of changing settings?” Even picking one or two common services like Hive, HDFS, or Spark and going through some of the important settings would help a lot, especially if the logic applies to other services too.
Overall, this course feels too short. I don’t want some 8-hour biopic that I’ll never finish, but at 2 hours, it gives me just enough to get stuck someplace. Thomas does a good job presenting and the material is clear, but a judicious 30 minutes of extra content would have made this that much better. Perhaps I’ll be more content after seeing his follow-up course.