This is a review for Frank Kane’s Udemy course entitled Apache Spark 2 with Scala — Hands On with Big Data!
All in all, this was a fine course. Frank aims the course at people new to Spark. In my case, I have experience but wanted to start at the beginning to fill in the gaps in my self-education. To that end, the course covers the basics of Spark, starting with installation and a primer on Scala. From there, we get to see Spark in action, starting with the “Spark 1.0” concept of Resilient Distributed Datasets and the functional approach to problem-solving, and then eventually covering “Spark 2.0” with DataFrames and Spark SQL. Frank also covers interesting features like broadcast variables and accumulators, things I didn’t have any experience with when I started the course.
Frank uses a couple of interesting examples in his lessons: a network graph of Marvel character interactions in comics and the MovieLens data set. These data sets are large enough that processing is not trivial but not so large that you’d need a 20-node cluster to complete the task today. The examples were also fun, which made it easier to follow along.
The one area where I think this course wasn’t that great was in the “what’s next” sections, starting with MLlib and including Spark Streaming and GraphX. Frank has a Spark Streaming course which is on my to-watch list, so I can give that part a pass. With MLlib, it seemed that “This barely works” was the subtext, with weird results coming out of the tests. And the GraphX section just touched on the topic, though I get the feeling that GraphX will not be long for this world.
In short, if you want to learn the basics of Spark, this is a great introductory course. I’ve recommended it to several people looking to learn the product and I think Frank keeps the content at the right level for new people.