We are going to combine three of my interests today: Spark, data analysis, and notebooks. Unlike last week’s discussion of notebooks, I’m going to use Apache Zeppelin this time. In this post, we will load some restaurant data into Spark and then build a notebook to visualize some of that data.
If you have not done so, follow through my Getting Started with Spark guide. This will get restaurant data loaded into Spark. I’m going to assume that you’ve created a cluster, loaded the restaurant table (but make sure you have the version with no NAs!), and are ready to create a notebook.
Create A Notebook
Our first step will be to create an R notebook.
My goal is to do some of the things that I did in my Touching on Advanced Topics post. Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below. As a result, I was only able to do some—but not all—of the anticipated work. I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.
With that in mind, let’s start messing around.
I think it might be more useful to start with the various roadblocks I hit while working on this blog post. SparkR is a different paradigm, and for someone with a tenuous hold on R, it proved a lot harder to pick up than I expected. Here are a few of the things I struggled with, in the hopes that the solutions will save you a bit of time in the future.
If you install all of the packages from the Jupyter notebook, you’ll find that dplyr has a sql function which masks the SparkR sql function and will leave you in a bad place if you don’t know what’s going on. If you get stuck in this bad place, you can qualify the function call with its namespace; for Spark, the function is SparkR::sql.
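As a quick sketch of the masking problem (assuming a Spark 1.x sqlContext is already in scope, as in the query below):

```r
library(SparkR)
library(dplyr)  # dplyr attaches second, so its sql() now masks SparkR's sql()

# This resolves to dplyr::sql, which just tags a string as SQL and
# knows nothing about a SQLContext -- hence the confusing errors:
# restaurantsDF <- sql(sqlContext, "SELECT * FROM restaurants")

# Qualifying the namespace calls the Spark version explicitly:
restaurantsDF <- SparkR::sql(sqlContext, "SELECT * FROM restaurants")
```

The same trick works for any masked function; `search()` will show you the attach order that determines which package wins.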
One of my first struggles was that certain operations, such as mean, return data frames rather than vectors. I was able to use Wendy Yu’s great SparkR tutorial to figure out how to get the mean latitude and longitude from my data set.
Lack of Results
Another major problem I had was that Java is much stricter about data types than R. When I tried to perform an operation against my restaurants table (in which score is defined as a double), Java would fail with an exception saying that NA is not a valid numeric value. No matter how I wrote my SQL statement, I was unable to ignore that NA data. I even switched over to Python, thinking it was a SparkR issue, but it turned out to be part of Spark itself. I’m sure there’s some flag somewhere that I’m not setting which would let me ignore those bad values, but I haven’t found it yet.
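One workaround worth trying, if you can’t load a pre-cleaned version of the table: keep score as a string at load time and only cast the rows that aren’t the literal "NA". This is a hypothetical sketch; restaurants_raw is an assumed name for a string-typed staging table, not something from the guide.

```r
# Hypothetical: restaurants_raw stores score as a string, so rows containing
# the literal 'NA' can be filtered out before the cast to double ever happens.
cleanDF <- SparkR::sql(sqlContext, "
  SELECT lat, long, CAST(score AS double) AS score
  FROM restaurants_raw
  WHERE score <> 'NA'")
```

Because the cast happens after the WHERE clause removes the bad rows, Java never sees an NA it has to parse as a number.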
Data Frames Aren’t DataFrames
Another issue I ran into was when I started using ggplot2. When I tried to hook up my Spark data frame (which is of type DataFrame), ggplot2 spat back the following error:
Error : ggplot2 doesn't know how to deal with data of class DataFrame
Fortunately, Jose Dianes has an article which discusses this exact problem. You have to collect the distributed DataFrame before ggplot2 can understand what it is.
Okay, we’ve spent enough time complaining about my failings; now let’s actually show what we can do. Unlike the Advanced Topics notebook, I’m going to keep this fairly compact and not meander quite as much as I did there (the meandering was for pedagogical purposes, I promise).
Step one is to load the restaurant data:
restaurantsDF <- sql(sqlContext, "SELECT COUNT(1) AS numberOfResults, AVG(score) AS meanscore, ROUND(lat, 1) AS lat, ROUND(long, 1) AS long
  FROM restaurants
  GROUP BY ROUND(lat, 1), ROUND(long, 1)")
Note that in this version, I’m going to specify a narrowed-down query with the relevant attributes and filters I want. I’ve found this to be preferable to the classic R approach of loading all the data with read.csv and importing dplyr and other tools to filter the data after the fact, particularly because of differences in data frame operations, as we’ll see below.
Step two is to install relevant packages:
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("ggmap", repos = "http://cran.us.r-project.org")
library(ggplot2)
library(ggmap)
print('Finished loading libraries.')
I only need ggplot2 and ggmap because I did my transformations earlier.
Step three is where I calculate the mean latitude and longitude. These values are technically off slightly because I’m taking the mean of the pre-aggregated lat-long combos rather than the mean of the underlying population, but the difference is too slight to be appreciable.
r.lat <- collect(agg(restaurantsDF, mean=mean(restaurantsDF$lat)))$mean
r.long <- collect(agg(restaurantsDF, mean=mean(restaurantsDF$long)))$mean
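If the difference ever did matter, you could weight each rounded lat-long cell by its numberOfResults to recover the population mean. A rough sketch, untested against a live cluster:

```r
# Weight each pre-aggregated cell by its row count to approximate the
# population mean latitude (the same pattern works for longitude).
weighted <- collect(agg(restaurantsDF,
  wlat = sum(restaurantsDF$lat * restaurantsDF$numberOfResults),
  n = sum(restaurantsDF$numberOfResults)))
r.lat.weighted <- weighted$wlat / weighted$n
```

For centering a zoomed-out map, though, the simple mean above is more than good enough.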
With those calculated and collected, I’m able to pass them in as parameters to ggmap’s get_map function, so we can build a Google map.
# Set wakemap to work around the mean latitude and longitude values we calculated above.
# We've set the zoom value to 10 so you can see the entire region.
wakemap <- get_map(location = c(lon = r.long, lat = r.lat), zoom = 10, maptype = "roadmap", scale = 2)
Once I have those, I need to convert my Spark DataFrame class object to a local R data frame. I do that by calling the collect method. Because I only need three of the four attributes, I’ll show off the select function as well.
restgrp <- collect(select(restaurantsDF, "meanscore", "lat", "long"))
This data frame is something ggplot2 can use. So let’s run ggmap and fill in results:
# Now we can fill in ggmap with settings. We want three function calls:
#   scale_fill_gradient (to give us a visual cue of restaurants; pick good colors for low & high)
#   geom_text (to display the meanscore, giving us precise values; round meanscore to 1 spot after the decimal)
#   geom_tile (to display blocks of color; set alpha = 1)
ggmap(wakemap) +
  scale_fill_gradient(low = "black", high = "orange") +
  geom_text(data = restgrp, aes(x = long, y = lat, fill = meanscore, label = round(meanscore, 1))) +
  geom_tile(data = restgrp, aes(x = long, y = lat, alpha = 1, fill = meanscore))
The end result looks a bit like this:
If you want to get this as a notebook, I’ve made the R script available here. You can publish Databricks notebooks, but they’re only public for six months, and I don’t want a broken link a few months from now.
Why Use Zeppelin?
Throughout all of this, I didn’t even mention one of the big advantages of using Zeppelin: the built-in visualization components. Zeppelin has made it easier for third parties to display graphics, so visualizing a data set is as simple as running the display command. For example, here’s a nice-looking histogram of mean restaurant scores by lat-long pair:
That’s a one-line operation, and you can change the plot options to perform other visualizations as well. This type of visualization is production-worthy, so I give the Databricks team plaudits for making this available to us in such an easy-to-use format.
To be honest, I wanted this to be a blog post describing how easy the conversion from Jupyter and local R to Zeppelin and SparkR would be. In reality, there’s a gap between the two. There are some good articles to help you bridge that gap (I mentioned Wendy Yu’s and Jose Dianes’s), but this certainly is not a copy-paste exercise; there’s a lot more going on here. Nonetheless, I recommend getting a handle on SparkR, as the ability to perform data analysis on distributed compute is powerful, especially if you can’t get your boss to approve that desktop with 64 GB of RAM…