If you’re interested in learning more about Apache Spark, there are a couple of options available to you. About a month ago, I covered installing Spark on a Windows machine. You can also use the Hortonworks Data Platform and run Spark from it. Today’s option, though, does not require a powerful machine; instead, it uses a web application to control a Spark cluster.
Databricks, the primary company behind Spark, has introduced Databricks Community Edition. This post will show you how to create a cluster, load a table from CSV, and query that table using a Python-based notebook.
Create A Cluster
Select the “New User” option and sign up for the Community Edition; this lets you spin up free Spark clusters with 6 GB of RAM and run notebooks against them. Once you have your account set up, go to the main page and click the Clusters button.
Then, click the Create Cluster button:
You’ll need to name your cluster and select a version of Spark. By default, I’d go with 1.6 (the current latest version in production), but you could try the 2.0 preview if you’re feeling cheeky.
There are a couple of notes with these clusters:
- These are not powerful clusters. Don’t expect to crunch huge data sets with them. Notice that the cluster has only 6 GB of RAM, so you can expect to get maybe a few GB of data max.
- The cluster will automatically terminate after one hour without activity. The paid version does not have this limitation.
- You interact with the cluster using notebooks rather than opening a command prompt. In practice, this makes interacting with the cluster a little more difficult, as a good command prompt can provide features such as auto-complete.
Once you click the button to create this cluster, it starts spinning up:
If you’re used to spinning up an ElasticMapReduce or HDInsight cluster, this will be amazingly fast; it took less time for this cluster to become available than it took for me to take the screenshot. Once the cluster is in place, the management screen looks more like:
Create A Table
So now let’s load some data. I’m going to go to the Tables section and pick a file:
I’m going to import a set of restaurant data for the Wake County, North Carolina area.
I then clicked the Preview Table button, which let me fix the headers and prep this table. I also had to give the table a name, so I chose Restaurants.
After clicking Create Table, the table gets created and looks a bit like this:
With that, we have a populated table, and our cluster will be able to access it. You might also want to change score, lat, and long to be doubles rather than floats to prevent floating point annoyances later, but they’ll work fine for this purpose.
You might be wondering, why aren’t we creating the table inside a cluster? The reason is fundamental to the Hadoop and Spark paradigm: compute and storage are separate entities. Clusters handle computation: they accept data and perform actions on that data. On the other side, storage should be able to live beyond a cluster’s lifetime. Even if there is no cluster running, we should still be able to access our data.
We could certainly write Spark code to read in a local file and work with it, but after the cluster shuts down, we might lose that data, so this option is really only helpful for dealing with ephemeral data.
Tying Everything Together
Notebooks allow us to link together compute (clusters) and storage (tables). To create a notebook, hit the Workspace button. Then, click on your user account and you’ll see a list of objects. Right-click on the empty space and go to Create -> Notebook.
Once you do that, you’ll get the option of selecting one of four notebook styles:
Python is probably my least-favorite of the four options, and for that reason I’m going to pick it here—it’s a smart idea to be able to work with any of these languages, as you might end up in an environment in which everybody uses a different language than you’d prefer. I’ve called my notebook test-library.
Run Some Code
Now that we’ve named and created a notebook, it’s time to start running some code. The code sample I’m going to use is trivial:
df = sqlContext.sql("SELECT * FROM Restaurants LIMIT 5")
df.take(5)
Paste that into the code block and press Shift + Enter to run the code snippet. You’ll quickly get a result set that looks like:
And with that, congratulations! You’ve created a Spark cluster, uploaded a file, created a table, accessed that table using SQL, and performed an operation on it using Python. There’s a lot more to Spark, but this is a good start.
As I mentioned, the Community Edition automatically terminates clusters after one hour of inactivity. If you want to stop the cluster yourself, go back to the Clusters section and click the X next to your active cluster.
This will allow you to shut down the cluster. Termination seems to take longer than cluster creation, but once it’s done, you can see the cluster show up in the Terminated Clusters section below the Active Clusters list.
I’m not sure how long terminated clusters show up in that list, but I do know that they go away within a few hours.
But what doesn’t go away are our tables and notebooks. Our restaurant data is safe and sound, and our notebook which operates against that restaurant data will live on as well. This means that if we want to do something else with restaurant data, it’s as easy as re-creating a cluster and picking back up where we left off. We will need to re-load the data in the cluster, but you can leave that as the top step in a notebook.
The Databricks Community Edition product is a great addition to the Spark ecosystem. With it, I can learn the basics of Spark from any machine, no matter how underpowered it is. It also helps separate development from installation and administration, meaning that I can get started on learning the mechanics of Spark development without needing to set up and configure a cluster first. Not bad for a free product.
EDIT (2016-07-18): I have changed the Restaurants data file to a version without NA values for scores, to make it easier to analyze in Spark. Spark really does not like NA scores when you declare score a float or double.