This blog post is in anticipation of SQL Saturday Columbus, in which I’m going to give a talk introducing R to SQL Server developers. My primary vehicle for explaining R will be notebooks.
I’ll start with two big questions: what are notebooks and why should we use them?
Remember chemistry class in high school or college? You might remember having to keep a lab notebook for your experiments. The purpose of this notebook was two-fold: first, so you could remember what you did and why you did each step; second, so others could repeat what you did. A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”
Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here. You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.
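To make that workflow concrete, here is a minimal sketch of the four steps in R. It assumes the built-in airquality data set as a stand-in for "a data set," and the choice of columns and model is purely illustrative, not something taken from the talk.

```r
# Assumption: airquality (ships with R) stands in for your own data set.
data(airquality)

# Cleansing / pruning: drop every row that has at least one missing value.
clean <- na.omit(airquality)

# Descriptive statistics for a few columns of interest.
summary(clean[, c("Ozone", "Temp", "Wind")])

# Apply a model: ozone as a linear function of temperature and wind.
# (Hypothetical model choice, only to illustrate the "apply models" step.)
model <- lm(Ozone ~ Temp + Wind, data = clean)
coef(model)
```

In a notebook, each of these steps would typically live in its own cell, with a Markdown cell above it explaining why the step was taken.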
Why Should We Use Them?
There are two separate reasons why we should use notebooks. The first ties back to the reason you used them in chem lab: the potential for independent verification. If I want to publish a study—even if it’s just an internal analysis of company metrics—I want another analyst to be able to follow what I did and ensure that I made correct choices.
Aside from independent validation, there is a second big reason to use notebooks: cross-language compatibility. If your organization only uses R to perform data analysis, a notebook might not be necessary; you can get by with well-documented and thorough R scripts. But as soon as somebody wants to write Python or Julia code, the scripts-only approach starts to get tricky. Notebooks help you organize analyses and write code across a number of languages.
In addition to multi-lingual capabilities, notebooks make documentation easier. The major notebooks support Markdown, a lightweight plain-text syntax for formatting text. This allows you to create pretty and thorough documentation. I recently went through an edX course on Spark, and their labs were amazing.
It’s hard to do justice to their labs in a single screenshot because of how thorough they were, but you can get the gist of what Markdown can do here: separate sections, headings, images, code snippets, and a large number of other features without having to write a mess of HTML.
Which Notebook To Use?
There are two major notebooks out there today: Zeppelin and Jupyter. I’ve used both and like both. Hoc Q. Phan has a good comparison of the two, and I’d say it’s fair: they’re both pretty close in terms of quality, and your particular use case will probably determine which you choose.
Working With Jupyter
Running Jupyter is easy: once you’ve installed Jupyter, open up a command prompt and type in “jupyter notebook”. That dumps me in my home directory on Windows, so I’m going with that and storing my notebooks in Documents\Notebooks.
If you want to create a new notebook, it’s as easy as hitting the New button:
Running An Existing Notebook
In my case, I’m going to open the Basic R Test notebook, which you can download for yourself if you’d like. Clicking on a notebook brings it up in a new tab:
My notebooks are nowhere near as beautiful as the Spark course notebooks, but they do get the point across at least. If you want to modify existing Markdown, double-click on the section of text you’d like to edit:
This changes the pretty text into a Markdown editor block. When you’re done making changes, press Shift+Enter to run the block. Running Markdown blocks simply makes them display prettified HTML and moves the focus to the next block. The next block is a code block.
The code here is extremely simple: we’re going to create a vector with one element (an integer whose value is 500) and assign it to y. Then, we’re going to print the contents of y. To run the block, again, press Shift+Enter or hit the Play button.
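For readers without the notebook open, a cell along these lines would do what is described. This is a sketch, not necessarily the exact code in the Basic R Test notebook; in particular, the L suffix is an assumption made so that the value really is stored as an integer.

```r
# Create a one-element vector holding 500 and assign it to y.
# The L suffix makes it an integer; a bare 500 would be stored as a double.
y <- 500L

# Print the contents of y.
print(y)
```

Running the cell prints `[1] 500`, where the `[1]` is R's index marker for the first element of the vector.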
The results of the code block appear on-screen. One of the nice things about notebooks is that any R output, whether printed text, a table, or a plot, shows up directly below the cell that produced it. To show you what I mean, I’m going to select the Run All option from the Cell menu.
This will run all of the cells again, starting from the top. I could select “Run All Below” and skip re-assigning y, but in this case, it doesn’t hurt anything. After running everything, I can scroll down and see this section of results:
There are two separate commands here. First, I’m running a head command to grab the first 3 elements from the set. Second, I’m running a plot command to plot two variables. In neither of these cases do I spend any time on formatting; the notebook does all of this for me. That’s a major time savings.
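A rough equivalent of those two commands is sketched below. The data set in the screenshot is not reproduced here, so this assumes the built-in mtcars data set, and the two plotted variables are an illustrative choice.

```r
# Assumption: mtcars (ships with R) stands in for the notebook's data set.
# Grab the first 3 rows; in a notebook this renders as a formatted table.
first_rows <- head(mtcars, 3)
first_rows

# Scatter plot of two variables: car weight against miles per gallon.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")
```

In Jupyter, the table and the plot both appear inline under the cell with no extra formatting code.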
Inserting New Cells
Inserting new cells is easy. If you want to insert a new cell at the bottom of a notebook, start typing in the cell at the bottom of the page.
You can define what type of cell it is with the cell type menu:
In this case, I’ve selected Markdown, so this will be a commentary cell. I fill in details and, when I’m done, hit Shift+Enter to run the cell. But let’s say that you want to insert something in between two already-existing cells. That’s easy too: the Insert menu gives you two options, Insert Cell Above and Insert Cell Below.
Notebooks are powerful tools for data analytics. I wouldn’t use a notebook as my primary development environment, though. For R, I’d write the code in RStudio and, once my process is complete, transfer it to a notebook, add nice Markdown, and make it generally understandable.
The rest of this week’s R posts will use notebooks to walk through concepts, as I think that this is easier to understand than giving somebody a big script and embedding comments (or worse, not even including comments).