R: The Basics Of Notebooks

This blog post is in anticipation of SQL Saturday Columbus, in which I’m going to give a talk introducing R to SQL Server developers.  My primary vehicle for explaining R will be notebooks.

Notebook Basics

I’ll start with two big questions:  what are notebooks and why should we use them?

Notebooks Are…

Remember chemistry class in high school or college?  You might remember having to keep a lab notebook for your experiments.  The purpose of this notebook was two-fold:  first, so you could remember what you did and why you did each step; second, so others could repeat what you did.  A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”

Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here.  You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.
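To make that workflow concrete, here is a minimal sketch of the same steps as they might appear in one notebook code cell. It assumes the airquality data set that ships with base R as a stand-in (it is not the data from my talk), and the choice of a simple linear regression is purely illustrative.

```r
# Assumption: airquality is a stand-in data set from base R's datasets package.
data(airquality)

# Cleansing / pruning: drop every row that has at least one missing value.
clean <- na.omit(airquality)

# Descriptive statistics for each column of the cleaned data.
print(summary(clean))

# Apply a model: an illustrative linear regression of ozone on temperature.
fit <- lm(Ozone ~ Temp, data = clean)
print(summary(fit))
```

In a notebook, each of these steps would typically sit in its own cell with a Markdown cell above it explaining why the step was taken.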

Why Should We Use Them?

There are two separate reasons why we should use notebooks.  The first ties back to the reason you used them in chem lab:  the potential for independent verification.  If I want to publish a study—even if it’s just an internal analysis of company metrics—I want another analyst to be able to follow what I did and ensure that I made correct choices.

Aside from independent validation, there is a second big reason to use notebooks:  cross-language compatibility.  If your organization only uses R to perform data analysis, a notebook might not be necessary; you can get by with well-documented and thorough R scripts.  But as soon as somebody wants to write Python or Julia code, the classic technique starts to get tricky.  Notebooks help you organize analyses and write code across a number of languages.

In addition to multi-lingual capabilities, notebooks make documentation easier.  The major notebooks support Markdown, an easy way of formatting text.  This allows you to create pretty and thorough documentation.  I recently went through an edX course on Spark, and their labs were amazing.

MarkdownExample

It’s hard to do justice to their labs in a single screenshot because of how thorough they were, but you can get the gist of what Markdown can do here:  separate sections, headings, images, code snippets, and a large number of other features without having to write a mess of HTML.

Which Notebook To Use?

There are two major notebooks out there today:  Zeppelin and Jupyter.  I’ve used both and like both.  Hoc Q. Phan has a good comparison of the two, and I’d say it’s fair:  they’re both pretty close in terms of quality, and your particular use case will probably determine which you choose.

For the purpose of my talk, I’m going to use Jupyter, as that’s something my attendees can install pretty easily.  I also have blog posts on installing it for Windows and Linux.

Working With Jupyter

Running Jupyter is easy:  once you’ve installed it, open up a command prompt and type in “jupyter notebook”.  On Windows, that starts me in my home directory, so I’m going with that and storing my notebooks in Documents\Notebooks.

Notebooks

If you want to create a new notebook, it’s as easy as hitting the New button:

NewNotebook

Running An Existing Notebook

In my case, I’m going to open the Basic R Test notebook, which you can download for yourself if you’d like.  Clicking on a notebook brings it up in a new tab:

BasicRTest

My notebooks are nowhere near as beautiful as the Spark course notebooks, but they do get the point across at least.  If you want to modify existing Markdown, double-click on the section of text you’d like to edit:

EditMarkdown

This changes the pretty text into a Markdown editor block.  When you’re done making changes, press Shift+Enter to run the block.  Running Markdown blocks simply makes them display prettified HTML and moves the focus to the next block.  The next block is a code block.

CodeBlock

The code here is extremely simple:  we’re going to create a vector with one element (an integer whose value is 500) and assign it to y.  Then, we’re going to print the contents of y.  To run the block, press Shift+Enter again or hit the Play button.
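For readers who can't see the screenshot, a cell like that might look roughly like the sketch below. This is a reconstruction from the description, not a copy of the notebook's code. One detail worth knowing: in R, a bare 500 is stored as a numeric (double) value, and you would write 500L if you specifically wanted the integer type.

```r
# Assign a one-element vector holding the value 500 to y.
y <- 500

# Print the contents of y.
print(y)
```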

RunCodeBlock

The results of the code block appear on-screen.  One of the nice things about notebooks is that you can include any R outputs as part of the output set.  To show you what I mean, I’m going to select the Run All option from the Cell menu.

RunAll

This will run all of the cells again, starting from the top.  I could select “Run All Below” and skip re-assigning y, but in this case, it doesn’t hurt anything.  After running everything, I can scroll down and see this section of results:

RunAllResults

There are two separate commands here.  First, I’m running a head command to grab the first 3 elements from the set.  Second, I’m running a plot command to plot two variables.  In neither of these cases do I spend any time on formatting; the notebook does all of this for me.  That’s a major time savings.
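As a rough sketch of what those two commands can look like, the cell below uses mtcars, a data set built into R. That data set and the two columns chosen are assumptions for illustration; the notebook in the screenshot may use different data.

```r
# Assumption: mtcars (built into R) stands in for the notebook's data set.

# Show the first 3 rows of the data set.
print(head(mtcars, 3))

# Scatter plot of two variables: car weight against miles per gallon.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")
```

In Jupyter, the table and the plot both render directly beneath the cell with no extra formatting code.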

Inserting New Cells

Inserting new cells is easy.  If you want to insert a new cell at the bottom of a notebook, start typing in the cell at the bottom of the page.

NewCell

You can define what type of cell it is with the cell type menu:

CellTypeMenu

In this case, I’ve selected Markdown, so this will be a commentary cell.  I fill in details and when I’m done, hit Shift+Enter to run the cell.  But let’s say that you want to insert something in between two already-existing cells.  That’s easy too:  the Insert menu gives you two options, Insert Cell Above and Insert Cell Below.

Conclusion

Notebooks are powerful tools for data analytics.  I wouldn’t use a notebook as my primary development environment, though.  For R, I’d write the code in RStudio and, once my process is complete, transfer it to a notebook, add nice Markdown, and make it generally understandable.

The rest of this week’s R posts will use notebooks to walk through concepts, as I think that this is easier to understand than giving somebody a big script and embedding comments (or worse, not even including comments).
