This will be the first in a long-running series about R.  As I learn more about the language and apply it in new areas, I’ll add entries to the series.

Today, I want to talk about R from the perspective of someone familiar with SQL Server.  There’s a pretty strong overlap between good T-SQL knowledge and a basic understanding of R.  R is a functional programming language (another article here), and functional languages such as R or F# share a set-oriented mindset with SQL.  At the very least, a language like SQL helps you get out of the procedural mindset that the C-based languages put you in.

The other advantage of understanding SQL is that it lets you intuit R’s core data type:  the data frame.  Data frames are collections of variables, similar to how result sets in SQL are collections of attributes.  Both can be envisioned as tables with rows and columns, and operating on the entire set at once is vital to understanding either R or SQL.
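To make the analogy concrete, here is a small sketch.  The table and its values are invented for illustration, but a data frame behaves much like a result set, and aggregate() plays the role of GROUP BY:

```r
# A data frame is a set of named columns, much like a SQL result set.
# These officers and their values are made up for illustration.
officers <- data.frame(
  name    = c("Smith", "Jones", "Nguyen"),
  beat    = c(101, 102, 101),
  arrests = c(12, 7, 9)
)

# Operating on whole columns at once is the R analogue of a set-based query:
# SELECT beat, SUM(arrests) FROM officers GROUP BY beat;
by_beat <- aggregate(arrests ~ beat, data = officers, FUN = sum)
by_beat
#   beat arrests
# 1  101      21
# 2  102       7
```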

Installing R

If you want to get R, go to your CRAN mirror of choice.  My preferred spot is Wright State’s mirror in Dayton, Ohio (unofficial motto of that university:  Wright State, Wrong College).  You can grab R for Linux, OS X, or Windows.  I prefer to run R code on a Linux machine, so I put together a simple Elementary OS VM with R on it.  If you want to go down that route, make sure to add your CRAN mirror as a repository so that you can use apt-get to install R and additional packages.

Once you have R, you could simply go from there, but honestly, you’ll want RStudio.  RStudio is to R what Visual Studio is to C# or F#.  RStudio provides a nice REPL for R, as well as a single application in which you can write scripts, view output, plot results, and maintain files.  It’s an essential tool for anything more than just toying around with R.

Working With Data Sets

Getting R installed and set up is the first part of the battle.  From there, we have to get data sets and analyze them.  By default, R comes with a few data sets, but it’s always more fun to play with your own data.  Let’s use the data set that I use for my Hadoop presentation:  Raleigh crime data from 2006-2013.  If you export that data set as a zip file and then decompress the file, you’ll see that it’s nothing more than a CSV with six columns.  I’m going to copy that data set to /opt/data/Police.csv.  So let’s get started with basic analysis.

First, we need to read the data set from disk, creating a data frame in memory and populating it.  Running this one-liner should give you 413,319 observations of 6 variables:

raleighcrime = read.csv('/opt/data/Police.csv')

So let’s do a few things with this.  First, I want to take a look at the first few rows of the data set to see what I’ve got:

head(raleighcrime)

This shows us that we have a crime code, description, incident datetime, officer’s beat, incident number, and location, and we can see how some of the rows look.  Here’s what it looks like on my system:

[Figure 1.1: head(raleighcrime) output]
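If you don’t have the Raleigh file handy, the same mechanics work on an inline string.  The column names below follow the data set, but the two rows are invented samples:

```r
# textConnection lets read.csv parse a string exactly as it would a file.
# Column names match the Raleigh data set; the rows are fabricated.
csv_text <- "LCR,LCR.DESC,INC.DATETIME,BEAT,INC.NO,LOCATION
25C,BURGLARY/RESIDENTIAL,06/15/2010 02:30:00 PM,101,P10051234,MAIN ST
35B,LARCENY/FROM AUTO,01/09/2007 08:05:00 AM,102,P07004321,OAK AVE"

raleigh_sample <- read.csv(textConnection(csv_text))
nrow(raleigh_sample)  # 2
ncol(raleigh_sample)  # 6
# Note: R 4.0+ reads strings as character by default; older versions
# (like the one used in this post) default to factors.
str(raleigh_sample)
```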

Let’s now look at the structure of this data set:

str(raleighcrime)

[Figure 1.2: str(raleighcrime) output]

str is a fantastic function, as it tells us how R understands the data set.  We see the six variables; by default, R assigns LCR, LCR.DESC, INC.DATETIME, INC.NO, and LOCATION as factors, while it sets BEAT to an integer.  Factors are discrete variables:  think of categories like sex, age band, or favorite color.  Continuous variables, by contrast, take on values along a numeric range, and those are the ones we typically feed into statistical analyses.  In our scenario, I’m going to assume that a beat’s number is just a label with no quantitative meaning (beat 202 is not “twice” beat 101), so I’m going to make BEAT a factor.  I will also change INC.DATETIME to a date type so that I can watch trends over time.

raleighcrime$INCIDENT.DATE = as.Date(raleighcrime$INC.DATETIME, format="%m/%d/%Y %I:%M:%S %p")
head(raleighcrime)

There are a couple of things to unpack here.  First of all, I’m creating a new variable called INCIDENT.DATE, which will be a date type.  Looking back at the data set when we ran head the first time, we see that the incident date-time comes in as MM/DD/YYYY HH:MM:SS {AM/PM}.  When you convert a factor to a date or date-time object, you need to specify the format the data comes in as.  Our final output is a set of dates, meaning that we’ve lost time granularity.  For my basic analysis, I’m okay with that, but if we want to analyze crime by time of day, that time might be important.

[Figure 1.3: head(raleighcrime) showing the new INCIDENT.DATE column]
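To see the format string in action on a single value (the timestamp below is made up, but it follows the file’s layout):

```r
# as.Date parses the date portion and drops the time-of-day.
# %m/%d/%Y %I:%M:%S %p matches values like "06/15/2010 02:30:00 PM";
# %I is the 12-hour clock, which requires the %p AM/PM marker.
d <- as.Date("06/15/2010 02:30:00 PM", format = "%m/%d/%Y %I:%M:%S %p")
format(d, "%Y-%m-%d")  # "2010-06-15"
class(d)               # "Date"
```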

After flipping incident date to a date type, we’ll now switch beat to be a factor:

raleighcrime$BEAT = as.factor(raleighcrime$BEAT)
str(raleighcrime)

Note that this time around, I’ve replaced $BEAT instead of creating a new column. For now, we’ll leave the rest of the columns the same.

[Figure 1.4: str(raleighcrime) with BEAT as a factor]
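As a quick sketch of what as.factor does under the covers (the beat numbers here are invented):

```r
# A factor stores each distinct value once as a level; duplicate entries
# become references to those levels. These beat numbers are invented.
beats <- c(101, 102, 101, 415)
beat_factor <- as.factor(beats)
levels(beat_factor)   # "101" "102" "415"
nlevels(beat_factor)  # 3
# Arithmetic no longer applies: the level is a label, not a quantity.
```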

Data Analysis

We’ve played around with the data set and understand what’s in each column, so let’s get cracking on some basic analysis.  The first thing I want to do is plot this data by year, to see how the year-over-year trends look.  The best part of R is how easy it is to plot data.  Let’s install a package to help us plot this data.

install.packages("ggplot2")
library(ggplot2)

This is an outstanding package for plotting data. Downloading and installing it might take a little while, but it’s well worth the wait. Once we have the package and we include the library in our console, we’ll plot the data by year:

qplot(x = as.factor(as.numeric(format(raleighcrime$INCIDENT.DATE, "%Y"))), xlab = "Year", ylab = "Number of Incidents")

I think I could probably find a more elegant solution for this, but what’s nice is that this one-liner shows me exactly what I want: a bar graph splitting my data by year, showing me the number of incidents in each year. We can see from the Raleigh data set that police incidents have remained fairly steady over the past several years.

[Figure 1.5: bar chart of incidents by year]
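The grouping inside that qplot call boils down to “extract the year, then count.”  With a handful of synthetic dates, you can see at the console the same tally that the bars represent:

```r
# format(date, "%Y") pulls out the year as a character string;
# table() then counts occurrences, which is what the bar chart shows
# visually. These dates are synthetic stand-ins for the real data.
dates <- as.Date(c("2006-03-01", "2006-07-15",
                   "2007-01-09", "2007-11-30", "2007-12-25"))
years <- format(dates, "%Y")
table(years)
# years
# 2006 2007
#    2    3
```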

Given this data set, there are a few more things we could do, such as breaking out specific categories of incidents (for example, combining all of the assault-related incidents together), and we’ll play around with this data set as well as several others in this series.

Full Script

If you’re interested in running the full script, here it is.  Note how simple it is to pull in a data set, modify it, and graph it.  Imagine how many lines of C or other procedural code this might take, and you’ll start to understand the true power of R.

install.packages("ggplot2")
library(ggplot2)

raleighcrime = read.csv('/opt/data/Police.csv')
raleighcrime$INCIDENT.DATE = as.Date(raleighcrime$INC.DATETIME, format="%m/%d/%Y %I:%M:%S %p")
raleighcrime$BEAT = as.factor(raleighcrime$BEAT)
head(raleighcrime)
str(raleighcrime)
qplot(x = as.factor(as.numeric(format(raleighcrime$INCIDENT.DATE, "%Y"))), xlab = "Year", ylab = "Number of Incidents")

Additional Resources

If you’re interested in getting ahead of the curve on R, here are a few articles on the topic: