R: Using dplyr And TidyR

Last time around, I used R to view some Raleigh crime data.  Today, I’m going to touch on a couple of packages which help cleanse data.

Last Time Around

Just to make sure you start off on the right track, here’s the script from last time. If you have questions, start with the introduction post and follow along.

install.packages("ggplot2")
library(ggplot2)

raleighcrime = read.csv('/opt/data/Police.csv')
raleighcrime$INCIDENT.DATE = as.Date(raleighcrime$INC.DATETIME, format="%m/%d/%Y %I:%M:%S %p")
raleighcrime$BEAT = as.factor(raleighcrime$BEAT)
head(raleighcrime)
str(raleighcrime)
qplot(x = as.factor(as.numeric(format(raleighcrime$INCIDENT.DATE, "%Y"))), xlab = "Year", ylab = "Number of Incidents")

Handling Dirty Data

I don’t know about you, but the data sets I use tend to be imperfect in some way.  Maybe the people entering the data didn’t quite do a perfect job, maybe tracking requirements changed over time as we learned what kind of data to track, or maybe there’s some other reason.  Regardless of the reason, we need to be able to fix or strip out bad data before we analyze it.  Fortunately, CRAN is full of packages which help us out, including dplyr and TidyR.

Installation and loading of packages is easy:

install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)

These two packages work extremely well together.  dplyr is a pipeline tool which exposes a set of verbs.  Important verbs include filter(), arrange(), select(), and distinct(), along with helpers like contains().  You can also chain these functions together.  TidyR then lets you pivot and unpivot data.  The authors of TidyR note that tidy data follows Codd's third normal form as it applies to a statistical tool.
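As a quick sketch of what chaining looks like, here's a toy example (the data frame and its columns below are made up for illustration, not part of the Raleigh data set):

```r
library(dplyr)

# A made-up data frame standing in for a real data set
incidents <- data.frame(BEAT = c("101", "101", "202", "202"),
                        CRIME = c("THEFT", "ASSAULT", "THEFT", "THEFT"),
                        stringsAsFactors = FALSE)

# Chain dplyr verbs with the %>% pipe: filter rows, then pull distinct beats
incidents %>%
  filter(CRIME == "THEFT") %>%
  distinct(BEAT)
```

Each verb takes a data frame and returns a data frame, which is what makes the chaining work.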

dplyr

So what kinds of things do we need to do with the Raleigh data?  First of all, we have some data without locations, as you can see if you run the following command:

tail(raleighcrime)

Some of this early data did not have GPS coordinates attached to it, so it wouldn't be very useful to plot on a map.  Following from that, I want to use dplyr and TidyR to do three things:

  1. Strip out records without a valid Location
  2. Break Location out into Latitude and Longitude
  3. Limit Latitude and Longitude to three digits after the decimal point (which is approximately one city block)
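As a rough sanity check on that third point: one degree of latitude is approximately 111 kilometers, so the third decimal place resolves to about 111 meters, which is in the neighborhood of a city block.  A quick back-of-the-envelope calculation (the 111 km figure is an approximation):

```r
# One degree of latitude is ~111 km, so 0.001 degrees is ~111 meters.
111000 * 0.001

# Longitude degrees shrink with latitude; at Raleigh (~35.8 degrees north):
111000 * cos(35.8 * pi / 180) * 0.001  # roughly 90 meters
```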

To do this, I need to do just a little bit of housekeeping and convert my location variable to a string:

raleighcrime$LOCATION = as.character(raleighcrime$LOCATION)

Doing this lets me call the filter() function in dplyr:

rcf <- filter(raleighcrime, LOCATION != "")
tail(rcf)
str(raleighcrime)
str(rcf)

Notice that after running this filter function, the rcf data set no longer has blank locations. It also dropped approximately 16,000 observations, so we’re throwing away roughly 4% of the total data set when we do this.  That’s not fantastic, but for a data set like this, it’s the best we can do.
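To put a number on the loss instead of eyeballing the str() output, you can compare row counts directly (this assumes both raleighcrime and rcf are still in scope):

```r
# How many observations did the filter remove, and what fraction is that?
dropped <- nrow(raleighcrime) - nrow(rcf)
dropped                       # number of observations removed
dropped / nrow(raleighcrime)  # fraction of the data set removed
```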

TidyR

Now that we have all of the records with locations specified, let's break the location out into latitude and longitude.  TidyR has a function called separate() which splits one column into several; by default, it splits on any run of non-alphanumeric characters.  We can't rely on the default behavior here, so we're going to pass an explicit separator to the separate() function.  Our locations look something like (35.77224272249131, -78.637631764409).  The end goal here is to have two separate columns, one which contains 35.772 and the other -78.638.

The first step along the way is to separate the latitude and longitude components into their own variables.  Let’s use the separate() function and do that:

rcf <- separate(rcf, LOCATION, c("LATITUDE", "LONGITUDE"), sep = ",")
head(rcf)

This function will give us latitude and longitude, but we still have to remove the parentheses and strip everything after the thousandths place. To get us to our goal, let's use TidyR's extract_numeric() function along with the built-in round() function:

rcf$LATITUDE <- round(extract_numeric(rcf$LATITUDE), 3)
rcf$LONGITUDE <- round(extract_numeric(rcf$LONGITUDE), 3)
head(rcf)

Make A Pretty(?) Map

The end result is that we now have a set of numeric data points which represent latitude and longitude down to the level of approximately one city block. Because R makes data visualization easy, let's visualize this data. We'll use a package called ggmap to connect to Google Maps and give us a nice background for our data points.

install.packages("ggmap")
library(ggmap)

Once we have ggmap installed, let’s put together an appropriate map box for plotting:

raleighmap <- get_map(location = c(lon = mean(rcf$LONGITUDE), lat = mean(rcf$LATITUDE)), zoom = 11, maptype = "roadmap", scale = 2)

ggmap(raleighmap) + geom_point(data = rcf, aes(x = LONGITUDE, y = LATITUDE), fill = "red", alpha = 0.1, size = 1, shape = 21)

The first line connects to the Google API and grabs a static road map of the Raleigh area. The second line plots all of the individual incidents on the map.

[Image: map of Raleigh with individual crime incidents plotted as points]

As it stands right now, this is a very messy map.  Later on in this R series, we’ll look at ways of handling this data a bit more nicely.

Full Script

Here is the full script for today’s project.  This does include the script from our introduction to R.

install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggmap")
library(ggplot2)
library(dplyr)
library(tidyr)
library(ggmap)

raleighcrime = read.csv('/opt/data/Police.csv')
raleighcrime$INCIDENT.DATE = as.Date(raleighcrime$INC.DATETIME, format="%m/%d/%Y %I:%M:%S %p")
raleighcrime$BEAT = as.factor(raleighcrime$BEAT)
raleighcrime$LOCATION = as.character(raleighcrime$LOCATION)

rcf <- filter(raleighcrime, LOCATION != "")
rcf <- separate(rcf, LOCATION, c("LATITUDE", "LONGITUDE"), sep = ",")
rcf$LATITUDE <- round(extract_numeric(rcf$LATITUDE), 3)
rcf$LONGITUDE <- round(extract_numeric(rcf$LONGITUDE), 3)

raleighmap <- get_map(location = c(lon = mean(rcf$LONGITUDE), lat = mean(rcf$LATITUDE)), zoom = 11, maptype = "roadmap", scale = 2)

ggmap(raleighmap) + geom_point(data = rcf, aes(x = LONGITUDE, y = LATITUDE), fill = "red", alpha = 0.1, size = 1, shape = 21)

Additional Resources

Here are some of the resources I used in building this blog post.  They go into more detail on some of these topics:
