Become A Better Speaker

That’s a big title.  Expect “(on the margin)” in tiny font after the title.

What Other People Say

Let me start off with some great resources, and then I’ll move to a few things that have helped me in the past.

Troy Hunt

Troy is a master of preparation.  Check out the prep work he did for NDC.  I don’t remember if he linked it in either of these posts, but he will schedule tweets for certain parts of his talk because he knows exactly when he’ll get to that point.  Also check out his speaker anti-patterns and fix them.  Of his anti-patterns, I most hate the Reader.  Please don’t read your slides; your audience can read faster than you can speak, and your audience can read the slides later.  Tell engaging stories, tie things together, and give people something they can remember when they do re-read those slides.  You’ll also enjoy his checklist of the 19 things he needs to do before the talk begins.

What I get from Troy:  Practice, practice, practice.  One thing I want to get better at is taping myself and forcing myself to listen to it.  There are verbal tics which you can only find when you listen to yourself.

Brent Ozar

Brent is a top marketer in the SQL Server world.  Here’s some great advice from him on delivering technical presentations.  He’s absolutely right about giving people a link at the end telling them where to get additional materials.  He has additional information on how to handle slides, working with the audience, and getting prepped for the session.  I think his music idea is interesting; at a couple all-day cons we’ve hosted locally, the guy who hosts tends to put on some early morning music before festivities begin.  It lightens up the mood considerably and works a lot better than a bunch of introverts staring quietly at their phones for 40 minutes until someone makes a move to begin.

What I get from Brent:  I’ve cribbed a few things from him.  All of my slide decks have two pieces of information at the end:  “To learn more, go here” and “And for help, contact me.”  The first bit points to a short-form URL I bought (http://CSmore.info), which redirects you to the longer-form address.  On that page, I include slides, a link to my GitHub repo, and additional links and helpful information.  The second bit on contacting me includes my e-mail address and Twitter handle.

Paul Randal

Paul has a great Pluralsight course on communication.  This course isn’t just about presentations, but there are two modules, one on writing presentations and one on giving presentations.  I highly recommend this course.

What I get from Paul:  Practice, prepare, and don’t panic.  Compare against my previous review of this course.  Things will inevitably go wrong, so take spare cables and extras of anything you can fit.  At this point, I even carry a mini projector in case of emergency.  I’ve not needed to use it but there might come a day.

Julia Evans

Julia has a fantastic blog post on improving talks and conferences.  I definitely like her point about understanding your target audience.  Her argument in favor of lightning talks is interesting, and I think for beginners, it’s a good idea.  For more experienced speakers, however, 10 minutes is barely an introduction, and sometimes I want a deep dive.  Those have to be longer talks just by their nature.

Another great point she makes is to give hard talks:  aim for beginners to grasp part of it but for experts to learn from it as well.  I absolutely love Kyle Kingsbury’s work too; he helped me get a handle on distributed systems in a way that let me re-read his posts several months later and pick out points I never caught before.

What I get from Julia:  Find your motivation and make your talks work better for a broader range of people.  I have one talk in particular on the APPLY operator whose goal is to make sure that pretty much anybody, regardless of how long they’ve been in the field, learns something new.  There are a couple of examples which are easier for new users to understand and a couple of examples which are definitely more advanced but still straightforward enough that a new user can get there (even if it does take a little longer).  Ideally, I’d like all of my talks to be that way.

What I Recommend

Here are a few recommendations that I’ll throw out at you.  I’m going to try not to include too much overlap with the above links, as I really want you to read those posts and watch those videos.

  • Practice!  Practice until you know the first five minutes cold.  There are some presenters who will practice a talk multiple times a day for several days in a row.  I’m not one of those people, but if you’re capable of it, go for it.
  • Record yourself.  Find all of those placeholder words and beat them out of yourself.  I don’t mean just “Uh,” “Er,” “Um,” “Ah,” and so on.  In my case, I have a bad habit of starting sentences with “So…”  I’m working on eliminating that habit.  Recordings keep you honest.
  • Tell interesting stories.  To crib from one of my 18th century main men, Edmund Burke, “Example is the best school of mankind, and they will learn at no other.”  Theory ties your work together, and stories drive the audience.  Stories about failure and recovery from failure are particularly interesting; that’s one of the core tenets of drama.
  • Prep and practice your demos.  If you’re modifying anything (databases, settings, etc.) over the course of your demo, have a revert script at the end or revert your VM.  Otherwise, you’ll finish a great talk, forget to roll everything back, give the talk again later, and watch your demos fail because of it.  Not that this has happened to me…
  • Speaking of failure, prepare for failure.
    • Have extra cables.  I have all kinds of adapters for different types of projectors.  I have VGA, DVI (though I’ve only seen one or two projectors which required this), and HDMI adapters for both of my laptops in my bag at all times.
    • Prepare to be offline.  If your talk can be done offline, you should do it that way.  Internet connections at conferences are pretty crappy, and a lot of demo failures can be chalked up to flaky networks.  This means having your slides available locally, having your demo scripts available locally, etc.
    • Have your slides and demo scripts on a spare drive, USB stick, or something.  If all else fails, maybe you can borrow somebody else’s laptop for the talk.  I had to do this once.  It was embarrassing, but I got through it and actually got good scores.  The trick is to adapt, improvise, and overcome.  And you do that with preparation and practice.
    • Have your slides and demo scripts available online.  I know I mentioned assuming that your internet connection will flake out, but if your system flakes out and someone lends you a laptop but can’t accept USB sticks (maybe it’s a company laptop), at least you can grab the slides and code online.
    • If you do need an internet connection, have a MiFi or phone you can tether to your laptop, just in case.  If you have two or three redundant internet sources, the chances of them all failing are much lower than any single one failing.
    • Have a spare laptop if you can.  That’s hard and potentially expensive, but sometimes a computer just goes bye-bye an hour before your presentation.
  • Install updates on your VMs regularly.  Do it two nights before your presentation; that way, if an update nukes your system, you have a day to recover.  Also, it reduces the risk that Windows 10 will pick your presentation as the perfect time to install 700 updates.  Very helpful, that Windows 10 is.
  • When in doubt, draw it out.  I have embraced touchscreens on laptops, bought a nice stylus, and love drawing on the screen.  I think it helps the audience understand where you’re going better than using a laser pointer, and sometimes you don’t have a whiteboard.  If you don’t like touchscreens, ZoomIt still works with a mouse.
  • Speaking of which, learn how to use ZoomIt or some other magnification tool.  Even if you set your fonts bigger (which yes, you need to do), you will want to focus in on certain parts of text or deal with apps like SQL Server Management Studio which have fixed-size sections.

There are tomes of useful information on this topic, so a single blog post won’t have all of the answers, but hopefully this is a start.

Calculating SQL Saturday Distances

EDIT 2016-07-21 17:30:00 — It turns out that my distance calculation query was a little bit off. I was accidentally filtering out SQL Saturdays which did not have any 600-mile collisions in the next week but did have a 600-mile collision in the current week. With this in place, there are now 38 pair collisions out of 209 combos. I’ve updated the commentary portion to match these corrected numbers, but they don’t fundamentally alter the point. I also found that I mistakenly mis-coded Kansas City’s state in 2015; it has been corrected in the data file. Moral of this story: don’t blog too early in the morning?

A couple of days ago, Argenis Fernandez put up a notice that PASS is changing the rules for SQL Saturdays:

The first change we are going to make is to the amount of funding support we provide organizers. Currently any organizer, whatever the size of their SQLSaturday event, receives $500. Starting January 1, 2017, we will be reducing this amount to $250 and targeting those cities/locations that really need the financial support to grow, rather than well-established events that are better capable of attracting sponsors and financial support. When assessing those events that require financial support, PASS will be meeting with event organizers and complete a review of past performance.

Which brings me to the second change: implementing a new 600-mile radius and concurrent event timing restriction to reduce competition between individual SQLSaturday events in Canada/USA.  The event timing restriction means that events within the 600-mile radius cannot occur on the same weekend or one weekend prior or after.  This will help to deliver a more consistent and satisfying experience for sponsors, which will reduce sponsor exhaustion and assist with increasing overall ROI.  The existing 400-mile radius restriction for all other regions will remain.

I don’t mind the first rule change at all, and from what I’ve seen, I don’t think anybody does.  That $250 might be the difference between breaking even and going slightly in the hole, but if you know about it beforehand, it’s not that hard to squeeze $250 out of a budget or find $250 in sponsorships somewhere.  As a result, it saves PASS somewhere between $25-50K a year and most events won’t notice a difference.

The second change, however, has been…controversial.  Steve Jones has a well-written argument against the rule change, particularly in the way this was announced.

How Much Does This Change?

I wanted to check out the SQL Saturdays scheduled for 2015 and 2016 and see how many would have been affected by the new rule.  So let’s take a look!

Getting The Data

I decided to go with a simplistic approach.  Rather than looking up the specific addresses of each SQL Saturday, I went by city name.  The way I figure it, if I’m 10-20 miles off, it won’t make a huge difference.  I also decided to limit my focus to the US and Canada, as those are the areas affected by the rule change.

I coded each event with city, state (or province), and event date.  You can get my results here in CSV format.

Geocoding The Data

My CSV doesn’t include latitude and longitude, so I’m going to load the data into R, get that latitude and longitude information, and then put it into SQL Server. First create the SQL Server table:

CREATE TABLE dbo.SQLSaturdays
(
	City VARCHAR(40),
	State CHAR(2),
	EventDate DATE,
	Latitude DECIMAL(8, 4),
	Longitude DECIMAL(8, 4),
	GeoPoint AS GEOGRAPHY::Point(Latitude, Longitude, 4326)
);
GO

And here’s the R script I threw together:

install.packages("ggmap")
install.packages("RODBC")
library(ggmap)
library(RODBC)

sqlsat <- read.csv("C:\\Temp\\2015_2016_SQLSaturdays.csv")
sqlsat$City <- as.character(sqlsat$City)
sqlsat$State <- as.character(sqlsat$State)
sqlsat$Date <- as.Date(sqlsat$Date, format="%m/%d/%Y")
sqlsat$GeoLookup <- paste(sqlsat$City, sqlsat$State, sep = " ")

# Perform Google Maps geocoding.  Google Maps provides 1000 geocoding hits per day for free.
# Our data set is only 106 observations, so we can do that without issue.
sqlsat <- cbind(sqlsat, t(sapply(sqlsat$GeoLookup, geocode, USE.NAMES=FALSE)))

sqlsat

conn <- odbcDriverConnect("Driver=SQL Server;Server=.;Initial Catalog=Scratch;Provider=SQLNCLI11.1;Integrated Security=SSPI")

# Via http://stackoverflow.com/questions/14334840/how-to-insert-a-dataframe-into-a-sql-server-table
# Not a great practice, but for 106 rows, it'll do.
values <- paste("('",sqlsat$City,"','",sqlsat$State,"','",sqlsat$Date,"',",sqlsat$lat,",",sqlsat$lon,")", sep="", collapse=",")
cmd <- paste("INSERT INTO Scratch.dbo.SQLSaturdays(City, State, EventDate, Latitude, Longitude) VALUES ", values)
result <- sqlQuery(conn, cmd, as.is=TRUE)

close(conn)

There are a couple not-so-great practices (particularly around the way I inserted data into SQL Server) but it does the job, especially when you only have 106 rows.  Also, if you want to play along at home, you’ll probably want to change the connection and database names.

Determining Distance

From there, I ran a T-SQL query to do the following:

  1. Turn latitude and longitude into GEOGRAPHY points
  2. Find SQL Saturday pairs which occur within one week of one another
  3. Calculate the distance in miles between these two city pairs
  4. Return only city pairs which are less than 600 miles apart

Here’s the script:

WITH sqlsats AS
(
	SELECT
		City,
		State,
		EventDate,
		Latitude,
		Longitude,
		GeoPoint,
		DENSE_RANK() OVER (ORDER BY EventDate) dr
	FROM dbo.SQLSaturdays s
),
chronoprox AS
(
	SELECT
		s.City,
		s.State,
		s.EventDate,
		s.Latitude,
		s.Longitude,
		s.GeoPoint,
		sNext.City AS NextCity,
		sNext.State AS NextState,
		sNext.EventDate AS NextEventDate,
		sNext.GeoPoint AS NextGeoPoint
	FROM sqlsats s
		LEFT OUTER JOIN sqlsats sNext
			ON (s.dr = sNext.dr - 1 OR s.dr = sNext.dr)
			AND s.EventDate >= DATEADD(DAY, -7, sNext.EventDate)
			AND NOT (s.City = sNext.City AND s.State = sNext.State)
),
geoprox AS
(
	SELECT
		cp.City,
		cp.State,
		cp.EventDate,
		cp.Latitude,
		cp.Longitude,
		cp.GeoPoint,
		cp.NextCity,
		cp.NextState,
		cp.NextEventDate,
		cp.NextGeoPoint,
		cp.GeoPoint.STDistance(cp.NextGeoPoint) / 1609.344 AS DistanceInMiles
	FROM chronoprox cp
	WHERE
		cp.NextGeoPoint IS NOT NULL
)
SELECT
	gp.City,
	gp.State,
	gp.EventDate,
	gp.NextCity,
	gp.NextState,
	gp.NextEventDate,
	gp.DistanceInMiles
FROM geoprox gp
WHERE
	gp.DistanceInMiles < 600
ORDER BY
	gp.NextEventDate;

The end result is that there were 38 city pairs with a distance of less than 600 miles in 2015 and 2016.  23 of these pairings took place in 2015, and 15 in 2016 (including scheduled events which have not yet happened).

Deriving Conclusions

10 of the 38 city-pairs were in the run-up to PASS Summit 2015, and Raleigh & Charlotte were right in the middle of that.

As Grand Poobah of the Raleigh SQL Saturday, this does affect me, as we have some larger events (Atlanta, Orlando, Washington, Charlotte) in our region, as well as smaller events (Richmond, Spartanburg).  Finding a good date has become a bit harder (in my case, because I don’t want to jump claim on any of those events’ dates), but I also don’t want to oversell that difficulty:  out of 209 potential collisions, we only saw 38.  And that’s factoring in current concerns people have about the total number of available dates, such as how you would never have a SQL Saturday around Thanksgiving or Christmas or Labor Day or…

I also don’t want to undersell the marginal increase in difficulty for smaller SQL Saturdays to keep going—it’s hard to find a venue even in the best of circumstances (trust me, I know!), and feeling like you’re going to be locked out of a good portion of the year reduces your options even further.  My original concern was mostly around midwestern events, as they’re going to have the most geographical overlap, but if you check out the pairs, the southeast is actually the hardest-hit region:

--I took the results from the previous query and put them into a temp table called #tmp.
WITH records AS
(
	SELECT
		City,
		State
	FROM #tmp
	UNION ALL
	SELECT
		nextcity,
		nextstate
	FROM #tmp
)
SELECT
	City,
	State,
	COUNT(1) AS Collisions
FROM records
GROUP BY
	City,
	State
ORDER BY
	Collisions DESC;

[Image: SQL Saturday collision counts by city]

Kansas City tops the list with 6 collisions, followed by Raleigh with 5 (go us!).  Also near the top of the list, Spartanburg, Dallas, Columbus (GA), and Orlando have 4 apiece.  Note that the total number of collisions adds up to 76 because I include both “sides” of the 38 collisions.  This still affects 34 separate events, though.

What If?

While I have the data, how about we play around with a couple of alternatives?

  1. If we shrink the collision radius to 500 miles, we’re down to 30 collisions, including 14 in 2016.
  2. At 400 miles (the international limit), we have 18 collision pairs, including 7 in 2016.
  3. At 700 miles, we’re up to 54 collisions, including 20 in 2016.  As a side note, at 700 miles, we would have had collisions in all 9 of the cities which had SQL Saturdays in the first three weeks of October 2015 (Columbus GA, Pittsburgh, KC, Orlando, Raleigh, Minneapolis, Boston, Charlotte, Dallas). At 600 miles, Boston just barely escapes the list.

Final Thoughts (For Now)

I’m still on the fence about this decision.  As a speaker, I like the fact that there are so many events on the east coast and midwest, and I’d hate to see the number of opportunities to visit other cities and try to help people learn drop as a result of this.  Looking at the data, I think we’ll lose some events on the margin.  There is some opportunity for growth in less desirable times of the year (think July), but the problem is that if you’re already running a smaller event, picking a bad time of year will guarantee that not many people will be able to show up.

But at the same time, I’ve heard from several events that I’ve attended that sponsorships are drying up this year.  If that’s the case across the board, then we might have reached the event limit, particularly in areas like the southeast which have a large number of events.

Spark, R, And Zeppelin

We are going to combine three of my interests today:  Spark, data analysis, and notebooks.  Unlike last week’s discussion of notebooks, I’m going to use Apache Zeppelin today.  In today’s post, we will load some restaurant data into Spark and then build a notebook to visualize some of that data.

Getting Started

If you have not done so, follow through my Getting Started with Spark guide.  This will get restaurant data loaded into Spark.  I’m going to assume that you’ve created a cluster, loaded the restaurant table (but make sure you have the version with no NAs!), and are ready to create a notebook.

Create A Notebook

Our first step will be to create an R notebook.

[Image: Creating a new notebook]

[Image: Creating an R notebook]

My goal is to do some of the things that I did in my Touching on Advanced Topics post.  Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below.  As a result, I was only able to do some—but not all—of the anticipated work.  I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.

With that in mind, let’s start messing around.

Analysis

The Struggles

I think it might be more useful to start with the various roadblocks I hit while working on this blog post.  SparkR is a different paradigm, and as someone with a tenuous hold on R, I found it a lot harder than I wanted.  Here are a few of the things I had trouble with, in the hopes that the solutions will save you a bit of time in the future.

Overwritten Functions

If you install all of the packages from the Jupyter notebook, you’ll find that dplyr has a sql command which will override the Spark SQL command and leave you in a bad place if you don’t know what’s going on.  If you get stuck in this bad place, you can fully qualify the function call; for Spark, the sql function is SparkR::sql.
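As a minimal sketch (using the same sqlContext and restaurants table that show up later in this post), the fully-qualified call looks like this:

# dplyr::sql may have masked SparkR's sql function, so qualify the namespace explicitly.
restaurantsDF <- SparkR::sql(sqlContext, "SELECT * FROM restaurants LIMIT 10")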

Unexpected Results

One of the first struggles I had was that certain operations return data frames rather than vectors, for example, mean.  I was able to use Wendy Yu’s great SparkR tutorial to figure out a way to get mean latitude and longitude from my data set.
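The pattern I ended up with looks something like this hedged sketch (the DataFrame and column names here are illustrative; the real latitude and longitude calls appear further down):

# agg() on a SparkR DataFrame returns a one-row DataFrame rather than a scalar.
# collect() pulls that row down as a local R data frame, and then we extract the value.
someMean <- collect(agg(someSparkDF, mean = mean(someSparkDF$someColumn)))$mean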

Lack of Results

Another major problem I had was that Java is much stricter about data types than R, meaning that when I tried to perform some operation against my restaurants table (in which score is defined as a double), Java would fail with an exception saying that NA is not a valid numeric value.  No matter how I wrote my SQL statement, I was unable to ignore that NA data.  I even switched over to Python, thinking it was a SparkR issue, but it turns out to be part of Spark itself.  I’m sure there’s some flag somewhere that I’m not setting which would let me ignore those bad values, but I haven’t found it yet.

Data Frames Aren’t DataFrames

Another issue came up when I started using ggplot2.  When I tried to hook up my Spark data frame (which is of type DataFrame), ggplot2 spat back the following error:

Error : ggplot2 doesn't know how to deal with data of class DataFrame

Fortunately, Jose Dianes has an article which discusses this exact problem.  You have to collect the distributed DataFrame before ggplot2 can understand what it is.

The Code

Okay, we’ve spent enough time complaining about my failings; now let’s actually show what we can do.  Unlike the Advanced Topics notebook, I’m going to keep this fairly compact and not meander quite as much as I did there (for pedagogical purposes, I promise).

Step one is to load the restaurant data:

restaurantsDF <- sql(sqlContext, "SELECT COUNT(1) AS numberOfResults, AVG(score) AS meanscore, ROUND(lat, 1) AS lat, ROUND(long, 1) AS long FROM restaurants GROUP BY ROUND(lat, 1), ROUND(long, 1)")

Note that in this version, I’m going to specify a narrowed-down query with the relevant attributes and filters I want.  I’ve found this to be preferable to the classic R approach of loading all the data with read.csv and importing dplyr and other tools to filter the data after the fact, particularly because of differences in data frame operations, as we’ll see below.

Step two is to install relevant packages:

install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("ggmap", repos = "http://cran.us.r-project.org")
library(ggplot2)
library(ggmap)
print('Finished loading libraries.')

I only need ggplot2 and ggmap because I did my transformations earlier.

Step three is where I calculate the mean latitude and longitude.  These values are technically off slightly because I’m taking the mean of pre-aggregated lat-long combos rather than the mean of the underlying population, but the difference is small enough that it doesn’t make an appreciable difference.

r.lat <- collect(agg(restaurantsDF, mean=mean(restaurantsDF$lat)))$mean
r.long <- collect(agg(restaurantsDF, mean=mean(restaurantsDF$long)))$mean

With those calculated and collected, I’m able to pass them in as parameters to ggmap’s get_map function, so we can build a Google map.

#Set wakemap to work around the mean latitude and longitude values we calculated above.  We've set the zoom value to 10 so you can see the entire region.
wakemap <- get_map(location = c(lon = r.long, lat = r.lat), zoom = 10, maptype = "roadmap", scale = 2)

Once I have those, I need to convert my Spark DataFrame class object to a local R data frame.  I do that by calling the collect method.  Because I only need three of the four attributes, I’ll show off the select function as well.

restgrp <- collect(select(restaurantsDF, "meanscore", "lat", "long"))

This data frame is something ggplot2 can use.  So let’s run ggmap and fill in results:

# Now we can fill in ggmap with settings.  We want three function calls:
# scale_fill_gradient (to give us a visual cue of restaurants.  Pick good colors for low & high.)
# geom_text (to display the meanscore, giving us precise values.  Round meanscore to 1 spot after decimal)
# geom_tile (to display blocks of color.  Set alpha = 1)
ggmap(wakemap) +
  scale_fill_gradient(low="black",high="orange") +
  geom_text(data = restgrp, aes(x=long, y=lat, fill = meanscore, label = round(meanscore, 1))) +
  geom_tile(data = restgrp, aes(x=long, y=lat, alpha=1, fill=meanscore)) 

The end result looks a bit like this:

[Image: Heatmap of mean restaurant scores over the area map]

If you want to get this as a notebook, I’ve made the R script available here.  You can publish Databricks notebooks, but they’re only public for a period of six months and I don’t want to have a broken link in a few months’ time.

Conclusions

Why Use Zeppelin?

Throughout all of this, I didn’t even mention one of the big advantages to using Zeppelin:  the built-in visualization components.  Zeppelin has made it easier for third parties to display graphics, and so visualizing a data set is as simple as running the display command.  For example, here’s a nice-looking histogram of mean restaurant scores by lat-long pair:

[Image: Histogram of mean restaurant scores by lat-long pair]
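That histogram comes from a call along these lines (a sketch; display is the notebook’s built-in visualization helper, and the DataFrame name matches the one used above):

# The notebook renders the DataFrame with built-in plotting controls;
# switching the plot options turns the grid view into the histogram shown above.
display(restaurantsDF)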

That’s a one-line operation, and you can change the plot options to perform other visualizations as well.  This type of visualization is production-worthy, so I give the Databricks team plaudits for making this available to us in such an easy-to-use format.

Final Thoughts

To be honest, I wanted this to be a blog post in which I described how easy the conversion from Jupyter and local R to Zeppelin and SparkR would be.  In reality, there’s a gap between the two.  There are some good articles to help you bridge that gap—I mentioned Wendy Yu’s and Jose Dianes’s—but this certainly is not an exercise in copy-paste; there’s a lot more going on here.  Nonetheless, I recommend getting a handle on SparkR, as the ability to perform data analysis on distributed compute is powerful, especially if you can’t get your boss to approve that desktop with 64 GB of RAM…

Pluralsight Reviews: Exploration With R

I just wrapped up watching a Pluralsight course entitled Exploratory Data Analysis with R.  Matthew Renze put together a nice introduction to R.  The first module helps someone brand new to R install R and RStudio, but from there, it’s digging into descriptive analytics.  The course is a bit light on CRAN packages, instead focusing mostly on what you can do within the core of R.

This course is meant to be an introduction to his Beginning Data Visualization with R course.  As such, expect a review of that in the not-too-distant future.

R: Touching On Advanced Topics

Wrapping up this mini-series on R, I’m going to link to a notebook which touches on some more advanced topics, especially from the standpoint of a one-hour presentation starting from scratch.  In this last post, I plan to gloss over the details and give you the notebook links and let you see what’s going on.

Notebook Details

This notebook started as a lab assignment for the Polyglot .NET group in the Raleigh-Durham area.  Nobody got all the way through the lab, but people did get far enough that I’d call the results a success.

Basically, there are two versions of this notebook:  the lab version (which I gave to attendees) and the completed version.  If you feel like digging through the lab for a couple hours, you might get more out of it than simply looking at the completed version.

Basic Ideas

The Data Set

The original data set came from the town of Cary, North Carolina’s open data portal.  This data set includes restaurant inspections over the course of approximately one decade.  It includes some very interesting details, but is missing one huge part:  it has address information, but no latitude and longitude information.

My original plan for turning address data into latitude and longitude was to hit the Google Maps API, which offers 1000 free requests per day.  1000 hits per day is a huge number most days, but when you have over 13,000 requests, it’s not that great an option.  As a result, I ended up using the street2coordinates function inside RDSTK.  This is a bit slower than Google Maps and doesn’t have the nice geocoding methods that ggmap has for Google Maps, but I was able to let the process run overnight and get latitude and longitude details.  From there, I saved the file as a text file and made it available for general use.
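A rough sketch of that overnight geocoding run might look like the following (the inspections data frame, its Address column, and the names of the returned columns are assumptions on my part, not the exact notebook code):

library(RDSTK)

# street2coordinates() calls the Data Science Toolkit geocoder and returns a data frame
# which includes latitude and longitude values for the address.
geocodeOne <- function(address) {
  result <- street2coordinates(address)
  c(lat = result$latitude[1], long = result$longitude[1])
}

# Slow compared to Google Maps, but it can churn through ~13,000 addresses overnight.
coords <- t(sapply(inspections$Address, geocodeOne, USE.NAMES = FALSE))
inspections <- cbind(inspections, coords)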

The Process

This notebook takes you through my version of data analysis, including some false starts thrown in to help you understand.  I like examples that have built-in failure points because we see them in real life far more often than successes.  I want to show how I react after hitting a brick wall and the ways I got around the issue at hand, which in turn helps you later on when you hit a similar wall.

Cool Things We Do

This notebook is full of things that I consider really nice about R:

  • Using ggmap to plot data on real Google Maps images, letting users gain immense context
  • Cleansing data using dplyr, including filtering results, grouping them, and creating new summary variables that we can plot on a map (see the sketch after this list)
  • Making area plots of mean restaurant scores for the whole of Wake County as well as the town of Cary
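To give a rough idea of the dplyr piece, here is a sketch (the inspections data frame and its column names are stand-ins for what the notebook actually uses):

library(dplyr)

# Filter out rows with missing scores, group by rounded lat-long blocks,
# and create summary variables we can plot on the map.
restgrp <- inspections %>%
  filter(!is.na(score)) %>%
  group_by(lat = round(lat, 1), long = round(long, 1)) %>%
  summarize(numberOfResults = n(), meanscore = mean(score))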

Conclusion

This post wraps up my R mini-series.  It barely scratches the surface of what you can do with R, but I’m hoping it inspires SQL Server developers to get in on the data analysis game.

R: Database Access

In today’s post, I’m going to use RODBC to connect to SQL Server and perform a few operations.

Pre-Requisites

There are a couple of steps we’re going to have to do in order to get RODBC to work with SQL Server.

Grab The Notebook

The first step is to grab the Basic SQL Server Connection notebook from my GitHub repo.  We’ll follow along with this notebook throughout the blog post.

Grab The Database Backup

If you want to follow along, I have a backup of PASSData which you can download from the GitHub repo.  It’s about 20 MB and does require SQL Server 2016.  If you don’t want to download this file or don’t have SQL Server 2016 installed, you can still get most of the benefit from this notebook by using your own DSN and writing your own queries.

Set Up Your DSN

If You’re On Linux

First, if you’re running on Linux, you’ve got a bit of work to do.  For this post, I’m going to assume that you’re running Windows, but if you are on Linux, follow the instructions above and you should be able to follow along.  You’ll probably have to change the connection to use SQL authentication instead of Windows authentication, but that’s a minor one-line change in the conn <- odbcConnect("PASSData") block.
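As a sketch of that one-line change, you could swap the DSN call for a SQL-authentication connection string (the driver name, login, and password below are placeholders, not values from this post):

library(RODBC)

# Connect with SQL authentication instead of integrated security.
conn <- odbcDriverConnect("Driver={ODBC Driver 13 for SQL Server};Server=localhost;Database=PASSData;UID=ruser;PWD=SomePassword")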

If You’re On Windows

Before you begin connecting to a SQL Server instance, you need to set up an ODBC connection. To do this in Windows, go to the ODBC Data Sources application and create a SYSTEM-level data source.

Once you have that data source loaded, you can use RODBC to connect to SQL Server using that DSN. In this case, I created a data source named “PASSData” which connects to a local SQL Server instance’s PASSData database.

Learning What RODBC Can Do

First, we need to get RODBC from CRAN.  RODBC is a package which allows you to connect to ODBC-compliant databases using R.  This package allows you to connect to a wide variety of database products, which has its positives and negatives.  The big positive is that you don’t need to use SQL Server on the back end:  you can connect to Oracle, MySQL, Postgres, or a large number of other products.  The big negative is that this means less effort can be put into vendor-specific extensions.  RODBC isn’t quite a least-common-denominator product, but it won’t be optimized for every vendor’s particular use cases.

We install RODBC with the following commands:

install.packages("RODBC", repos = "http://cran.us.r-project.org")
library(RODBC)

If you’d like more details on RODBC, I recommend checking out the documentation.

help(RODBC)
RShowDoc("RODBC", package="RODBC")

RODBC provides the odbcConnect function, which lets you connect to a database using a DSN (its sibling odbcDriverConnect takes a full connection string instead).  We’re going to use the DSN we created above.
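Here is what that looks like as a minimal sketch, assuming the PASSData DSN from the previous section:

library(RODBC)

# Connect using the system DSN; integrated security means no credentials are needed here.
conn <- odbcConnect("PASSData")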

Data Retrieval

We have a couple methods for getting data.  First, we can get metadata using built-in methods.  Here’s how to get a list of all tables:

sqlTables(conn, schema = "dbo")

Once you have that, you can easily grab all of the rows in a table using the sqlFetch command:

registrations <- sqlFetch(conn, "dbo.SQLSatRegistrations")

If you want a more fine-grained query or want to write SQL to perform some of the early data cleansing, write a sqlQuery:

sessions <- sqlQuery(conn, paste("SELECT TOP(50) EventName, EventDate, SessionName",
                                 "FROM dbo.SQLSatSessions",
                                 "WHERE SatNum = 217",
                                 "ORDER BY SessionName DESC"))

Other Operations

You can also insert and update data, create tables, and manage tables using RODBC, but for the most part, aside from inserting cleansed data, I’d personally shy away from using a data analysis tool for database management.
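That said, if you do want to push cleansed data back to SQL Server, sqlSave handles the insert side; here is a sketch with a hypothetical data frame and table name:

# Write a local data frame out to a new table (use append = TRUE to add rows to an existing one).
sqlSave(conn, cleansedSessions, tablename = "CleansedSessions", rownames = FALSE)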

At any rate, once you’re done with the connection, close it using the close command:

close(conn)

Conclusion

Given the continued immense popularity of relational databases, it makes sense to know how to connect to a relational database using R.  RODBC is just one package which is up to the task—RJDBC and RSQLServer are two others.  If you find yourself having performance problems with RODBC on SQL Server, try the others.

R: The Basics Of Notebooks

This blog post is in anticipation of SQL Saturday Columbus, at which I’m going to give a talk introducing R to SQL Server developers.  My primary vehicle for explaining R will be notebooks.

Notebook Basics

I’ll start with two big questions:  what are notebooks and why should we use them?

Notebooks Are…

Remember chemistry class in high school or college?  You might remember having to keep a lab notebook for your experiments.  The purpose of this notebook was two-fold:  first, so you could remember what you did and why you did each step; second, so others could repeat what you did.  A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”

Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here.  You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.

Why Should We Use Them?

There are two separate reasons why we should use notebooks.  The first ties back to the reason you used them in chem lab:  the potential for independent verification.  If I want to publish a study—even if it’s just an internal analysis of company metrics—I want another analyst to be able to follow what I did and ensure that I made correct choices.

Aside from independent validation, there is a second big reason to use notebooks:  cross-language compatibility.  If your organization only uses R to perform data analysis, a notebook might not be necessary; you can get by with well-documented and thorough R scripts.  But as soon as somebody wants to write Python or Julia code, the classic technique starts to get tricky.  Notebooks help you organize analyses and write code across a number of languages.

In addition to multi-lingual capabilities, notebooks make documentation easier.  The major notebooks support Markdown, an easy way of formatting text.  This allows you to create pretty and thorough documentation.  I recently went through an EdX course on Spark, and their labs were amazing.

[Image: Markdown-formatted lab from the EdX Spark course]

It’s hard to do justice to their labs in a single screenshot because of how thorough they were, but you can get the gist of what Markdown can do here:  separate sections, headings, images, code snippets, and a large number of other features without having to write a mess of HTML.

Which Notebook To Use?

There are two major notebooks out there today:  Zeppelin and Jupyter.  I’ve used both and like both.  Hoc Q. Phan has a good comparison of the two, and I’d say it’s fair:  they’re both pretty close in terms of quality, and your particular use case will probably determine which you choose.

For the purpose of my talk, I’m going to use Jupyter, as that’s something my attendees can install pretty easily.  I also have blog posts on installing it for Windows and Linux.

Working With Jupyter

Running Jupyter is easy:  open up a command prompt and type in “jupyter notebook” after you’ve installed Jupyter.  That dumps me in my home directory on Windows, so I’m going with that and storing my notebooks in Documents\Notebooks.

[Image: The Jupyter notebook listing]

If you want to create a new notebook, it’s as easy as hitting the New button:

[Image: The New notebook button]

Running An Existing Notebook

In my case, I’m going to open the Basic R Test notebook, which you can download for yourself if you’d like.  Clicking on a notebook brings it up in a new tab:

[Image: The Basic R Test notebook]

My notebooks are nowhere near as beautiful as the Spark course notebooks, but they do get the point across at least.  If you want to modify existing Markdown, double-click on the section of text you’d like to edit:

[Image: Editing a Markdown cell]

This changes the pretty text into a Markdown editor block.  When you’re done making changes, press Shift+Enter to run the block.  Running Markdown blocks simply makes them display prettified HTML and moves the focus to the next block.  The next block is a code block.

[Image: A code cell]

The code here is extremely simple:  we’re going to create a vector with one element (an integer whose value is 500) and assign it to y.  Then, we’re going to print the contents of y.  To run the block, again, press Shift+Enter or you can hit the Play button instead.
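That cell boils down to something like this (a sketch based on the description; note that R treats a bare 500 as numeric rather than integer unless you write 500L):

# Assign a one-element vector to y, then print its contents.
y <- 500
print(y)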

[Image: Running the code cell]

The results of the code block appear on-screen.  One of the nice things about notebooks is that you can include any R outputs as part of the output set.  To show you what I mean, I’m going to select the Run All option from the Cell menu.

[Image: The Run All option in the Cell menu]

This will run all of the cells again, starting from the top.  I could select “Run All Below” and skip re-assigning y, but in this case, it doesn’t hurt anything.  After running everything, I can scroll down and see this section of results:

[Image: Results after running all cells]

There are two separate commands here.  First, I’m running a head command to grab the first 3 elements from the set.  Second, I’m running a plot command to plot two variables.  In neither of these cases do I spend any time on formatting; the notebook does all of that for me.  That’s a major time savings.
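In sketch form, those two cells look something like the following (the data set and variable names are stand-ins for whatever is in the notebook):

# Grab the first three elements from the data set.
head(someData, 3)

# Plot two variables against each other; the notebook renders the output inline.
plot(someData$x, someData$y)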

Inserting New Cells

Inserting new cells is easy.  If you want to insert a new cell at the bottom of a notebook, start typing in the cell at the bottom of the page.

[Image: A new cell at the bottom of the notebook]

You can define what type of cell it is with the cell type menu:

[Image: The cell type menu]

In this case, I’ve selected Markdown, so this will be a commentary cell.  I fill in details and when I’m done, hit Shift+Enter to run the cell.  But let’s say that you want to insert something in between two already-existing cells.  That’s easy too:  in the Insert menu, you have two options:  Insert Cell Above or Insert Cell Below.

Conclusion

Notebooks are powerful tools for data analytics.  I wouldn’t use a notebook as my primary development environment, though.  For R, I’d write the code in RStudio, and once my process is complete, I’d transfer it to a notebook, add nice Markdown, and make it generally understandable.

The rest of this week’s R posts will use notebooks to walk through concepts, as I think that this is easier to understand than giving somebody a big script and embedding comments (or worse, not even including comments).