Getting Started With Custom Visuals

I’m going to be presenting a new talk for the Raleigh-Durham Power BI Users Group.  The talk is entitled Getting Started with Custom Visuals and will cover a half-dozen custom visuals as well as where you can go to get more of them.

If you’re interested in seeing this in webinar form, I’m going to give a version of this talk this upcoming Tuesday at 1 PM Eastern.  Registration is free, so sign up today.

TIL: Slack + SQL Server

No, I don’t mean the SQL Server community on Slack—although you should definitely join up there.

I have a client whose primary method of communication is Slack.  They have e-mail accounts, but hardly use them.  They also have no real monitoring solution in place, so I wanted to be able to do some quick checks and, in the event of an error, push a message to a Slack channel that they could monitor.

It turns out that there is already a GitHub repo for a CLR function to push a message out to Slack.  The folks at PowerupCloud have a nice step-by-step walkthrough on how they did it.  I wish they had continued the series with further integration, particularly around SQL Agent alerts.  If you’re a paying Slack customer, you can integrate e-mail with Slack.  You could also throw Zapier into the mix and possibly get a free integration point.

If you want to avoid the external integration process, you can read the default trace or system health XE (about which I learned a lot from Ed Leighton-Dick’s presentation on the topic) and write calls to Slack based on results there.
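If you go the do-it-yourself route, the check itself can be as simple as polling the default trace for recent error events and pushing anything you find to Slack.  Here is a minimal sketch; the event class and time window are just examples, and the actual Slack push would go through whichever CLR function or webhook integration you settle on.

DECLARE @TracePath NVARCHAR(260) = (SELECT path FROM sys.traces WHERE is_default = 1);

SELECT
	gt.StartTime,
	gt.HostName,
	gt.ApplicationName,
	gt.TextData
FROM sys.fn_trace_gettable(@TracePath, DEFAULT) gt
WHERE
	gt.EventClass = 22	--ErrorLog events captured by the default trace
	AND gt.StartTime > DATEADD(MINUTE, -10, GETDATE());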

Pluralsight Reviews: R Data Visualization

Beginning Data Visualization with R is Matthew Renze’s second course on R for Pluralsight, and it follows from the first on exploratory data analysis (review here). In this second course, Matthew takes a look at univariate and bivariate data visualization.

The flow of the course is similar to the first course: Matthew’s examples are based around a fictitious company and use the Open Movie Database for raw data. In this particular course, we spend less time learning about transformations and more time on the three biggest graphics systems within R: the base R graphics library, Lattice, and ggplot2. Matthew does a good job of showing how to create various forms of plots with these different libraries, showing when to use each type (e.g., box plots, bar graphs, and scatter plots). He also shows cases in which certain libraries do not have particular graph types.

I think this was a great course. If I had one knock, it’s that some of the examples suffered due to the data set. For example, Matthew showed us heatmaps and density plots of the number of movies by runtime. The plots did work and they did show what you can do, but I would have preferred to see him use a data set with at least two clusters so you could really see the power of these types of charts.

Regardless of that minor complaint, if you are interested in learning how to perform visualizations with R, this is a great course. I’m looking forward to his intermediate and advanced-level (forthcoming) courses as well.

TIL: RetainSameConnection

I recently had to debug a co-worker’s SSIS package, trying to figure out why it wasn’t working quite right.  We have a sharded infrastructure, meaning that our customers’ data is split out among a number of servers with equivalent schemas.  For this particular package, we wanted to, for each customer, get the relevant data from a common ETL server and push results into the appropriate server for that customer.

My co-worker had set up a dynamic connection (see Rafael Salas or Hari Bagra for details on how to do this), but something weird was happening:  the package was trying to push everything to the same server.  I confirmed that if all relevant customers loaded were for the same server, the process would work correctly, and that I could run each server load one at a time, so there weren’t any problems connecting to particular servers or parameters overriding this choice.  It’s like the connection was “sticky,” connecting successfully to the first server and then ignoring later changes.

It turns out, that’s exactly the case.  He had turned on the RetainSameConnection flag for the connection manager.  If you set the value to true, then SSIS will open a single connection for the duration of the package run.  There are a couple of scenarios in which you might want to set this to true:

  1. You have a complex package which hits the same connections over and over again, or you have to do lots of operations in a foreach loop.  This can remove the overhead of re-opening connections over and over again, thereby improving performance.
  2. You are using global temp tables and want multiple SSIS tasks to manipulate data in those global temp tables.

My co-worker was doing option #2, setting up a global temp table.  He turned on RetainSameConnection to get that functionality working, not realizing that it necessarily precluded the other requirement, that this package be able to modify the connection to point to a different server based on the current customer.

As a note, if you set RetainSameConnection to false, you will lose the ability to use global temp tables in multiple SSIS tasks.  The reason is that the connection will close between tasks, and once the last session which references a global temp table closes, that temp table can be marked for cleanup.
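To make the global temp table scenario concrete, here is a minimal sketch of the pattern, using hypothetical table and column names, spread across two Execute SQL Tasks which share a single connection manager.

-- Execute SQL Task 1:  stage data in a global temp table.
CREATE TABLE ##CustomerStage
(
	CustomerID INT NOT NULL,
	CustomerName NVARCHAR(100) NOT NULL
);

INSERT INTO ##CustomerStage
(
	CustomerID,
	CustomerName
)
SELECT
	CustomerID,
	CustomerName
FROM dbo.Customer;

-- Execute SQL Task 2:  consume the staged data.
-- With RetainSameConnection set to false, this task runs on a new session.  By that point,
-- the session which created ##CustomerStage has closed, the table has been cleaned up,
-- and this statement fails with an invalid object name error.
SELECT
	CustomerID,
	CustomerName
FROM ##CustomerStage;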

In my case, I ripped out the global temp table logic and set RetainSameConnection to False.  That allowed me to use dynamic connections as expected.

SQL Saturday Distances, A Closer Look

Last week, I put together an analysis of SQL Saturdays.  There were a lot of great comments, including one by Jack Corbett, who was interested in seeing a closer analysis of how far attendees tend to travel:

Most people from Maine aren’t going to anything outside of New England, and most aren’t going to go south of Boston. Seems to limit opportunities for attendees. Instead of analyzing the distance between locations, let’s look at how far non-speakers are traveling to go to events and limit based on that.

Fortunately(?), I happen to have just the data set.  PASS has made certain Summit and SQL Saturday data available as part of the SQL Server Launch 2016 Discovery Days.  Now, let me warn you:  this is going to be a long blog post.  But hey, it’s Friday, so cozy on up with your beverage of choice and let’s get data scienceing.

Dirty Data Done Dirt Cheap

Airing of Grievances

Before I begin, allow me to perform the data science Airing of Grievances.  This is an important part of data analysis which most people gloss over, instead jumping right into the “clean up the dirty data” phase.  But no, let’s revel in its filth for just a few moments.

  • There are some strange data type choices here.  For example, UserGUID (which is the deidentifier for individuals) is a uniqueidentifier on PASSMembersWithGeolocation, but is a different type in SQLSatRegistrations.  There are some other data type choices which aren’t ideal (for example, the SQL Saturday number is a float?!), but that’s not a huge deal.
  • What is a huge deal is that some of the postal codes are wrong.  If you look at New Hampshire addresses, the postal codes have leading 0s lopped off.  That came back to burn me in my first analysis, requiring me to re-run my geolocation function.
  • There is no breakout by event; we only see a denormalized junction of SQL Saturday events and attendees.  That can make calculations a bit harder, so I’d like to normalize this a bit.
  • The second-biggest issue is that the data set is, in my opinion, incomplete.  There are a few reasons why:
    • We only have data for people who registered, not for people who attended.  There might be some pattern in people who registered but did not attend, but we won’t be able to see that.  There are also going to be people who attended but did not register; these day-of attendees aren’t going to be in the system, but we let them in anyhow if we have room.  I think those attendees would skew toward “local” rather than “distant” drive times, meaning that you’re not going to travel 500 miles to sneak into an event for which you didn’t register…usually.
    • There is no separation of speakers, sponsors, and “normal” attendees.  Speakers and sponsors travel across the country to visit these things (we’re weird that way), but the big question is, how far do non-speaking, non-sponsoring attendees normally travel?  I have no way of knowing for sure who is which, so I’ll try to answer that in the data cleansing section.
    • This analysis is only through about April of 2016.  There have been several events since, which could affect our analysis, but I don’t have that data.
    • There are lots of missing fields.  For example, there are 9 American SQL Saturdays which had no postal code set.  There are also a lot of attendees with missing data—they didn’t fill in the city or the postal code or the region (state).  There are even attendees with zero information filled out, and those are just going to be useless for our examination.
  • This leads to my biggest issue:  nasty data.
    • There are no controls on data entry.  If you want to set city to “Nope, not gonna tell” then it’ll save that.
    • Furthermore, all of the entries are free-form.  There are no checks to ensure that a postal code is valid, that a city exists, etc.
    • In short, there are a lot of misunderstandings, miscodings, mistrust, and misanthropy in this data set.

Limitations of Analysis

Given the issues in this data set, I have to make a couple tough calls with the data.  First, I’m only going to look at US attendees of US SQL Saturdays.  I wanted to look at US and Canada SQL Saturdays like I did in my last analysis, but I had a lot of geocoding problems with Canadian data, so I had to throw that away.  Also, I needed to throw away a lot of attendee data because I simply didn’t have enough to get a valid latitude and longitude.  So we’ll have to go ahead knowing that our analysis is incomplete and any conclusions we draw will be uncertain—but that’s the nature of statistical analysis, so we’ll trudge on nonetheless.

Doing Data Science

Initial Preparation

Grab the PassDataIndividualTables data set.  We’re going to create a database called PASSData and load the tables which are not named “_xlnm_FilterDatabase.”  Once that was in place, I did some basic data cleansing work.  I made UserGUID a uniqueidentifier across the board and changed SatNum to an INT instead of a FLOAT.  Then, I built a couple of tables:  dbo.SQLSaturday and dbo.SQLSaturdayAttendee. First, dbo.SQLSaturday:

CREATE TABLE dbo.SQLSaturday
(
	EventNumber INT NOT NULL,
	EventDate DATE NULL,
	EventName NVARCHAR(50) NOT NULL,
	StreetAddress NVARCHAR(100) NOT NULL,
	City NVARCHAR(100) NOT NULL,
	Country NVARCHAR(100) NOT NULL,
	PostalCode VARCHAR(14) NULL,
	Latitude DECIMAL(8,4) NULL,
	Longitude DECIMAL(8,4) NULL,
	GeoPoint AS GEOGRAPHY::Point(Latitude, Longitude, 4326),
	CONSTRAINT [PK_SQLSaturday] PRIMARY KEY CLUSTERED(EventNumber)
);

INSERT INTO dbo.SQLSaturday
(
	EventNumber,
	EventDate,
	EventName,
	StreetAddress,
	City,
	Country,
	PostalCode
)
SELECT DISTINCT
	s.SatNum,
	ss.EventDate,
	s.EventName,
	CASE
		WHEN s.StreetAddress IS NULL AND s.VenueName = 'N/A' THEN ''
		WHEN s.StreetAddress IS NULL AND s.VenueName <> 'N/A' THEN s.VenueName
		ELSE s.StreetAddress
	END,
	s.City,
	s.Country,
	s.PostalCode
FROM dbo.SQLSatRegistrations s
	OUTER APPLY
	(
		SELECT TOP(1)
			EventDate
		FROM dbo.SQLSatSessions sess
		WHERE
			sess.SatNum = s.SatNum
	) ss
WHERE
	s.Country IN ('United States');

--Tacoma 2009 and Jacksonville 2008 (#20) did not have dates set.  Set them manually.
UPDATE dbo.SQLSaturday SET EventDate = '2008-09-12' WHERE EventNumber = 20 AND EventDate IS NULL;
UPDATE dbo.SQLSaturday SET EventDate = '2009-04-01' WHERE EventNumber = 11 AND EventDate IS NULL;

I added Latitude and Longitude so that I could geocode data, and then a Geography data type so that I could calculate distances. Notice that there were two SQL Saturdays which did not have dates. I’m pretty sure that Tacoma (event #11) never took place, but I updated it and Jacksonville’s second 2008 event regardless, just to make sure everything was accurate.

Next up, I want to create a table for attendees.

CREATE TABLE dbo.SQLSaturdayAttendee
(
	UserGUID UNIQUEIDENTIFIER NOT NULL,
	City NVARCHAR(50) NULL,
	Region NVARCHAR(50) NULL,
	Country NVARCHAR(50) NULL,
	PostalCode NVARCHAR(50) NULL,
	NumberAttended INT NOT NULL,
	Latitude DECIMAL(8,4) NULL,
	Longitude DECIMAL(8,4) NULL,
	GeoPoint AS GEOGRAPHY::Point(Latitude, Longitude, 4326),
	CONSTRAINT [PK_SQLSaturdayAttendee] PRIMARY KEY CLUSTERED(UserGUID)
);

INSERT INTO dbo.SQLSaturdayAttendee
(
	UserGUID,
	City,
	Region,
	Country,
	PostalCode,
	NumberAttended
)
SELECT
	p.UserGUID,
	p.City,
	p.Region,
	p.Country,
	p.PostalCode,
	COUNT(1)
FROM dbo.SQLSatRegistrations r 
	INNER JOIN dbo.PASSMembersWithGeolocation p 
		ON r.UserGUID = p.UserGUID 
WHERE
	r.Country in ('United States')
	AND
	(
		(p.city IS NOT NULL AND p.region IS NOT NULL)
		OR (p.postalcode IS NOT NULL)
	)
	AND ISNULL(p.Country, 'United States') = 'United States'
GROUP BY
	p.UserGUID,
	p.City,
	p.Region,
	p.Country,
	p.PostalCode;

--Fix cases when postal codes had leading 0s lopped off.
UPDATE dbo.SQLSaturdayAttendee
SET
	PostalCode = RIGHT('00000' + PostalCode, 5)
WHERE
	LEN(PostalCode) < 5;

I have the same Latitude, Longitude, and GeoPoint attributes, and we’ll make use of those later. I also put on a NumberAttended attribute, which counts the number of SQL Saturdays a person has registered for. I’m limiting the attendee set to those who attended events in the United States (r.Country) and who have either City+Region or PostalCode set. Then, I’m going to fix postal codes, adding leading 0s to ZIP Codes missing them.

Quick Analysis

We’re far enough along that I can do some quick data set analysis and see what we’ve got.

--Basic data
SELECT COUNT(1) FROM dbo.SQLSatRegistrations;
SELECT COUNT(1) FROM dbo.SQLSatRegistrations WHERE Country IN ('United States');
SELECT COUNT(1) FROM dbo.SQLSatRegistrations r INNER JOIN dbo.SQLSaturdayAttendee a ON r.UserGUID = a.UserGUID;
SELECT COUNT(1) FROM dbo.SQLSaturdayAttendee;

BasicCounts

There are 152,000 registrants in the data set, of which 101,164 registered for United States events.  Our data set contains 33,419 distinct attendees, who make up 65,682 of the 100K United States event registrants.

Geocoding Time!

I’m going to do three separate rounds of geocoding.  First, let’s geocode the SQL Saturday events.  There are only about 300 of them, so I can use the same patterns that I used in my previous post.  So let’s move over to some R code for a bit.

install.packages('RDSTK', repos = "http://cran.us.r-project.org")
install.packages("ggmap")
install.packages("RODBC")
install.packages("dplyr")
install.packages("httr")
install.packages("rjson")
install.packages("lubridate")
library(RDSTK)
library(ggmap)
library(RODBC)
library(dplyr)
library(lubridate)

conn <- odbcDriverConnect("Driver=SQL Server;Server=.;Initial Catalog=PASSData;Provider=SQLNCLI11.1;Integrated Security=SSPI")
sqlsats <- sqlQuery(conn, "SELECT EventNumber, StreetAddress, City, Country, ISNULL(PostalCode, '') AS PostalCode from PASSData.dbo.SQLSaturday")
sqlsats$GeoLookup <- paste(sqlsats$StreetAddress, sqlsats$City, sqlsats$Country, sqlsats$PostalCode, sep = " ")

# Perform Google Maps geocoding.  Google Maps provides 1000 geocoding hits per day for free.
# Our data set is ~300 observations, so we can do that without issue.
sqlsats <- cbind(sqlsats, t(sapply(sqlsats$GeoLookup, geocode, USE.NAMES=FALSE)))

#A few records failed to get lat/long
#I will handle them as one-off updates in SQL.
sqlsats
sqlsats.good <- filter(sqlsats, !is.na(lon))
sqlsats.good
# Via http://stackoverflow.com/questions/14334840/how-to-insert-a-dataframe-into-a-sql-server-table
# Not a great practice, but for 300 rows, it'll do.
cmd <- paste("UPDATE PASSData.dbo.SQLSaturday SET Latitude = ",sqlsats.good$lat,", Longitude = ",sqlsats.good$lon," WHERE EventNumber = ", sqlsats.good$EventNumber, sep="", collapse=";")
result <- sqlQuery(conn, cmd, as.is=TRUE)

The first dozen lines or so are loading lots of packages. I’m not going to use all of them for this first section, but we’ll need them later. Then, we make a connection to SQL Server and grab data from the SQLSaturday table. I create a GeoLookup column to make lookups a bit easier.

The next line, where we do the geocoding, is a really cool thing you can do in R. Basically, I want to take my sqlsats data frame and append latitude and longitude to it. That’s what cbind (column bind) gives us. To understand the t(sapply(…, geocode, …)) bit, you have to work from the inside out. The geocode function is what we use to get latitude and longitude from Google Maps given an address. We want to apply that function to every observation in the sqlsats data frame (in T-SQL parlance, every row in the table), so we use the sapply variant of the apply function, and use the t() function to transpose the results so we can append them to our sqlsats data frame.

From there, I filter out any SQL Saturdays missing longitude and update my SQLSaturday table with the longitudes and latitudes we grabbed.

As mentioned, there are some SQL Saturdays missing latitude and longitude, so I had to do lookups manually. There were also a couple of SQL Saturdays which were miscoded because they didn’t have states listed.

--Manual cleanup
UPDATE dbo.SQLSaturday SET Latitude = 30.2782, Longitude = -81.511850 WHERE StreetAddress = '1 UNF Drive';
UPDATE dbo.SQLSaturday SET Latitude = 41.6618, Longitude = -91.5365 WHERE StreetAddress = '101 Jessup Hall';
UPDATE dbo.SQLSaturday SET Latitude = 45.5026, Longitude = -73.5735 WHERE StreetAddress = '2000 McGill College, 5e étage';
--Columbia was marked as W. Columbia in New York, rather than Columbia South Carolina
UPDATE dbo.SQLSaturday SET Latitude = 33.9518, Longitude = -81.1157 WHERE City = 'West Columbia';
--Richmond was miscoded as well for 2010.
UPDATE dbo.SQLSaturday SET Latitude = 37.6488, Longitude = -78.7015 WHERE EventNumber = 30;

Next up, I want to geocode the attendee data. I’m going to have to do this a different way for a couple of reasons. First, we’re going to have way more than 1000 attendees, so I can’t use Google Maps. Instead, I’m going to use the Data Science Toolkit to perform free geocoding. It’s slower than the Google API, but at least I don’t have to pay…

I’m going to get two subsets of attendees: those who put in a valid ZIP Code (and I can throw out everything else because typically, ZIP Codes are more distinctive than cities), and those who put in valid-ish city + region combos.

--Postal codes
SELECT DISTINCT 
	a.Country,
	a.PostalCode
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.PostalCode IS NOT NULL;

--Approximately valid cities
SELECT DISTINCT
	a.City,
	a.Region,
	a.Country
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.PostalCode IS NULL
	AND LEN(City) > 2
	AND LEN(Region) > 1
	AND City NOT LIKE '%[0-9]%'				--Some people put in street addresses
	AND Region NOT LIKE '%[0-9]%'			--Some people put in strange values
	--Some people are very regional
	AND Region NOT IN ('Midwest', 'North', 'South', 'North America', 'mid-atlantic', 'East', 'IN', 'this one', 'Mountain West', 'NA', 'N/A')
	AND City NOT LIKE '%@%'					--Some people put in e-mail addresses
	AND City <> 'N/A'						--Some people don't like listing city
	AND City <> 'NA'
ORDER BY
	a.City,
	a.Region;

There are 6544 distinct postal codes in the data set, and 2406 distinct city-region pairings, adding up to 8950 lookups we’ll need to do. These 8950 lookups will apply to upwards of 33,273 attendees (which you can get by removing the DISTINCT clauses). At this point, I want to mention that we’re down to about 30K attendees from our original 93K data set (that is, there were 93,838 unique UserGUIDs in dbo.SQLSatRegistrations). If I were trying to do predictive analysis, I’d start to get really concerned here because we’re throwing away 2/3 of the attendee set. But let’s soldier on, because 30K attendees should still at least give us some idea.

Geocoding this time, as I mentioned, will be a bit trickier. Let’s get to the R:

# From http://stackoverflow.com/questions/22887833/r-how-to-geocode-a-simple-address-using-data-science-toolbox
# My own modifications include the if-else block for lookup failures
geo.dsk <- function(addr){ # single address geocode with data sciences toolkit
  require(httr)
  require(rjson)
  url      <- "http://www.datasciencetoolkit.org/maps/api/geocode/json"
  response <- GET(url,query=list(sensor="FALSE",address=addr))
  json <- fromJSON(content(response,type="text"))
  if(length(json['results'][[1]]) > 0) {
    loc  <- json['results'][[1]][[1]]$geometry$location
    return(c(address=addr,long=loc$lng, lat= loc$lat))
  } else {
    return(c(address='',long=0,lat=0))
  }
}

zips <- sqlQuery(conn, "SELECT DISTINCT a.Country, a.PostalCode FROM PASSData.dbo.SQLSaturdayAttendee a WHERE a.PostalCode IS NOT NULL;")
zips$CheckString = paste(zips$PostalCode, zips$Country, sep = " ")
zips$lat <- 0
zips$long <- 0

result <- do.call(rbind,lapply(as.character(zips$CheckString),geo.dsk))
zips <- data.frame(result)
zips
head(zips)

sqlSave(conn, zips, tablename="ZipStage")

cities <- sqlQuery(conn, "SELECT DISTINCT a.City, a.Region, ISNULL(a.Country, '') AS Country FROM PASSData.dbo.SQLSaturdayAttendee a WHERE a.PostalCode IS NULL AND LEN(City) > 2 AND LEN(Region) > 1 AND City NOT LIKE '%[0-9]%' AND Region NOT LIKE '%[0-9]%' AND Region NOT IN ('Midwest', 'North', 'South', 'North America', 'mid-atlantic', 'East', 'IN', 'this one', 'Mountain West', 'NA', 'N/A') AND City NOT LIKE '%@%' AND City <> 'N/A' AND City <> 'NA' ORDER BY a.City, a.Region;")
cities$CheckString = paste(cities$City, cities$Region, cities$Country, sep = " ")
cities$lat <- 0
cities$long <- 0

result <- do.call(rbind,lapply(as.character(cities$CheckString),geo.dsk))
cities <- data.frame(result)
cities

sqlSave(conn, cities, tablename="CitiesStage")

I copy-pasted the geo.dsk function from Stack Overflow and added in an if-else statement to check to see if json[‘results’] has values. Basically, we’re hitting the Data Science Toolkit’s geocoding API endpoint with the partial address (postal code or city + state) we have, and then we’re shredding the returned JSON to get latitude and longitude. If the lookup failed for whatever reason, I return a latitude and longitude of 0.

The first data set I pull out is ZIP Codes. I first create a data frame from my SQL query, add a CheckString variable to paste together the address components, and then call geo.dsk to perform lookups for each CheckString in zips. The results get converted into a data frame, which becomes my new zips, and I’ll use RODBC’s sqlSave function to save the data in a staging table that I’ll call ZipStage. We do the same thing for cities.

Now that I have latitudes and longitudes, I’m going to import them from my staging table into the SQLSaturdayAttendee table. Note that I didn’t paste together a massive update statement for the 9000-ish results; instead, I put them in a staging table, as that’s going to be a lot more efficient than doing 9000 update statements.

UPDATE a
SET
	Latitude = z.lat,
	Longitude = z.long
FROM dbo.SQLSaturdayAttendee a
	INNER JOIN master.dbo.ZIPStage z
		ON CONCAT(a.PostalCode, ' ', a.Country) = z.address

UPDATE a
SET
	Latitude = c.lat,
	Longitude = c.long
FROM dbo.SQLSaturdayAttendee a
	INNER JOIN master.dbo.CitiesStage c
		ON CONCAT(a.City, ' ', a.Region, ' ', a.Country) = c.address
		AND a.PostalCode IS NULL

--Manual cleanup of Columbus, Ohio, which was geocoded to somewhere around Berea
UPDATE dbo.SQLSaturdayAttendee SET Latitude = 39.9612, Longitude = -82.9988 WHERE City = 'Columbus' AND Region IN ('Ohio', 'OH') AND PostalCode IS NULL;

One interesting thing I found was that the geocoding service put Columbus, Ohio up near Cleveland, which skewed Columbus results significantly. I’m not sure why it did that, but it leads me to believe that there are probably a few other geocoding errors that I didn’t catch. I don’t think they would radically alter the results, however, so I didn’t spend any more time looking for them.

Who Speaks?

One consideration I had with this data set is that I didn’t want to include speakers and sponsors in the attendee set.  Ideally, I’d keep them separated because their goals and motivations are different.  For this data set, I’m really interested in seeing behavior patterns of non-speaker, non-sponsor attendees.  But how do I figure out who’s an attendee and who’s a speaker when the data set is completely anonymized and there aren’t any flags for speakers or sponsors of events?

Twin Peaks

My expectation is that speaker/sponsor behavior will differ from “normal” attendee behavior.  Here’s a quick data sample to show my conjecture in action.  I’m going to query everybody in ZIP Code 27703 (which is where I live).  See if you can figure out which one of these is me…

SELECT
	*
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.PostalCode = '27703'
ORDER BY
	a.NumberAttended DESC;

27703Distribution

If I try to generalize this phenomenon, I’ll look for how many people attended (or at least registered for) a particular number of SQL Saturday events.  Ideally, what I’d find is a lot of people with 1, 2, or 3 events; then a dip; then an increasing number of people at some point; and finally another dip.

I’m going to use a quick T-SQL query to get the raw data, and then analyze it in R because R is perfect for this type of quick analysis.

#Figure out the cutoff for likely speakers.
cmd <- "SELECT NumberAttended, COUNT(1) AS People FROM PASSData.dbo.SQLSaturdayAttendee GROUP BY NumberAttended ORDER BY NumberAttended;"
attendee.breakdown <- sqlQuery(conn, cmd)
#A vast majority of entrants have been to one SQL Saturday, so the raw counts are hard to read.
barplot(xlab="Number Of SQL Saturdays Attended", attendee.breakdown$People, names=attendee.breakdown$NumberAttended)
#Take the log of the distribution to see the pattern a bit better.  What we see is an exponential
#decay pattern.  This is unfortunate for us because we were hoping to find a two-peaked distribution.
barplot(xlab="Number Of SQL Saturdays Attended", log(attendee.breakdown$People), names=attendee.breakdown$NumberAttended)

We’re not looking for pretty; we’re looking for fast. If I wanted pretty, I’d pull down ggplot2 and spend a while sprucing up the graphics. Anyhow, here’s what you get when you run the first barplot:

AttendedCount

That’s really hard to read because such a high percentage of attendees have been to just one SQL Saturday. So next up, I’m going to take the log of the number of people at each attendance count. This will give us a better idea of the underlying distribution of the data.

AttendedCountLog

I believe the appropriate response at this point is, “Well, shoot.”  We see a fairly smooth exponential decrease as the number of SQL Saturdays attended goes up.  There’s a little blip around the 13-14 mark, but you’d have to do some major squinting to make out a second distribution in the data.  As a result, I don’t have a great bifurcation mark, and so I’m picking a quick and dirty number:  9.  If you’ve been to at least 9 SQL Saturdays, you’ve probably gone from “casual attendee” to speaker (or at least active volunteer).  It’s easier to think of counter-examples in places like Jacksonville, which have had quite a few more SQL Saturdays than places like Omaha, but I think 9 is about as good a cutoff point as any.

Analyzing Attendees

So now, after about 3000 words, we get to the main event.  Who’s really going to these things?  As usual, I’m going to start off with a T-SQL statement which frames my data nicely, and then perform analysis using R.

SELECT
	s.EventNumber,
	s.EventName,
	s.EventDate,
	s.City,
	a.UserGUID,
	a.NumberAttended,
	s.GeoPoint.STDistance(a.GeoPoint) / 1609.344 AS DistanceInMiles
FROM PASSData.dbo.SQLSatRegistrations r
	INNER JOIN PASSData.dbo.SQLSaturday s
		ON s.EventNumber = r.SatNum
	INNER JOIN PASSData.dbo.SQLSaturdayAttendee a
		ON r.UserGUID = a.UserGUID
WHERE
	a.Longitude IS NOT NULL
  AND (s.GeoPoint.STDistance(a.GeoPoint) / 1609.344) < 3000
	AND a.NumberAttended < 9;

This is my data set, which returns 55019 rows. This means that we lost approximately half of our registrants as a result of data cleansing. Again, this should make us quite circumspect about predicting results, but I think we have enough data to do some descriptive analytics without making fools of ourselves.

One stipulation I want to make is that I’m cutting off any entries over 3000 miles. My belief is that nobody is flying cross-country to visit a SQL Saturday (again, excluding speakers and sponsors). The only exception I could see is Hawaii, so I’m probably skewing that result a bit…but even then, most of the long-distance travelers are probably speakers and sponsors going on vacation. If the distance I came up with is more than 3000 miles, I’m just treating it as bad data.

#Get non-speaker attendee distance from each event.
cmd <- "SELECT
	s.EventNumber,
	s.EventName,
	s.EventDate,
	s.City,
	a.UserGUID,
	a.NumberAttended,
	s.GeoPoint.STDistance(a.GeoPoint) / 1609.344 AS DistanceInMiles
FROM PASSData.dbo.SQLSatRegistrations r
	INNER JOIN PASSData.dbo.SQLSaturday s
		ON s.EventNumber = r.SatNum
	INNER JOIN PASSData.dbo.SQLSaturdayAttendee a
		ON r.UserGUID = a.UserGUID
WHERE
	a.Longitude IS NOT NULL
  AND (s.GeoPoint.STDistance(a.GeoPoint) / 1609.344) < 3000
	AND a.NumberAttended < 9;"
attendee <- sqlQuery(conn, cmd)

From here until I get through the results, I’m going to throw in a bunch of images and single lines of code. I think the images describe what the code is doing, so I’ll mostly add color commentary instead of function call descriptions.

boxplot(attendee$DistanceInMiles, data=attendee, main="Attendee Distance Traveled", horizontal=TRUE)

AttendeeDistanceTraveled

This box and whisker plot shows a couple of things: the bottom 75% of attendees don’t travel very far, and there are a lot of outliers in this data set. As useful as this visualization is, I also want to get a summary of results.

summary(attendee$DistanceInMiles)

TotalSummary

The median distance is 19 miles, and the mean distance is 94 miles. 75% of attendees travel less than 70 miles in this data set. So where do we get 400-600 miles? Maybe it’s the 90th or 95th percentile.

quantile(attendee$DistanceInMiles, c(0.25, 0.5, 0.75, 0.9, 0.95))

TotalQuantile

95% of attendees travel under 400 miles. But how has this changed over time? I’m going to use the year function in lubridate and apply it to my distance in miles plot to group by year.

boxplot(attendee$DistanceInMiles~year(attendee$EventDate), main="Attendee Distance Traveled By Year", horizontal=TRUE)

AttendeeDistanceTraveledByYear

This graphic is packed, but we can see a couple of things. First, the median has dropped a bit since 2007-2008. Second, the 75th percentile has also dropped a bit. I think both of these make a lot of sense: as we get more events, people don’t have to travel as far to attend a SQL Saturday. If the only one in your region is 300 miles away, you can justify going that far once a year for a unique event. But if there are 6 in your region over the course of a year, why not just pick the closest one or two?

Still, I want to see some hard numbers, so I’m going to call the summary function for each year.

tapply(attendee$DistanceInMiles, year(attendee$EventDate), summary)
$`2007`
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   2.896   13.410   93.150  229.600  182.700 2147.000 

$`2008`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.2401   10.5600   27.3500  159.9000  118.8000 2515.0000 

$`2009`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.2845    9.6410   19.2300  116.2000   73.5600 2473.0000 

$`2010`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.1588   10.1300   20.5500  120.6000   95.0500 2717.0000 

$`2011`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.6631   10.5800   23.2100  123.3000  106.2000 2686.0000 

$`2012`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.1913    9.7210   21.3200  113.0000   92.9900 2843.0000 

$`2013`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    9.3650   19.7400  103.6000   72.5000 2710.0000 

$`2014`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    8.5420   17.3000   84.3200   60.1100 2694.0000 

$`2015`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    8.1410   16.5700   74.9300   49.7500 2700.0000 

$`2016`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    9.3220   19.8800   81.8100   57.3300 2685.0000 

The mean has dropped considerably from over 110-120 miles down to 75-85 since 2014. 75% of attendees now won’t go more than about an hour for an event—or, maybe phrased differently, events are finding plenty of local attendees. Let’s see what it looks like at the 90th and 95th percentiles.

tapply(attendee$DistanceInMiles, year(attendee$EventDate), quantile, type=8, probs=c(0.25, 0.5, 0.75, 0.9, 0.95))
$`2007`
       25%        50%        75%        90%        95% 
  13.29615   93.14978  185.26700  865.47610 1025.75327 

$`2008`
      25%       50%       75%       90%       95% 
 10.56431  27.35487 118.84406 457.74800 823.98637 

$`2009`
       25%        50%        75%        90%        95% 
  9.616829  19.226355  74.048157 204.863135 466.467323 

$`2010`
      25%       50%       75%       90%       95% 
 10.12820  20.55441  95.39654 224.65303 546.36305 

$`2011`
      25%       50%       75%       90%       95% 
 10.57865  23.21103 106.20535 235.10075 571.08523 

$`2012`
       25%        50%        75%        90%        95% 
  9.717188  21.323516  92.995248 250.098624 450.006102 

$`2013`
       25%        50%        75%        90%        95% 
  9.365244  19.742029  72.515673 215.197417 439.364723 

$`2014`
       25%        50%        75%        90%        95% 
  8.541756  17.301598  60.168109 177.018031 316.380895 

$`2015`
       25%        50%        75%        90%        95% 
  8.141225  16.571739  49.746705 155.265559 259.042488 

$`2016`
       25%        50%        75%        90%        95% 
  9.320905  19.878521  57.334977 155.104415 279.424947 

Okay, I mentioned I’d focus mostly on color commentary. Here’s an exception: I specified type=8 for my quantile function. Which type you choose can make a huge difference in the calculation. The quantile documentation notes that Hyndman and Fan recommend type 8 when you don’t have a reason to go with a different calculation, but you can use any of the continuous calculation types.  They will differ a bit (and you can see differences between what summary provided us and what quantile gives us), but for this data set, I don’t think you get any crazy differences by selecting type 7 or 9.
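If you want a feel for how much the type parameter matters, here is a quick, standalone example.  The distances vector is made up for illustration and is not part of the PASS data set.

#Compare R's default quantile calculation (type 7) against type 8 on a small, skewed sample.
distances <- c(2, 5, 8, 12, 19, 25, 40, 75, 150, 600)
quantile(distances, probs = c(0.5, 0.75, 0.9, 0.95), type = 7)
quantile(distances, probs = c(0.5, 0.75, 0.9, 0.95), type = 8)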

Finally, this leads me to an interesting question: are we getting more registrants per event now than before? If we’re seeing event growth, attendee growth, and distance shrinkage, that indicates that SQL Saturdays are becoming more popular, drawing in people who previously wouldn’t travel the 20 miles across town to spend a day learning. Back to SQL Server once more.

SELECT
	YEAR(s.EventDate) AS EventYear,
	COUNT(DISTINCT s.EventNumber) AS NumberOfEvents,
	COUNT(1) AS NumberOfRegistrants,
	1.0 * COUNT(1) / COUNT(DISTINCT s.EventNumber) AS AverageAttendance
FROM dbo.SQLSaturday s
	INNER JOIN dbo.SQLSatRegistrations r
		ON s.EventNumber = r.SatNum
GROUP BY
	YEAR(s.EventDate)
ORDER BY
	EventYear DESC;

RegistrantAveragePerYear

This time I included anybody who is in the list, regardless of whether we could geocode that person’s location. Given that 2016 was only about a third of the way over, I’m not sure we can put much credence into those registrant counts. But we have seen an increase over the years. It’s not a major increase, but it is an increase.

Conclusions

I’ve hedged my bets a few times during this post because I know the data’s incomplete, and there’s a risk that the points of the data set I’m missing would conclusively prove me wrong.  Nevertheless, I think the likelihood of those extra data points being radically different than the ~50% that we were able to keep is fairly low, so I’ll make a couple of cautious conclusions.

First, non-speakers and non-sponsors do not appear to travel very far to SQL Saturdays.  Looking at the yearly quantile breakout above, by the time we get to 2014, people have become a bit insular in their SQL Saturday attendance.  The median traveler (after that initial event in Orlando in 2007) has never traveled more than about 20 miles.

Also, I think the 600 mile radius is too large, based on the data set.  Looking at the 95th percentile of travelers, those people haven’t gone more than 400 miles on average since 2013, and the number has dropped down to the 260-280 range.  Also, the 95th percentile, even in a heavily skewed distribution like this one, is looking far out there.  Even if you scale it back to just the 90th percentile (which is still considered an edge case area), attendees since 2014 won’t even go 200 miles.  That means that there could be some overlap between Cleveland and Columbus (which, having visited both the past couple of years, I do see), but there’d be almost no overlap between Cleveland and Cincinnati, and you’d be stretching to see much overlap between Columbus and Louisville.

But my real conclusion is this:  grab the data set and draw your own conclusions.  I’m working through this stuff the same as anybody else could, and we’re in a community full of people who know how to analyze data.  Maybe you’ll find something I missed which either bolsters or undercuts one of my points; either way, I want to know and I want to push you to dig just a little deeper than I did.  (But hopefully what you find won’t completely invalidate my points…)

Become A Better Speaker

That’s a big title.  Expect “(on the margin)” in tiny font after the title.

What Other People Say

Let me start off with some great resources, and then I’ll move to a few things that have helped me in the past.

Troy Hunt

Troy is a master of preparation.  Check out the prep work he did for NDC.  I don’t remember if he linked to it in either of these posts, but he will schedule tweets for certain parts of his talk because he knows exactly when he’ll get to that point.  Also check out his speaker anti-patterns and fix them.  Of his anti-patterns, I most hate the Reader.  Please don’t read your slides; your audience can read faster than you can speak, and your audience can read the slides later.  Tell engaging stories, tie things together, and give people something they can remember when they do re-read those slides.  You’ll also enjoy his checklist of the 19 things he needs to do before the talk begins.

What I get from Troy:  Practice, practice, practice.  One thing I want to get better at is taping myself and forcing myself to listen to the recording.  There are verbal tics which you can only find when you listen to yourself.

Brent Ozar

Brent is a top marketer in the SQL Server world.  Here’s some great advice from him on delivering technical presentations.  He’s absolutely right about giving people a link at the end telling them where to get additional materials.  He has additional information on how to handle slides, work with the audience, and get prepped for the session.  I think his music idea is interesting; at a couple of all-day cons we’ve hosted locally, the host tends to put on some early morning music before festivities begin.  It lightens up the mood considerably and works a lot better than a bunch of introverts staring quietly at their phones for 40 minutes until someone makes a move to begin.

What I get from Brent:  I’ve cribbed a few things from him.  All of my slide decks have two pieces of information at the end:  “To learn more, go here” and “And for help, contact me.”  The first bit points to a short-form URL I bought (http://CSmore.info), which redirects to the longer-form address.  On that page, I include slides, a link to my GitHub repo, and additional links and helpful information.  The second bit on contacting me includes my e-mail address and Twitter handle.

Paul Randal

Paul has a great Pluralsight course on communication.  This course isn’t just about presentations, but there are two modules, one on writing presentations and one on giving presentations.  I highly recommend this course.

What I get from Paul:  Practice, prepare, and don’t panic.  Compare against my previous review of this course.  Things will inevitably go wrong, so take spare cables and extras of anything you can fit.  At this point, I even carry a mini projector in case of emergency.  I’ve not needed to use it but there might come a day.

Julia Evans

Julia has a fantastic blog post on improving talks and conferences.  I definitely like her point about understanding your target audience.  Her argument in favor of lightning talks is interesting, and I think for beginners, it’s a good idea.  For more experienced speakers, however, 10 minutes is barely an introduction, and sometimes I want a deep dive.  Those have to be longer talks just by their nature.

Another great point she makes is to give hard talks:  aim for beginners to grasp part of it but for experts to learn from it as well.  I absolutely love Kyle Kingsbury’s work too, and he helped me get a handle on distributed systems, but in a way that let me re-read his posts several months later and pick out points I never got before.

What I get from Julia:  Find your motivation and make your talks work better for a broader range of people.  I have one talk in particular on the APPLY operator which has the goal of making sure that pretty much anybody, regardless of how long you’ve been in the field, learns something new.  There are a couple of examples which are easier for new users to understand and a couple of examples which are definitely more advanced but still straightforward enough that a new user can get there (even if it does take a little longer).  Ideally, I’d like all of my talks to be that way.

What I Recommend

Here are a few recommendations that I’ll throw out at you.  I’m going to try not to include too much overlap with the above links, as I really want you to read those posts and watch those videos.

  • Practice!  Practice until you know the first five minutes cold.  There are some presenters who will practice a talk multiple times a day for several days in a row.  I’m not one of those people, but if you’re capable of it, go for it.
  • Record yourself.  Find all of those placeholder words and beat them out of yourself.  I don’t mean just “Uh,” “Er,” “Um,” “Ah,” and so on.  In my case, I have a bad habit of starting sentences with “So…”  I’m working on eliminating that habit.  Recordings keep you honest.
  • Tell interesting stories.  To crib from one of my 18th century main men, Edmund Burke, “Example is the best school of mankind, and they will learn at no other.”  Theory ties your work together, and stories drive the audience.  Stories about failure and recovery from failure are particularly interesting; that’s one of the core tenets of drama.
  • Prep and practice your demos.  If you’re modifying anything (databases, settings, etc.) over the course of your demo, have a revert script at the end or revert your VM.  That way, you won’t forget about it at the end of a great talk, give the talk again later (after you’ve forgotten that you never rolled everything back), and have your demos fail because you forgot.  Not that this has happened to me…
  • Speaking of failure, prepare for failure.
    • Have extra cables.  I have all kinds of adapters for different types of projectors.  I have VGA, DVI (though I’ve only seen one or two projectors which required this), and HDMI adapters for both of my laptops in my bag at all times.
    • Prepare to be offline.  If your talk can be done offline, you should do it that way.  Internet connections at conferences are pretty crappy, and a lot of demo failures can be chalked up to flaky networks.  This means having your slides available locally, having your demo scripts available locally, etc.
    • Have your slides and demo scripts on a spare drive, USB stick, or something.  If all else fails, maybe you can borrow somebody else’s laptop for the talk.  I had to do this once.  It was embarrassing, but I got through it and actually got good scores.  The trick is to adapt, improvise, and overcome.  And you do that with preparation and practice.
    • Have your slides and demo scripts available online.  I know I mentioned assuming that your internet connection will flake out, but if your system flakes out and someone lends you a laptop but can’t accept USB sticks (maybe it’s a company laptop), at least you can grab the slides and code online.
    • If you do need an internet connection, have a MiFi or phone you can tether to your laptop, just in case.  If you have two or three redundant internet sources, the chances of them all failing are much lower than any single one failing.
    • Have a spare laptop if you can.  That’s hard and potentially expensive, but sometimes a computer just goes bye-bye an hour before your presentation.
  • Install updates on your VMs regularly.  Do it two nights before your presentation; that way, if an update nukes your system, you have a day to recover.  Also, it reduces the risk that Windows 10 will pick your presentation as the perfect time to install 700 updates.  Very helpful, that Windows 10 is.
  • When in doubt, draw it out.  I have embraced touchscreens on laptops, bought a nice stylus, and love drawing on the screen.  I think it helps the audience understand where you’re going better than using a laser pointer, and sometimes you don’t have a whiteboard.  If you don’t like touchscreens, ZoomIt still works with a mouse.
  • Speaking of which, learn how to use ZoomIt or some other magnification tool.  Even if you set your fonts bigger (which yes, you need to do), you will want to focus in on certain parts of text or deal with apps like SQL Server Management Studio which have fixed-size sections.

There are tomes of useful information on this topic, so a single blog post won’t have all of the answers, but hopefully this is a start.

Spark, R, And Zeppelin

We are going to combine three of my interests today:  Spark, data analysis, and notebooks.  Unlike last week’s discussion of notebooks, I’m going to use Apache Zeppelin today.  In today’s post, we will load some restaurant data into Spark and then build a notebook to visualize some of that data.

Getting Started

If you have not done so, follow through my Getting Started with Spark guide.  This will get restaurant data loaded into Spark.  I’m going to assume that you’ve created a cluster, loaded the restaurant table (but make sure you have the version with no NAs!), and are ready to create a notebook.

Create A Notebook

Our first step will be to create an R notebook.

CreateNotebook

CreateRNotebook

My goal is to do some of the things that I did in my Touching on Advanced Topics post.  Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below.  As a result, I was only able to do some—but not all—of the anticipated work.  I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.

With that in mind, let’s start messing around.

Analysis

The Struggles

I think it might be more useful to start with the various roadblocks I hit while working on this blog post.  SparkR is a different paradigm, and as someone with a tenuous hold on R, I found it a lot harder than I wanted it to be. Here are a few of the things I had trouble with, in the hopes that the solutions will save you a bit of time in the future.

Overwritten Functions

If you install all of the packages from the Jupyter notebook, you’ll find that dplyr has a sql command which masks the Spark SQL command and will leave you in a bad place if you don’t know what’s going on.  If you get stuck in this bad place, you can qualify the function call with its package name; for Spark, the sql function is SparkR::sql.
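Here is a minimal sketch of what that looks like, assuming the sqlContext object that the notebook environment provides and the restaurants table I use later in this post.

#Once dplyr is attached after SparkR, an unqualified sql() call resolves to dplyr's version.
library(SparkR)
library(dplyr)

#Qualify the call so that it goes to Spark rather than dplyr.
restaurantsDF <- SparkR::sql(sqlContext, "SELECT COUNT(1) AS numberOfResults FROM restaurants")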

Unexpected Results

One of the first struggles I had was that certain operations return data frames rather than vectors, for example, mean.  I was able to use Wendy Yu’s great SparkR tutorial to figure out a way to get mean latitude and longitude from my data set.

Lack of Results

Another major problem I had was that Java is much stricter about data types than R, meaning that when I tried to perform some operation against my restaurants table (in which score is defined as a double), Java would fail with an exception saying that NA is not a valid numeric value.  No matter how I wrote my SQL statement, I was unable to ignore that NA data.  I even switched over to Python, thinking it a SparkR issue, but it turns out to be part of Spark.  I’m sure there’s some flag somewhere that I’m not setting which allows me to ignore those bad values, but I haven’t found it yet.

Data Frames Aren’t DataFrames

Another issue I ran into was when I started using ggplot2.  When I tried to hook up my Spark data frame (which is of type DataFrame), ggplot2 spat back the following error:

Error : ggplot2 doesn't know how to deal with data of class DataFrame

Fortunately, Jose Dianes has an article which discusses this exact problem.  You have to collect the distributed DataFrame before ggplot2 can understand what it is.

The Code

Okay, we spent enough time complaining about my failings; now let’s actually show what we can do.  Unlike the Advanced Topics notebook, I’m going to keep this fairly compact and not meander quite as much as I did (for pedagogical purposes, I promise).

Step one is to load the restaurant data:

restaurantsDF <- sql(sqlContext, "SELECT COUNT(1) AS numberOfResults, AVG(score) AS meanscore, ROUND(lat, 1) AS lat, ROUND(long, 1) AS long FROM restaurants GROUP BY ROUND(lat, 1), ROUND(long, 1)")

Note that in this version, I’m going to specify a narrowed-down query with the relevant attributes and filters I want.  I’ve found this to be preferable to the classic R approach of loading all the data with read.csv and importing dplyr and other tools to filter the data after the fact, particularly because of differences in data frame operations, as we’ll see below.

Step two is to install relevant packages:

install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("ggmap", repos = "http://cran.us.r-project.org")
library(ggplot2)
library(ggmap)
print('Finished loading libraries.')

I only need ggplot2 and ggmap because I did my transformations earlier.

Step three is where I calculate the mean latitude and longitude.  My latitude and longitude are technically off slightly because I’m taking the mean of pre-aggregated lat-long combos rather than the mean of the population, but the difference is so slight that it doesn’t make an appreciable difference.

r.lat <- collect(agg(restaurantsDF, mean=mean(restaurantsDF$lat)))$mean
r.long <- collect(agg(restaurantsDF, mean=mean(restaurantsDF$long)))$mean

With those calculated and collected, I’m able to pass them in as parameters to ggmap’s get_map function, so we can build a Google map.

#Set wakemap to work around the mean latitude and longitude values we calculated above.  We've set the zoom value to 10 so you can see the entire region.
wakemap <- get_map(location = c(lon = r.long, lat = r.lat), zoom = 10, maptype = "roadmap", scale = 2)

Once I have those, I need to convert my Spark DataFrame class object to a local R data frame.  I do that by calling the collect method.  Because I only need three of the four attributes, I’ll show off the select function as well.

restgrp <- collect(select(restaurantsDF, "meanscore", "lat", "long"))

This data frame is something ggplot2 can use.  So let’s run ggmap and fill in results:

# Now we can fill in ggmap with settings.  We want three function calls:
# scale_fill_gradient (to give us a visual cue of restaurants.  Pick good colors for low & high.)
# geom_text (to display the meanscore, giving us precise values.  Round meanscore to 1 spot after decimal)
# geom_tile (to display blocks of color.  Set alpha = 1)
ggmap(wakemap) +
  scale_fill_gradient(low="black",high="orange") +
  geom_text(data = restgrp, aes(x=long, y=lat, fill = meanscore, label = round(meanscore, 1))) +
  geom_tile(data = restgrp, aes(x=long, y=lat, alpha=1, fill=meanscore)) 

The end result looks a bit like this:

RestaurantHeatmap

If you want to get this as a notebook, I’ve made the R script available here.  You can publish Databricks notebooks, but they’re only public for a period of six months and I don’t want to have a broken link in a few months’ time.

Conclusions

Why Use Zeppelin?

Throughout all of this, I didn’t even mention one of the big advantages to using Zeppelin:  the built-in visualization components.  Zeppelin has made it easier for third parties to display graphics, and so visualizing a data set is as simple as running the display command.  For example, here’s a nice-looking histogram of mean restaurant scores by lat-long pair:

PlotMeanScoreHistogram

That’s a one-line operation, and you can change the plot options to perform other visualizations as well.  This type of visualization is production-worthy, so I give the Databricks team plaudits for making this available to us in such an easy-to-use format.
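For reference, that one-liner is just the notebook’s display function pointed at a Spark DataFrame; this is a sketch assuming the restaurantsDF data frame from earlier, with the histogram chosen from the plot options.

#Render the built-in visualization for the DataFrame; pick the histogram plot type from the chart options.
display(restaurantsDF)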

Final Thoughts

To be honest, I wanted this to be a blog post in which I described how easy the conversion from Jupyter and local R to Zeppelin and SparkR would be.  In reality, there’s a gap between the two.  There are some good articles to help you bridge that gap—I mentioned Wendy Yu’s and Jose Dianes’s—but this certainly is not an exercise in copy-paste; there’s a lot more going on here.  Nonetheless, I recommend getting a handle on SparkR, as the ability to perform data analysis on distributed compute is powerful, especially if you can’t get your boss to approve that desktop with 64 GB of RAM…